Nov-07-2019, 03:26 AM
I am working on a small test dataset and I am getting ValueError: Found input variables with inconsistent numbers of samples: [5, 6]. How can I make X and y have the same number of samples? I added the line
dataset.dropna(inplace=True)
to drop NA values so that the two samples become the same size. However, I still get the ValueError. The code is:
# Importing Libraries
import numpy as np
import pandas as pd
# Import dataset
dataset = pd.read_csv("../output.tsv", delimiter = '\t')
# library to clean data
import re
# Natural Language Tool Kit
import nltk
nltk.download('stopwords')
# to remove stopword
from nltk.corpus import stopwords
# for stemming purposes
from nltk.stem.porter import PorterStemmer
# Initialize empty array
# to append clean text
corpus = []
# clean the first 5 rows (reviews)
for i in range(0, 5):
    # column : "Review", row ith
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
    # convert all cases to lower cases
    review = review.lower()
    # split to array (default delimiter is " ")
    review = review.split()
    # creating PorterStemmer object to
    # take main stem of each word
    ps = PorterStemmer()
    # loop for stemming each word
    # in string array at ith row
    review = [ps.stem(word) for word in review
              if not word in set(stopwords.words('english'))]
    # rejoin all string array elements
    # to create back into a string
    review = ' '.join(review)
    # append each string to create
    # array of clean text
    corpus.append(review)
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
# Keep at most 9 features (the most frequent terms).
# "max_features" is a parameter to
# experiment with to get better results
cv = CountVectorizer(max_features = 9)
# X contains the features built from corpus (independent variable)
X = cv.fit_transform(corpus).toarray()
# y contains answers if review
# is positive or negative
y = dataset.iloc[:, 1].values
# Splitting the dataset into
# the Training set and Test set
from sklearn.model_selection import train_test_split
dataset.dropna(inplace=True)
print(X.shape)
print(y.shape)
# experiment with "test_size"
# to get better results
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)
print(X_train.shape)
print(y_train.shape)
The output from the code (for the X shape and y shape) is:
(5, 9)
(6,)
The error is ValueError: Found input variables with inconsistent numbers of samples: [5, 6]
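
One possible reading of the shapes above: the loop cleans only rows 0 to 4 (range(0, 5)), so corpus and X have 5 rows, while y is read from all 6 rows of dataset. The dropna call also runs after X and y have already been built, so it cannot change either array. Below is a minimal sketch of keeping the two aligned. It assumes a hypothetical 6-row frame in place of output.tsv, a label column named "Liked" (the real column name is not shown), and plain lowercasing as a stand-in for the full cleaning step.

```python
import pandas as pd

# Hypothetical 6-row stand-in for output.tsv (the real file is not shown).
# The "Liked" column name and all values are assumptions for illustration.
dataset = pd.DataFrame({
    "Review": ["Loved this place", "Crust is not good", None,
               "Great service", "Would not go back", "The food was tasty"],
    "Liked": [1, 0, 1, 1, 0, 1],
})

# 1. Drop incomplete rows BEFORE deriving anything from the frame, and
#    reset the index so positional lookups stay contiguous.
dataset = dataset.dropna().reset_index(drop=True)

# 2. Iterate over every remaining row instead of a hard-coded range(0, 5).
#    Lowercasing stands in for the regex/stopword/stemming steps.
corpus = [review.lower() for review in dataset["Review"]]

# 3. Take the labels from the same, already-cleaned frame.
y = dataset["Liked"].values

# Both counts now come from the same set of rows.
print(len(corpus), y.shape)
```

CountVectorizer returns one row per entry in corpus, so passing this corpus to cv.fit_transform should give an X whose row count matches y when both reach train_test_split.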
