ValueError: Found input variables with inconsistent numbers of samples: [25707, 25000]

Time: 2019-06-03 14:32:41

Tags: machine-learning scikit-learn python-3.6 linear-regression train-test-split

I get the following error when I try to run the code below. I am following the tutorial on this page: https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184


 File "reviewsML.py", line 58, in <module>
    X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)
….
ValueError: Found input variables with inconsistent numbers of samples: [25707, 25000]
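
The message means that the inputs passed to `train_test_split` do not have the same number of samples: here the first has 25707 rows and the second has 25000 entries. A minimal sketch of an equivalent check that could be run before splitting is shown here. `check_same_length` is a hypothetical helper, not part of scikit-learn, and its message only imitates the real one.

```python
def check_same_length(*arrays):
    """Raise ValueError if the given sequences do not all have the same length."""
    lengths = [len(a) for a in arrays]
    if len(set(lengths)) > 1:
        raise ValueError(
            "Found input variables with inconsistent numbers of samples: %r" % lengths
        )
    # All lengths agree (or nothing was passed): return the common length.
    return lengths[0] if lengths else 0


if __name__ == "__main__":
    features = [[0, 1]] * 5   # stand-in for X: 5 rows
    labels = [1, 0, 1, 0]     # stand-in for target: 4 labels
    try:
        check_same_length(features, labels)
    except ValueError as err:
        print(err)
```

For a SciPy sparse matrix such as `X`, `len` is not defined, so the comparison would be between `X.shape[0]` and `len(target)` instead.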

Here is part of the code:

import codecs
import re

# Read one review per line from the training and test files.
reviews_train = []
for line in codecs.open('movie_data/full_train.txt', 'r', 'utf-8'):
    reviews_train.append(line.strip())

reviews_test = []
for line in codecs.open('movie_data/full_test.txt', 'r', 'utf-8'):
    reviews_test.append(line.strip())

# Punctuation to delete outright, and patterns to replace with a space.
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")



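def preprocess_reviews(reviews):
    # Lowercase each review and delete punctuation.
    reviews = [REPLACE_NO_SPACE.sub("", line.lower()) for line in reviews]
    # Turn <br /> pairs, hyphens and slashes into spaces.
    reviews = [REPLACE_WITH_SPACE.sub(" ", line) for line in reviews]
    return reviews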

reviews_train_clean = preprocess_reviews(reviews_train)
reviews_test_clean = preprocess_reviews(reviews_test)
print(len(reviews_train_clean))

from sklearn.feature_extraction.text import CountVectorizer

# Binary bag-of-words: each column marks whether a word occurs in a review.
cv = CountVectorizer(binary=True)
cv.fit(reviews_train_clean)
X = cv.transform(reviews_train_clean)  # sparse matrix, one row per training line read
X_test = cv.transform(reviews_test_clean)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

target = [1 if i < 12500 else 0 for i in range(25000)]
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.75)
# Train one classifier per value of C (the inverse of the regularization strength).
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr = LogisticRegression(C=c)
    lr.fit(X_train, y_train)
    print("Accuracy for C=%s: %s"
          % (c, accuracy_score(y_val, lr.predict(X_val))))


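Do you know what I am doing wrong?

I tried printing X.shape[0] and it gave me 25707.

But I don't know why, because the original train and test files contain 25000 reviews.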
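
One possible explanation, offered as an assumption because the data files are not shown here: iterating over a file opened with `codecs.open` splits lines the way `str.splitlines` does, so characters such as `\x85` (NEL) or `\u2028` inside a review start a new "line", whereas the built-in `open` in text mode only breaks on `\n`, `\r` and `\r\n`. If some reviews contain such characters, the list ends up longer than the number of real lines in the file. The sketch below uses a made-up three-line sample file to compare the two counts. `count_lines_codecs`, `count_lines_builtin` and `write_sample` are hypothetical helper names.

```python
import codecs
import os
import tempfile


def count_lines_codecs(path):
    """Count lines the way the question's loop does, via codecs.open."""
    with codecs.open(path, 'r', 'utf-8') as f:
        return sum(1 for _ in f)


def count_lines_builtin(path):
    """Count lines with the built-in open, which breaks only on newline and carriage return."""
    with open(path, 'r', encoding='utf-8') as f:
        return sum(1 for _ in f)


def write_sample(path):
    # Made-up sample: three reviews, the second containing a NEL character (\x85).
    text = "good movie\nbad\x85movie\nfine\n"
    with open(path, 'w', encoding='utf-8', newline='\n') as f:
        f.write(text)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.txt")
        write_sample(sample)
        print("codecs.open:", count_lines_codecs(sample))
        print("built-in open:", count_lines_builtin(sample))
```

Running both counters on `movie_data/full_train.txt` would show whether this is what produces 25707. If the built-in count is 25000, reading the files with `open(path, 'r', encoding='utf-8')` instead of `codecs.open` should keep the rows of `X` aligned with `target`.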

0 answers:

No answers