When I try to run the code below, I get the following error. I am following the tutorial on this page: https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184
  File "reviewsML.py", line 58, in <module>
    X_train, X_val, y_train, y_val = train_test_split(X, target, train_size = 0.50)
….
ValueError: Found input variables with inconsistent numbers of samples: [25707, 25000]
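For context, the two numbers in the message are the lengths of the inputs in the order they were passed, here the rows of X followed by the entries of target. Below is a minimal plain-Python sketch of that kind of check; the helper name `assert_same_length` and the sample data are made up for illustration and are not part of scikit-learn.

```python
def assert_same_length(*arrays):
    """Raise ValueError if the inputs do not all have the same length."""
    lengths = [len(a) for a in arrays]
    if len(set(lengths)) > 1:
        raise ValueError(
            "Found input variables with inconsistent numbers of samples: %r"
            % lengths
        )
    return lengths[0]


# Hypothetical, scaled-down stand-ins for X and target
features = [[0, 1], [1, 0], [1, 1]]
labels = [1, 0]

try:
    assert_same_length(features, labels)
except ValueError as err:
    message = str(err)
    print(message)
```

Running a check like this right after building X and target would show the mismatch before the split is attempted.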
Here is part of the code:
import codecs
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Read the reviews, one per line
reviews_train = []
for line in codecs.open('movie_data/full_train.txt', 'r', 'utf-8'):
    reviews_train.append(line.strip())

reviews_test = []
for line in codecs.open('movie_data/full_test.txt', 'r', 'utf-8'):
    reviews_test.append(line.strip())

REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def preprocess_reviews(reviews):
    # Lowercase, drop punctuation, and turn <br /> tags, hyphens and slashes into spaces
    reviews = [REPLACE_NO_SPACE.sub("", line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(" ", line) for line in reviews]
    return reviews

reviews_train_clean = preprocess_reviews(reviews_train)
reviews_test_clean = preprocess_reviews(reviews_test)
print(len(reviews_train_clean))

# Binary bag-of-words features
cv = CountVectorizer(binary=True)
cv.fit(reviews_train_clean)
X = cv.transform(reviews_train_clean)  # sparse document-term matrix, one row per review
X_test = cv.transform(reviews_test_clean)

# Labels: the first 12500 reviews are positive, the remaining 12500 negative
target = [1 if i < 12500 else 0 for i in range(25000)]
X_train, X_val, y_train, y_val = train_test_split(X, target, train_size=0.75)

# Train the classifier; the hyperparameter C adjusts the regularization
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    lr = LogisticRegression(C=c)
    lr.fit(X_train, y_train)
    print("Accuracy for C=%s: %s"
          % (c, accuracy_score(y_val, lr.predict(X_val))))
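Do you know what I am doing wrong?
I tried printing X.shape[0] and it gave me 25707,
but I don't know why, because the original train and test files each contain 25000 reviews.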
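One thing that may be worth checking (a guess, since the data files are not shown here): file objects returned by codecs.open split lines using the same rules as str.splitlines, which also treat characters such as U+0085 and U+2028 as line boundaries, whereas the built-in open only splits on \n, \r and \r\n. If some reviews contain such characters, codecs.open would yield more than one "line" per review. The sketch below compares the two counts on a small made-up file; `count_lines` and the sample text are hypothetical.

```python
import codecs
import os
import tempfile


def count_lines(path):
    """Return (lines seen by codecs.open, lines seen by built-in open)."""
    with codecs.open(path, 'r', 'utf-8') as handle:
        codecs_count = sum(1 for _ in handle)
    with open(path, 'r', encoding='utf-8') as handle:
        builtin_count = sum(1 for _ in handle)
    return codecs_count, builtin_count


# Made-up file with two reviews; the first contains U+2028 (LINE SEPARATOR)
sample = "great movie\u2028loved it\nterrible movie\n"
fd, sample_path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w', encoding='utf-8', newline='') as handle:
    handle.write(sample)

counts = count_lines(sample_path)
os.remove(sample_path)
print(counts)  # (3, 2)
```

If the two counts also differ for movie_data/full_train.txt, reading the file with the built-in open(path, encoding='utf-8') instead of codecs.open should give one entry per review.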