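I want to supply the training and test datasets to the algorithm explicitly, instead of using train_test_split to randomly split the data into training and test sets.
I would also like to keep the reviews and their labels in the same file, both when training and when testing the model.
Can anyone suggest how to do this?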
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import average_precision_score
from sklearn.metrics import confusion_matrix
with open("/Users/xyz/Desktop/reviews.txt") as f:
    reviews = f.read().split("\n")
with open("/Users/xyz/Desktop/labels.txt") as f:
    labels = f.read().split("\n")
reviews_tokens = [review.split() for review in reviews]
onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(reviews_tokens)
X_train, X_test, y_train, y_test = train_test_split(reviews_tokens, labels, test_size=0.20, random_state=None)
lsvm = LinearSVC()
lsvm.fit(onehot_enc.transform(X_train), y_train)
accuracy_score = lsvm.score(onehot_enc.transform(X_test), y_test)
print("Accuracy score of SVM:" , accuracy_score)
review,label
Colors & clarity is superb,positive
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung,negative
review,label
The picture is clear and beautiful,positive
Picture is not clear,negative
Answer 0 (score: 0)
You can do exactly that, and the solution is quite simple:
X_train = reviews_tokens[:number_of_rows_of_train_data]
X_test = reviews_tokens[number_of_rows_of_train_data:]
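Do the same for y_train and y_test, slicing labels at the same index.
Of course, you need to know which rows in the file are for training and which are for testing.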
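A minimal sketch of that slicing, assuming the training rows come first in the file. The toy data and the name `n_train` are made up for illustration; only the slicing pattern carries over.

```python
# Toy stand-ins for the question's reviews_tokens and labels lists.
reviews_tokens = [
    ["picture", "is", "clear"],
    ["sound", "is", "poor"],
    ["great", "colors"],
    ["not", "bright"],
]
labels = ["positive", "negative", "positive", "negative"]

# Hypothetical boundary: rows before it are training data, the rest are test data.
n_train = 3

X_train, X_test = reviews_tokens[:n_train], reviews_tokens[n_train:]
y_train, y_test = labels[:n_train], labels[n_train:]

# Features and labels must be cut at the same index to stay aligned.
assert len(X_train) == len(y_train)
assert len(X_test) == len(y_test)
print(len(X_train), "training rows,", len(X_test), "test rows")
```

Because the same index is used for both lists, each review keeps its own label after the split.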
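If you want to keep the features and labels in the same file, that is no problem. You will need one extra step to separate the labels from the features. It would be much easier with pandas.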
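As a rough sketch of the pandas route, assuming the file has a header row with `review` and `label` columns like the samples in the question. The in-memory `csv_text` string stands in for a real file here; with an actual file you would pass its path to `pd.read_csv`.

```python
import io

import pandas as pd

# Stand-in for a file that holds a review and its label on each row.
csv_text = """review,label
The picture is clear and beautiful,positive
Picture is not clear,negative
"""

df = pd.read_csv(io.StringIO(csv_text))

# Separate the features (tokenized reviews) from the labels.
X = [review.split() for review in df["review"]]
y = df["label"].tolist()

print(X)
print(y)
```

This assumes that any review containing a comma is quoted in the file, since `read_csv` treats commas as column separators by default.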
Edit
With files like the ones you provided, you can get what you need like this:
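def load_data(filename):
    """Read a 'review,label' file into token lists and labels."""
    X = []
    y = []
    with open(filename) as file:
        file.readline()  # skip the "review,label" header row
        for line in file:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            # Split on the last comma only, so commas inside a review are kept.
            review, label = line.rsplit(',', 1)
            y.append(label)
            X.append(review.split())
    return X, y

X_train, y_train = load_data('train.txt')
X_test, y_test = load_data('test.txt')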
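To connect this back to the question's code, here is a hedged sketch of training on one set and scoring on the other without train_test_split. The small lists stand in for what load_data would return, and the encoder is fitted on the tokens of both sets, mirroring the question, where it is fitted on all reviews before the split.

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy stand-ins for the values load_data would return for the two files.
X_train = [
    ["The", "picture", "is", "clear", "and", "beautiful"],
    ["Picture", "is", "not", "clear"],
]
y_train = ["positive", "negative"]
X_test = [["Colors", "&", "clarity", "is", "superb"]]
y_test = ["positive"]

# Fit the encoder on every token so that test-only tokens get a column too.
onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(X_train + X_test)

# Train on the training set only, then score on the held-out test set.
lsvm = LinearSVC()
lsvm.fit(onehot_enc.transform(X_train), y_train)

accuracy = lsvm.score(onehot_enc.transform(X_test), y_test)
print("Accuracy score of SVM:", accuracy)
```

With so few rows the score itself is meaningless; the point is only that the train and test sets now come from the files you chose rather than from a random split.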