Question

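I want to supply the training and test datasets to the algorithm explicitly, instead of using train_test_split to randomly split the data into training and test sets.

I also want to keep the review text and its label in the same file, both for training and for testing the model.

Can anyone advise me on how to do this?

My code: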

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import average_precision_score
from sklearn.metrics import confusion_matrix

with open("/Users/xyz/Desktop/reviews.txt") as f:
    reviews = f.read().split("\n")
with open("/Users/xyz/Desktop/labels.txt") as f:
    labels = f.read().split("\n")

reviews_tokens = [review.split() for review in reviews]


onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(reviews_tokens)


X_train, X_test, y_train, y_test = train_test_split(reviews_tokens, labels, test_size=0.20, random_state=None)

lsvm = LinearSVC()
lsvm.fit(onehot_enc.transform(X_train), y_train)
accuracy_score = lsvm.score(onehot_enc.transform(X_test), y_test)

print("Accuracy score of SVM:" , accuracy_score)

Test.txt:

review,label
Colors & clarity is superb,positive
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung,negative

Train.txt:

review,label
The picture is clear and beautiful,positive
Picture is not clear,negative

Answer 1

Just do exactly what you describe. The solution is very simple:

X_train = reviews_tokens[:number_of_rows_of_train_data]
X_test = reviews_tokens[number_of_rows_of_train_data:]

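Do the same with labels to get y_train and y_test.

Of course, you need to know which rows of the file are for training and which are for testing.

If you want to keep the features and labels in the same file, that is no problem. You will need one extra step to separate the labels from the features. This would be much easier with pandas.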
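The pandas route mentioned above might look like the sketch below. It assumes the file has a review,label header row and that any review containing a comma is quoted. load_with_pandas is a hypothetical helper name, and the in-memory sample stands in for a real file such as Train.txt.

```python
import io

import pandas as pd


def load_with_pandas(source):
    """Hypothetical helper: read a 'review,label' CSV into tokens and labels."""
    # Assumption: one header row, and reviews containing commas are quoted.
    df = pd.read_csv(source)
    X = [review.split() for review in df["review"]]
    y = df["label"].tolist()
    return X, y


# In-memory stand-in for the training file, using the rows from the question.
sample = io.StringIO(
    "review,label\n"
    "The picture is clear and beautiful,positive\n"
    "Picture is not clear,negative\n"
)
X_train, y_train = load_with_pandas(sample)
print(X_train[1], y_train)
```

Passing a file path instead of the StringIO object should work the same way, since read_csv accepts either.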

Edit

With the files you provided, you can get what you need like this:

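def load_data(filename):
    """Read a 'review,label' file and return (tokenized reviews, labels)."""
    X = []
    y = []
    with open(filename) as file:
        file.readline()  # skip the "review,label" header row
        for line in file:
            line = line.strip()
            if not line:
                continue  # ignore blank lines, e.g. a trailing newline
            # Split on the last comma only, so commas inside a review survive
            review, label = line.rsplit(',', 1)
            X.append(review.split())
            y.append(label)

    return X, y

X_train, y_train = load_data('train.txt')
X_test, y_test = load_data('test.txt')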
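With the two splits loaded, the rest of the question's pipeline can stay as it was, minus train_test_split. The following is a minimal sketch, assuming X_train, y_train, X_test and y_test have the shape returned by load_data. The inline lists are stand-ins built from the sample rows so the snippet runs on its own. Fitting the binarizer on the training tokens only is a choice made here, not something the question does (it fits on all reviews).

```python
import warnings

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Stand-ins for load_data('train.txt') and load_data('test.txt').
X_train = [["The", "picture", "is", "clear", "and", "beautiful"],
           ["Picture", "is", "not", "clear"]]
y_train = ["positive", "negative"]
X_test = [["Colors", "&", "clarity", "is", "superb"],
          ["Sadly", "the", "picture", "is", "not", "nearly", "as", "clear"]]
y_test = ["positive", "negative"]

# Learn the token vocabulary from the training data only.
onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(X_train)

lsvm = LinearSVC()
lsvm.fit(onehot_enc.transform(X_train), y_train)

# Tokens never seen during fit are ignored by transform, which emits a
# warning; it is silenced here to keep the output readable.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    test_accuracy = lsvm.score(onehot_enc.transform(X_test), y_test)

print("Accuracy score of SVM:", test_accuracy)
```

With only two training rows the score itself means little; the point is that the model never sees the test file during fitting.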

How can we supply explicit test and training data to an SVM instead of using the train_test_split function?
