How can we supply explicit training and test data to an SVM instead of using the train_test_split function?

Time: 2018-09-07 07:11:07

Tags: python scikit-learn svm

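I want to supply the training and test datasets to the algorithm explicitly, rather than using train_test_split to divide the data randomly into training and test sets.

I also want to keep the reviews and their labels in the same file when training and testing the model.

Can anyone suggest how to do this?

My code: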

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import average_precision_score
from sklearn.metrics import confusion_matrix

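# One entry per line; splitlines() avoids a trailing empty entry when the file ends with a newline.
with open("/Users/xyz/Desktop/reviews.txt") as f:
    reviews = f.read().splitlines()
with open("/Users/xyz/Desktop/labels.txt") as f:
    labels = f.read().splitlines()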

reviews_tokens = [review.split() for review in reviews]


onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(reviews_tokens)


X_train, X_test, y_train, y_test = train_test_split(reviews_tokens, labels, test_size=0.20, random_state=None)

lsvm = LinearSVC()
lsvm.fit(onehot_enc.transform(X_train), y_train)
accuracy_score = lsvm.score(onehot_enc.transform(X_test), y_test)

print("Accuracy score of SVM:" , accuracy_score)

Test.txt:

review,label
Colors & clarity is superb,positive
Sadly the picture is not nearly as clear or bright as my 40 inch Samsung,negative

Train.txt:

review,label
The picture is clear and beautiful,positive
Picture is not clear,negative

1 answer:

Answer 0: (score: 0)

You can do exactly what you describe. The solution is simple: slice the data yourself.

X_train = reviews_tokens[:number_of_rows_of_train_data]
X_test = reviews_tokens[number_of_rows_of_train_data:]

Do the same for y_train and y_test, slicing labels at the same index.

Of course, you need to know which rows of the file are for training and which are for testing.
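As a minimal sketch of the slicing approach, the snippet below splits both the features and the labels at one index. The toy lists stand in for the question's reviews_tokens and labels, and the value 4 for number_of_rows_of_train_data is purely an assumption for illustration.

```python
# Toy data standing in for reviews_tokens and labels from the question.
reviews_tokens = [
    ["good", "picture"],
    ["bad", "sound"],
    ["great", "colors"],
    ["poor", "clarity"],
    ["nice", "screen"],
]
labels = ["positive", "negative", "positive", "negative", "positive"]

# Assumption: the first 4 rows are training data, the remaining rows are test data.
number_of_rows_of_train_data = 4

# Slice features and labels at the same index so they stay aligned.
X_train = reviews_tokens[:number_of_rows_of_train_data]
X_test = reviews_tokens[number_of_rows_of_train_data:]
y_train = labels[:number_of_rows_of_train_data]
y_test = labels[number_of_rows_of_train_data:]

print(len(X_train), len(X_test))
```

Unlike train_test_split, this keeps the original row order, so the split is only as good as the ordering of the file.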

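If you want to keep the features and labels in the same file, that is no problem. You will need one extra step to separate the labels from the features. This is much easier with pandas.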
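A rough sketch of the pandas route, assuming the file has a "review,label" header row and that no review contains a comma. The data is read from an in-memory string here only to keep the example self-contained; with a real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# In-memory stand-in for Train.txt from the question.
train_file = io.StringIO(
    "review,label\n"
    "The picture is clear and beautiful,positive\n"
    "Picture is not clear,negative\n"
)

train_df = pd.read_csv(train_file)

# Separate the features (tokenized reviews) from the labels.
X_train = [review.split() for review in train_df["review"]]
y_train = train_df["label"].tolist()

print(X_train)
print(y_train)
```

If reviews can contain commas, the file would need quoting or a different delimiter for read_csv to parse it correctly.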

Edit

Given the files you provided, you can load what you need like this:

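def load_data(filename):
    """Read a 'review,label' file and return (tokenized reviews, labels)."""
    X = []
    y = []
    with open(filename) as file:
        file.readline()  # skip the "review,label" header row
        for line in file:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            # Split on the last comma only, so commas inside a review are kept.
            review, label = line.rsplit(',', 1)
            X.append(review.split())
            y.append(label)

    return X, y

X_train, y_train = load_data('train.txt')
X_test, y_test = load_data('test.txt')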
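Once the four lists exist, the rest of the question's code can run without train_test_split. The sketch below assumes scikit-learn is installed and uses literal lists, copied from the sample Train.txt and Test.txt rows, in place of the load_data results. Fitting the encoder on the training tokens only is a choice made here, not something taken from the question, which fits it on all reviews.

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Stand-ins for the output of load_data('train.txt') and load_data('test.txt').
X_train = [
    "The picture is clear and beautiful".split(),
    "Picture is not clear".split(),
]
y_train = ["positive", "negative"]
X_test = [
    "Colors & clarity is superb".split(),
    "Sadly the picture is not nearly as clear or bright as my 40 inch Samsung".split(),
]
y_test = ["positive", "negative"]

# Learn the vocabulary from the training reviews only.
onehot_enc = MultiLabelBinarizer()
onehot_enc.fit(X_train)

lsvm = LinearSVC()
lsvm.fit(onehot_enc.transform(X_train), y_train)

# Tokens not seen during fit are ignored by transform (scikit-learn emits a UserWarning).
predictions = lsvm.predict(onehot_enc.transform(X_test))
accuracy = lsvm.score(onehot_enc.transform(X_test), y_test)

print("Accuracy score of SVM:", accuracy)
```

With only two training rows the score is not meaningful; the point is just that the explicit split drops in where train_test_split was used.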