How do I import CSV files as labeled training data and as test data with targets for a classifier in scikit-learn?

Time: 2017-08-20 07:47:41

Tags: python csv scikit-learn classification

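I have two CSV files, one for training data and one for test data. Both look like this (I am showing only one of them, but they have the same format and the same attribute names):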

Full,Id,Id & PPDB,Id & Words Sequence,Id & Synonyms,Id & Hypernyms,Id & Hyponyms,Gold Standard
1.667,0.476,0.952,0.476,1.429,0.952,0.476,2.345
3.056,1.111,1.667,1.111,3.056,1.389,1.111,1.9
1.765,1.176,1.176,1.176,1.765,1.176,1.176,2.2
0.714,0.714,0.714,0.714,0.714,0.714,0.714,0.0
1.538,0.769,0.769,0.769,1.538,0.769,0.769,2.586
2.188,1.875,1.875,1.875,1.875,2.188,1.875,1.667
3.333,1.333,1.333,1.333,3.333,2.0,1.333,2.8
2.5,1.667,1.667,1.667,2.222,1.944,1.667,2.481

I am new to scikit-learn. The examples I have studied supply the training data with labels and the test data with targets like this:

X_train = np.array(["new york is a hell of a town",
                    "new york was originally dutch",
                    "the big apple is great",
                    "new york is also called the big apple",
                    "nyc is nice",
                    "people abbreviate new york city as nyc",
                    "the capital of great britain is london",
                    "london is in the uk",
                    "london is in england",
                    "london is in great britain",
                    "it rains a lot in london",
                    "london hosts the british museum",
                    "new york is great and so is london",
                    "i like london better than new york"])
y_train_text = [["new york"],["new york"],["new york"],["new york"],["new york"],
                ["new york"],["london"],["london"],["london"],["london"],
                ["london"],["london"],["new york","london"],["new york","london"]]

X_test = np.array(['nice day in nyc',
                   'welcome to london',
                   'london is rainy',
                   'it is raining in britian',
                   'it is raining in britian and the big apple',
                   'it is raining in britian and nyc',
                   'hello welcome to new york. enjoy it here and london too'])
target_names = ['New York', 'London']

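Is it possible to import CSV files containing floats as labeled training data and as test data with targets? Also, I want to use the Gold Standard attribute as the label for my training data and as the target for my test data. If this is possible, how do I load the input? Thanks.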

1 Answer:

Answer 0 (score: 0)

As per @Vivek Kumar's comment, you can get the job done with pandas, using read_csv and iloc like this:

In [12]: import pandas as pd

In [13]: import numpy as np

In [14]: df = pd.read_csv('train.txt')

In [15]: X_train = np.asarray(df.iloc[:, :-1])

In [16]: y_train = np.asarray(df.iloc[:, -1])

In [17]: X_train
Out[17]: 
array([[ 1.667,  0.476,  0.952, ...,  1.429,  0.952,  0.476],
       [ 3.056,  1.111,  1.667, ...,  3.056,  1.389,  1.111],
       [ 1.765,  1.176,  1.176, ...,  1.765,  1.176,  1.176],
       ..., 
       [ 2.188,  1.875,  1.875, ...,  1.875,  2.188,  1.875],
       [ 3.333,  1.333,  1.333, ...,  3.333,  2.   ,  1.333],
       [ 2.5  ,  1.667,  1.667, ...,  2.222,  1.944,  1.667]])

In [18]: y_train
Out[18]: array([ 2.345,  1.9  ,  2.2  ,  0.   ,  2.586,  1.667,  2.8  ,  2.481])

Note that I had previously saved the data you provided to the file train.txt.
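To cover the test file as well, here is a minimal end-to-end sketch, not a definitive implementation. Several things in it are my assumptions rather than part of the question or the answer: the sample rows are read from an in-memory string instead of real files, the first six rows stand in for the training file and the last two for the test file, the helper `split_features_target` is a hypothetical name, and `LinearRegression` is my choice of model. Because Gold Standard holds continuous floats, a regressor seems the more natural fit; scikit-learn classifiers generally expect discrete class labels, so you would likely need to bin or round the values before using one.

```python
import io

import pandas as pd
from sklearn.linear_model import LinearRegression

# Sample rows copied from the question. With real data, pass a file path
# (e.g. 'train.txt') to pd.read_csv instead of an in-memory buffer.
CSV_TEXT = """Full,Id,Id & PPDB,Id & Words Sequence,Id & Synonyms,Id & Hypernyms,Id & Hyponyms,Gold Standard
1.667,0.476,0.952,0.476,1.429,0.952,0.476,2.345
3.056,1.111,1.667,1.111,3.056,1.389,1.111,1.9
1.765,1.176,1.176,1.176,1.765,1.176,1.176,2.2
0.714,0.714,0.714,0.714,0.714,0.714,0.714,0.0
1.538,0.769,0.769,0.769,1.538,0.769,0.769,2.586
2.188,1.875,1.875,1.875,1.875,2.188,1.875,1.667
3.333,1.333,1.333,1.333,3.333,2.0,1.333,2.8
2.5,1.667,1.667,1.667,2.222,1.944,1.667,2.481
"""


def split_features_target(df, target="Gold Standard"):
    """Hypothetical helper: return (X, y) with `target` as the label column."""
    X = df.drop(columns=[target]).to_numpy()
    y = df[target].to_numpy()
    return X, y


df = pd.read_csv(io.StringIO(CSV_TEXT))

# Assumption for illustration only: slice one table into two parts.
# With two real files, call pd.read_csv once per file instead.
train_df, test_df = df.iloc[:6], df.iloc[6:]

X_train, y_train = split_features_target(train_df)
X_test, y_test = split_features_target(test_df)

# Gold Standard is continuous, so this sketch fits a regressor.
model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print(predictions.shape)  # one prediction per test row
```

Selecting the label by column name rather than by position (`iloc[:, -1]`) keeps the split correct even if the column order differs between the two files.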