How do I import CSV or ARFF data into scikit-learn?

Date: 2018-02-01 23:02:19

Tags: python csv scikit-learn arff

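I have two datasets, in CSV and ARFF format, that I have been using with classification models in Weka. I would like to know whether these formats can be used in scikit-learn so that I can try other classification methods in Python.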

This is what my dataset looks like:

ASSAY_CHEMBLID ... MDEN.23 ... MA,TARGET_TYPE ......不...... MA,TARGET_TYPE ... ... APOL MA,TARGET_TYPE ... ATSm5 ... MA,TARGET_TYPE ... SCH.6。 ..MA,TARGET_TYPE ... SPC.6 ... MA,TARGET_TYPE ... SP.3 ... MA,TARGET_TYPE ... MDEN.12 ... MA,TARGET_TYPE ... MDEN.22 ... MA,TARGET_TYPE ... MLogP ... MA,TARGET_TYPE ... R ... MA,TARGET_TYPE ... ...摹MA,TARGET_TYPE ......我...... MA,机体...不... MA,机体... ... C2SP1 MA,机体... ... VC.6 MA,机体... ...化工网MA,机体... ... khs.aasC MA,机体... MDEC.12 ... MA,机体... ... MDEC.13 MA,机体... ... MDEC.23 MA,机体... ... MDEC.33 MA,机体... MDEO。 11 ... MA,机体... ... MDEN.22 MA,机体... ... topoShape MA,机体... ... WPATH MA,机体。P ... MA,LIJ
0.202796,0.426972,0.117596,0.143818,0.072542,0.158172,0.136301,0.007245,0.016986,0.488281,0.300438,0.541931,0.644161,0.048149,0.02002,0,0.503415,0.153457,0.288099,0.186024,0.216833,0.184642,0,0.011592,0.00089, 0,0.209406,0

where Lij is my class identifier (0 or 1). I would like to know whether the data first needs to be converted to numpy.

1 answer:

Answer 0: (score: 1)

To read ARFF files you need to install liac-arff. See the link for details. Once it is installed, use the following code to read an ARFF file:

import arff  # provided by the liac-arff package
import numpy as np

# Read the ARFF file. arff.load returns a dictionary whose "data"
# key holds the rows as a list of lists.
with open("file.arff") as f:
    dataDictionary = arff.load(f)
# No explicit f.close() is needed: the with block closes the file.

# Extract the rows and convert them to a numpy array
arffData = np.array(dataDictionary['data'])

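There are several ways to read CSV data. I find the simplest is the read_csv function from the Python module Pandas. See the link for installation details. The code to read a CSV data file is below: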
# read csv data
import pandas as pd
csvData = pd.read_csv("filename.csv",sep=',').values
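The dataset in the question appears to have a header row and some text columns (for example ASSAY_CHEMBLID), which would turn the result of `.values` into an object array. The following is a rough sketch, assuming a CSV with a header row, of how the numeric columns could be kept and the rest dropped. The file contents and the identifier values are invented for illustration.

```python
import io

import pandas as pd

# Made-up CSV text standing in for "filename.csv"; the identifier
# values are placeholders, not real records.
CSV_TEXT = """ASSAY_CHEMBLID,MLogP,WPATH,LIJ
ID_A,0.20,0.43,0
ID_B,0.12,0.14,1
ID_C,0.07,0.16,0
"""

frame = pd.read_csv(io.StringIO(CSV_TEXT))

# Keep only numeric columns so the resulting array has a numeric dtype.
numeric = frame.select_dtypes(include="number")
csvData = numeric.to_numpy()

print(list(numeric.columns))
print(csvData.shape)
```

Whether dropping the text columns is appropriate depends on the data; categorical columns that carry signal would need to be encoded rather than discarded.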

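In either case you will have a numpy array containing the data. Because the last column holds the class (also called the target, ground truth, or label), you need to split the data into a feature array X and a target vector y, e.g.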

X = arffData[:, :-1]
y = arffData[:, -1]

where X contains all of arffData except the last column, and y contains the last column of arffData.
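One thing to watch for: if the class attribute is nominal in the ARFF file, its values may come back as strings rather than numbers, so it can help to convert the types explicitly after slicing. A small sketch, using a hand-written list of rows as a stand-in for the loaded data:

```python
import numpy as np

# Stand-in for the rows loaded from the ARFF file: numeric features
# followed by a class label stored as a string (an assumption about
# how nominal values are returned).
rows = [
    [0.20, 0.43, "0"],
    [0.12, 0.14, "1"],
    [0.07, 0.16, "0"],
]
data = np.array(rows, dtype=object)

# Slice as before, then force numeric dtypes.
X = data[:, :-1].astype(float)
y = data[:, -1].astype(int)

print(X.dtype, X.shape)
print(y)
```

If the conversion raises an error, that usually points to a non-numeric column among the features that needs to be dropped or encoded first.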

Now you can use any supervised binary classifier from scikit-learn.
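As an illustration of that last step, here is a minimal sketch that trains one possible classifier, logistic regression, on a held-out split. The data is randomly generated as a stand-in for X and y, and the choice of model and parameters is only an example, not a recommendation for this dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real X and y: 200 rows, 5 features,
# with a binary label derived from the first two features.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Hold out a quarter of the rows for evaluation, keeping class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Other scikit-learn classifiers follow the same fit, predict, and score pattern, so swapping in a different model usually means changing only the line that constructs `clf`.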