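I have two datasets, in CSV and ARFF format, that I have been using with classification models in Weka. I would like to know whether these formats can be used in scikit-learn so that I can try other classification methods in Python.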
This is what my dataset looks like (header row, then one data row):
ASSAY_CHEMBLID ... MDEN.23 ... MA, TARGET_TYPE ... 不 ... MA, TARGET_TYPE ... APOL ... MA, TARGET_TYPE ... ATSm5 ... MA, TARGET_TYPE ... SCH.6 ... MA, TARGET_TYPE ... SPC.6 ... MA, TARGET_TYPE ... SP.3 ... MA, TARGET_TYPE ... MDEN.12 ... MA, TARGET_TYPE ... MDEN.22 ... MA, TARGET_TYPE ... MLogP ... MA, TARGET_TYPE ... R ... MA, TARGET_TYPE ... 摹 ... MA, TARGET_TYPE ... 我 ... MA, ORGANISM ... 不 ... MA, ORGANISM ... C2SP1 ... MA, ORGANISM ... VC.6 ... MA, ORGANISM ... 化工网 ... MA, ORGANISM ... khs.aasC ... MA, ORGANISM ... MDEC.12 ... MA, ORGANISM ... MDEC.13 ... MA, ORGANISM ... MDEC.23 ... MA, ORGANISM ... MDEC.33 ... MA, ORGANISM ... MDEO.11 ... MA, ORGANISM ... MDEN.22 ... MA, ORGANISM ... topoShape ... MA, ORGANISM ... WPATH ... MA, ORGANISM ... P ... MA, LIJ
0.202796,0.426972,0.117596,0.143818,0.072542,0.158172,0.136301,0.007245,0.016986,0.488281,0.300438,0.541931,0.644161,0.048149,0.02002,0,0.503415,0.153457,0.288099,0.186024,0.216833,0.184642,0,0.011592,0.00089,0,0.209406,0
Here Lij is my class identifier (0 or 1). I would like to know whether the data needs to be converted to numpy first.
Answer 0 (score: 1)
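To read an ARFF file, you need to install liac-arff; see the liac-arff project page for details. Once it is installed, use the following code to read the ARFF file: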
import arff
import numpy as np
# read arff data
with open("file.arff") as f:
    # load reads the ARFF file as a dictionary with
    # the data as a list of lists at key "data"
    dataDictionary = arff.load(f)
# no f.close() needed: the with block closes the file
# extract data and convert to numpy array
arffData = np.array(dataDictionary['data'])
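There are several ways to read CSV data; the simplest I have found is the read_csv function from the pandas Python module. See the pandas documentation for installation details.
The code to read a CSV data file is: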
# read csv data
import pandas as pd
csvData = pd.read_csv("filename.csv",sep=',').values
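If the CSV contains identifier or text columns, `.values` will give you an object array rather than a numeric one. Here is a rough sketch of checking the column types and keeping only the numeric columns first. The inline CSV text and its column names are invented for illustration, and `io.StringIO` only stands in for a real file path.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for filename.csv
CSV_TEXT = """MLogP,WPATH,Lij
0.20,0.43,0
0.12,0.14,1
0.30,0.05,0
"""

frame = pd.read_csv(io.StringIO(CSV_TEXT))

# Any ID or text column would show up here with a non-numeric dtype
print(frame.dtypes)

# Keep only numeric columns, then convert to a NumPy array
numeric = frame.select_dtypes(include="number")
csv_array = numeric.to_numpy()
print(csv_array.shape)
```

`to_numpy()` is the method pandas currently recommends over `.values`; for an all-numeric frame like this one the two give the same array.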
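In either case, you end up with a numpy array containing your data. Since the last column holds the class (also called the target, ground truth, or label), you need to split the data into a feature array X and a target vector y, e.g.: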
X = arffData[:, :-1]
y = arffData[:, -1]
where X contains every column of arffData except the last, and y contains the last column of arffData. The same slicing works for csvData.
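To make the slicing concrete, here is a small sketch on a made-up 3x3 array (two feature columns plus a 0/1 label column). Casting the labels to int is my own suggestion rather than a requirement, since it keeps the classes as clean 0 and 1 instead of 0.0 and 1.0.

```python
import numpy as np

# Hypothetical array: two feature columns followed by a 0/1 label column
data = np.array([
    [0.20, 0.43, 0.0],
    [0.12, 0.14, 1.0],
    [0.30, 0.05, 0.0],
])

X = data[:, :-1]             # every column except the last
y = data[:, -1].astype(int)  # last column, cast to integer class labels

print(X.shape, y.shape)
# Class values and how many rows fall in each class
print(np.unique(y, return_counts=True))
```

Checking the class counts at this point is a cheap way to spot a heavily imbalanced dataset before training.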
Now you can use any supervised learning binary classifier from scikit-learn.
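As an illustration, here is a minimal sketch of training one such classifier. `LogisticRegression` is just one possible choice, not a recommendation for your data. The random features and the labelling rule are synthetic stand-ins, so replace X and y with the arrays you built above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 200 rows, 5 features in [0, 1)
rng = np.random.default_rng(0)
X = rng.random((200, 5))
# Hypothetical labelling rule so the classifier has something to learn
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

# Hold out 25% of the rows, keeping the class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Fraction of held-out rows classified correctly
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Other scikit-learn classifiers follow the same `fit` / `predict` / `score` pattern, so swapping in a different model usually only changes the line that constructs `clf`.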