I am trying to open this dataset: https://www.kaggle.com/dalpozz/creditcardfraud
in an IPython notebook. I tried:
data = pd.read_csv("...Desktop/creditcard.csv")
and got:
CParserError: Error tokenizing data. C error: out of memory
I then tried the solution suggested by Noobie in: Error tokenizing data. C error: out of memory pandas python, large file csv
Now the data loads. However, my data now looks like a matrix:
entry 0,0: blank;
entry 0,1: All the headers are here;
entry 1,0: 0
entry 1,1: A whole line of unseparated data here
entry 2,0: 1
entry 2,1: A whole line of unseparated data here
...
How can I get the data formatted properly, with one column per field?
My implementation:
import pandas as pd

mylist = []
# Read the CSV in chunks of 2000 rows to limit peak memory use
for chunk in pd.read_csv('.../Desktop/creditcard.csv', sep=',', chunksize=2000):
    mylist.append(chunk)
data = pd.concat(mylist, axis=0)
del mylist
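For comparison, here is a minimal self-contained sketch of the same chunked-read pattern on a small in-memory sample. The sample rows, the reduced column set, and the chunk size of 2 are made up for illustration and are not taken from creditcard.csv. On well-formed comma-separated input like this, I would expect one column per field:

```python
import io

import pandas as pd

# Hypothetical four-column sample standing in for the real file.
sample = io.StringIO(
    '"Time","V1","Amount","Class"\n'
    '0,-1.36,149.62,"0"\n'
    '0,1.19,2.69,"0"\n'
    '1,-1.35,378.66,"0"\n'
)

# Read two rows at a time, then stitch the chunks back together.
chunks = []
for chunk in pd.read_csv(sample, sep=",", chunksize=2):
    chunks.append(chunk)
data = pd.concat(chunks, axis=0, ignore_index=True)

print(data.shape)          # (3, 4)
print(list(data.columns))  # ['Time', 'V1', 'Amount', 'Class']
```

With my real file, the same pattern instead gives a single column holding each whole line, so the difference presumably lies in how the lines of the file are delimited or quoted rather than in the chunking itself.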
A few lines of the data:
Line 1: Time,"V1","V2","V3","V4","V5","V6","V7","V8","V9","V10","V11","V12","V13","V14","V15","V16","V17","V18","V19","V20","V21","V22","V23","V24","V25","V26","V27","V28","Amount","Class"
Line 2:
0,-1.3598071336738,-0.0727811733098497,2.53634673796914,1.37815522427443,-0.338320769942518,0.462387777762292,0.239598554061257,0.0986979012610507,0.363786969611213,0.0907941719789316,-0.551599533260813,-0.617800855762348,-0.991389847235408,-0.311169353699879,1.46817697209427,-0.470400525259478,0.207971241929242,0.0257905801985591,0.403992960255733,0.251412098239705,-0.018306777944153,0.277837575558899,-0.110473910188767,0.0669280749146731,0.128539358273528,-0.189114843888824,0.133558376740387,-0.0210530534538215,149.62,"0"