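This is my first question on Stack Overflow, and I have been struggling with this problem for a whole day.
When I load a large .csv file using pandas.read_csv() with the chunksize option, I get inconsistent results, as if the data read on each iteration of the loop were not fully independent of the other iterations. In addition, the data are only read and processed correctly on the first iteration. Here is a simplified example I created that shows the problem: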
import pandas as pd
import numpy as np
a = pd.DataFrame(np.random.randn(500, 1), columns=list('A'))
b = pd.DataFrame(np.random.randn(500, 1), columns=list('B'))
c = pd.DataFrame(np.random.randn(500, 1), columns=list('C'))
c.to_csv("./c.csv", index=False, sep="\t")
i = 1
for data in pd.read_csv("./c.csv", delimiter='\t', chunksize=200):
    print("\n\nIteration No.:" + str(i))
    print("First five elements of data before concatenation: \n" + repr(data.loc[:5,'C']))
    print("First element of a: " + str(a['A'][0]) + ". Type:" + repr(type(a['A'][0])))
    print("First element of b: " + str(b['B'][0]) + ". Type:" + repr(type(b['B'][0])))
    print("First element of data: " + str(data['C'].iloc[0]) + ". Type:" + repr(type(data['C'].iloc[0])))
    data['C'] = a['A'].map(str) + b['B'].map(str) + data['C'].map(str)
    print("\n\nFirst five elements of data after concatenation: \n" + repr(data.loc[:5,'C']))
    i += 1
The output of this snippet is as follows:
Iteration No.:1
First five elements of data before concatenation:
0 0.272127
1 1.702455
2 0.073175
3 -1.415413
4 0.023546
5 -0.706802
Name: C, dtype: float64
First element of a: -1.28607575146. Type:<class 'numpy.float64'>
First element of b: 0.778682866114. Type:<class 'numpy.float64'>
First element of data: 0.27212690258. Type:<class 'numpy.float64'>
First five elements of data after concatenation:
0 -1.28607575145810920.77868286611354330.2721269...
1 0.242774791222815281.29275536671509881.7024547...
2 0.4524774082028631-1.17833662685619570.0731746...
3 1.4351094358436494-0.5173279482942412-1.415413...
4 -1.7578744077531847-1.59454228118368470.023546...
5 -0.50656599412173-0.3809749686364225-0.7068022...
Name: C, dtype: object
Iteration No.:2
First five elements of data before concatenation:
Series([], Name: C, dtype: float64)
First element of a: -1.28607575146. Type:<class 'numpy.float64'>
First element of b: 0.778682866114. Type:<class 'numpy.float64'>
First element of data: 0.995788479453. Type:<class 'numpy.float64'>
First five elements of data after concatenation:
Series([], Name: C, dtype: object)
Iteration No.:3
First five elements of data before concatenation:
Series([], Name: C, dtype: float64)
First element of a: -1.28607575146. Type:<class 'numpy.float64'>
First element of b: 0.778682866114. Type:<class 'numpy.float64'>
First element of data: -0.188555175182. Type:<class 'numpy.float64'>
First five elements of data after concatenation:
Series([], Name: C, dtype: object)
As you can see, data.loc[:5,'C'] yields an empty Series on the second and third iterations, whereas data['C'].iloc[0] always yields a non-empty value.
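One possible explanation, offered here only as a sketch: each chunk may keep the row labels of the original file, so the second chunk would start at label 200 rather than 0. If so, the label-based data.loc[:5] has nothing to select there, while the position-based .iloc still does. The snippet below checks this on an in-memory CSV. The names csv_text, first_labels, loc_lengths and iloc_lengths are made up for illustration, and the values are a simple 0..499 ramp rather than the random data above.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-in for c.csv: 500 rows, one column "C", tab-separated.
csv_text = pd.DataFrame({"C": np.arange(500, dtype=float)}).to_csv(index=False, sep="\t")

first_labels = []   # first index label of each chunk
loc_lengths = []    # rows selected by the label-based slice .loc[:5]
iloc_lengths = []   # rows selected by the position-based slice .iloc[:6]

for chunk in pd.read_csv(io.StringIO(csv_text), delimiter="\t", chunksize=200):
    first_labels.append(int(chunk.index[0]))
    loc_lengths.append(len(chunk.loc[:5, "C"]))
    iloc_lengths.append(len(chunk["C"].iloc[:6]))

print("first label per chunk:", first_labels)
print("rows from .loc[:5]:   ", loc_lengths)
print("rows from .iloc[:6]:  ", iloc_lengths)
```

If the assumption holds, only the first chunk contains labels 0 to 5, which would match the empty Series seen on iterations 2 and 3. Position-based access such as data['C'].iloc[:6] or data['C'].head(6) does not depend on the labels.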
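I have tried upgrading pandas to the latest version (0.19.2) on Python 3.5.3. I also downgraded to Python 2.7.12 with pandas 0.19.0, and no dice.
Any help would be greatly appreciated. Thank you very much in advance!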