Inconsistent behavior when reading a CSV in chunks with pandas

Asked: 2017-03-08 21:05:07

Tags: python csv pandas chunks

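This is my first question on Stack Overflow, after struggling with this problem for a whole day. When I load a large .csv file using the chunksize option of pandas.read_csv(), I get inconsistent results, as if the data read on each iteration of the loop were not fully independent of the other iterations. In addition, the data is read and processed correctly only on the first iteration. Here is a simplified example I created that shows this: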

 import pandas as pd
 import numpy as np

 a = pd.DataFrame(np.random.randn(500, 1), columns=list('A'))
 b = pd.DataFrame(np.random.randn(500, 1), columns=list('B'))
 c = pd.DataFrame(np.random.randn(500, 1), columns=list('C'))
 c.to_csv("./c.csv", index=False, sep="\t")

 i = 1

 for data in pd.read_csv("./c.csv", delimiter='\t', chunksize=200):
     print("\n\nIteration No.:" + str(i))
     print("First five elements of data before concatenation: \n" + repr(data.loc[:5, 'C']))

     print("First element of a: " + str(a['A'][0]) + ". Type:" + repr(type(a['A'][0])))
     print("First element of b: " + str(b['B'][0]) + ". Type:" + repr(type(b['B'][0])))
     print("First element of data: " + str(data['C'].iloc[0]) + ". Type:" + repr(type(data['C'].iloc[0])))

     # Concatenate the string forms of the three columns into column C.
     data['C'] = a['A'].map(str) + b['B'].map(str) + data['C'].map(str)
     print("\n\nFirst five elements of data after concatenation: \n" + repr(data.loc[:5, 'C']))

     i += 1  # advance the iteration counter printed above

The output of this snippet is as follows:

   Iteration No.:1
   First five elements of data before concatenation: 
   0    0.272127
   1    1.702455
   2    0.073175
   3   -1.415413
   4    0.023546
   5   -0.706802
   Name: C, dtype: float64
   First element of a: -1.28607575146. Type:<class 'numpy.float64'>
   First element of b: 0.778682866114. Type:<class 'numpy.float64'>
   First element of data: 0.27212690258. Type:<class 'numpy.float64'>


   First five elements of data after concatenation: 
   0    -1.28607575145810920.77868286611354330.2721269...
   1    0.242774791222815281.29275536671509881.7024547...
   2    0.4524774082028631-1.17833662685619570.0731746...
   3    1.4351094358436494-0.5173279482942412-1.415413...
   4    -1.7578744077531847-1.59454228118368470.023546...
   5    -0.50656599412173-0.3809749686364225-0.7068022...
   Name: C, dtype: object


   Iteration No.:2
   First five elements of data before concatenation: 
   Series([], Name: C, dtype: float64)
   First element of a: -1.28607575146. Type:<class 'numpy.float64'>
   First element of b: 0.778682866114. Type:<class 'numpy.float64'>
   First element of data: 0.995788479453. Type:<class 'numpy.float64'>


   First five elements of data after concatenation: 
   Series([], Name: C, dtype: object)


   Iteration No.:3
   First five elements of data before concatenation: 
   Series([], Name: C, dtype: float64)
   First element of a: -1.28607575146. Type:<class 'numpy.float64'>
   First element of b: 0.778682866114. Type:<class 'numpy.float64'>
   First element of data: -0.188555175182. Type:<class 'numpy.float64'>


   First five elements of data after concatenation: 
   Series([], Name: C, dtype: object)           

As you can see, data.loc[:5, 'C'] yields an empty Series on the second and third iterations, whereas data['C'].iloc[0] always yields a non-empty value.
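One way to probe this symptom is to inspect the row labels of each chunk. The sketch below is not taken from the original post. It assumes that each chunk keeps the row labels of the full file (so the second chunk starts at label 200 rather than 0), which would explain why a label-based slice such as .loc[:5] comes back empty while the positional .iloc[0] still finds data. The in-memory buffer `buf` and the `summary` list are illustrative names, and the seeded random data merely stands in for c.csv.

```python
import io

import numpy as np
import pandas as pd

# Build an in-memory, tab-separated CSV with 500 rows, standing in for c.csv.
rng = np.random.RandomState(0)
buf = io.StringIO()
pd.DataFrame({"C": rng.randn(500)}).to_csv(buf, index=False, sep="\t")
buf.seek(0)

summary = []
for chunk in pd.read_csv(buf, delimiter="\t", chunksize=200):
    first_label = int(chunk.index[0])   # first row label in this chunk
    last_label = int(chunk.index[-1])   # last row label in this chunk
    by_label = chunk.loc[:5, "C"]       # label-based: rows whose label is <= 5
    by_position = chunk["C"].iloc[:5]   # position-based: first five rows of the chunk
    summary.append((first_label, last_label, len(by_label), len(by_position)))
    print(first_label, last_label, len(by_label), len(by_position))
```

If the assumption holds, only the first chunk contains labels 0 through 5, so the label slice returns six rows there and none afterwards, while the positional slice returns five rows for every chunk. Under that reading, swapping .loc[:5, 'C'] for a positional slice (or calling reset_index(drop=True) on each chunk) would be one way to print the head of each chunk.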

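I have tried upgrading pandas to the latest version (0.19.2) on Python 3.5.3. I also downgraded to Python 2.7.12 with pandas 0.19.0, with no luck.

Any help would be greatly appreciated. Thank you very much in advance!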

0 answers:

No answers