Question

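I have a very large CSV file that I iterate over using pandas' chunking feature. The problem: with chunksize=2, it skips the first 2 rows, and the first chunk I receive contains rows 3-4.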

Basically, if I read the CSV with nrows=4, I get the first 4 rows, whereas reading the same file with chunksize=2 gives me rows 3 and 4 first, then 5 and 6, ...

#1. Read with nrows  
#read first 4 rows in csv files and merge date and time column to be used as index
reader = pd.read_csv('filename.csv', delimiter=',', parse_dates={"Datetime" : [1,2]}, index_col=[0], nrows=4)

print (reader)

01/01/2016 - 09:30 - A - 100
01/01/2016 - 13:30 - A - 110
01/01/2016 - 15:30 - A - 120
02/01/2016 - 10:30 - A - 115

#2. Iterate over csv file with chunks
#iterate over csv file in chunks and merge date and time column to be used as index
reader = pd.read_csv('filename.csv', delimiter=',', parse_dates={"Datetime" : [1,2]}, index_col=[0], chunksize=2)

for chunk in reader:

    #create a dataframe from chunks
    df = reader.get_chunk()
    print (df)

01/01/2016 - 15:30 - A - 120
02/01/2016 - 10:30 - A - 115

Increasing chunksize to 10 skips the first 10 rows.

Any ideas on how I can fix this? I already have a working workaround, but I would like to understand where I went wrong.

Thanks for any input!

Answer 1

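Don't call get_chunk. You already have your chunk because you are iterating over the reader, i.e. chunk is your DataFrame. Call print(chunk) inside the loop and you should see the expected output.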
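To make the difference concrete, here is a minimal sketch. It assumes a small in-memory CSV (the `csv_text` data and its `id`/`value` columns are made up for illustration and stand in for filename.csv). The first loop uses the chunk that iteration already yields; the second repeats the pattern from the question, where each `get_chunk()` call reads one more chunk from the file, so every other chunk is never seen.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for filename.csv (8 data rows).
csv_text = """id,value
A,100
B,110
C,120
D,115
E,130
F,125
G,140
H,135
"""

# Correct: iterating over the reader already yields each chunk as a DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
chunks = []
for chunk in reader:
    chunks.append(chunk)
    print(chunk)

# Pattern from the question: the loop reads one chunk, then get_chunk()
# reads the next one, so the chunk provided by the loop is thrown away.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)
skipped = []
for chunk in reader:
    skipped.append(reader.get_chunk())
```

Here `chunks` holds all four 2-row chunks, while `skipped` holds only the second and fourth, which matches the "first chunk is rows 3-4" symptom described in the question.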

As @MaxU pointed out in the comments, you need get_chunk if you want chunks of different sizes: reader.get_chunk(500), reader.get_chunk(100), etc.
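A rough sketch of that variable-size use, again on made-up in-memory data (`csv_text` is hypothetical, and the sizes 3, 1 and 2 are arbitrary). It assumes the reader is opened with `iterator=True` so that `get_chunk` can be called directly with a row count instead of looping over the reader.

```python
import io

import pandas as pd

# Hypothetical sample data (6 data rows).
csv_text = """id,value
A,100
B,110
C,120
D,115
E,130
F,125
"""

# iterator=True returns a reader object without fixing a chunk size up front.
reader = pd.read_csv(io.StringIO(csv_text), iterator=True)

first = reader.get_chunk(3)   # rows 1-3
second = reader.get_chunk(1)  # row 4
rest = reader.get_chunk(2)    # rows 5-6

print(len(first), len(second), len(rest))
```

Each call continues from where the previous one stopped, so no rows are skipped as long as `get_chunk` is the only thing reading from the reader.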
