Question

Because our dataset is very large, we process it with pandas in chunks, but we ran into a very strange behavior. The pseudocode looks like this:

import pandas as pd

reader = pd.read_csv(IN_FILE, chunksize=1000, engine='c')
for chunk in reader:
    result = []
    for line in chunk.values.tolist():
        # This involves very complicated processing; this is just a simplified version.
        temp = complicated_process(chunk)
        result.append(temp)
    chunk['new_series'] = pd.Series(result)
    chunk.to_csv(OUT_FILE, index=False, mode='a')

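We have confirmed that result is non-empty in every iteration. However, the line chunk['new_series'] = pd.Series(result) only produces values in the first iteration of the loop; in every later iteration the column is empty. As a result, only the first chunk of the output contains new_series values, and the rest are empty.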

Are we missing something here? Thanks in advance.

Answer 1

You should declare result above your loop; otherwise you are just re-initializing it with each chunk.

result = []
for chunk in reader:
    ...

Your previous approach is functionally equivalent to:

for chunk in reader:
    del result  # because it is being re-assigned on the following line.
    result = []
    result.append(something)
print(result)  # Only shows result from last chunk in reader (the last loop).

In addition, I would suggest:

chunk = chunk.assign(new_series=result)  # Instead of `chunk['new_series'] = pd.Series(result)`.

I assume you are doing something with the for loop variable line, even though it is not used in the example above.
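As a rough illustration of the two points above (using the loop variable, and assigning with assign), here is a minimal sketch. The function complicated_process and the columns a and b are hypothetical stand-ins, not the asker's real code, and the chunk is built by hand with a non-zero index to mimic a later chunk of the file.

```python
import pandas as pd


def complicated_process(row):
    # Hypothetical stand-in for the real per-row computation.
    return row.a + row.b


# Hand-built frame that mimics a later chunk: its index does not start at 0.
chunk = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]}, index=[1000, 1001, 1002])

# Build one result per row, actually using the loop variable.
result = [complicated_process(row) for row in chunk.itertuples(index=False)]

# A plain list is assigned by position, so the chunk's index labels do not matter.
chunk = chunk.assign(new_series=result)
print(chunk)
```

Passing a list rather than a pd.Series is the design choice that matters here: a list has no index, so pandas has nothing to align on and simply requires the length to match the number of rows.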

Answer 2

A better solution is:

import pandas as pd

reader = pd.read_csv(IN_FILE, chunksize=1000, engine='c')
for chunk in reader:
    result = []
    for line in chunk.values.tolist():
        # This involves very complicated processing; this is just a simplified version.
        temp = complicated_process(chunk)
        result.append(temp)
    # Give this chunk its own index starting at 0 before attaching the new column.
    new_chunk = chunk.reset_index()
    new_chunk = new_chunk.assign(new_series=result)
    new_chunk.to_csv(OUT_FILE, index=False, mode='a')

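Note: the index of each chunk is not independent; it is derived from the whole file. With chunksize=1000, for example, the second chunk is indexed 1000 to 1999, while a freshly built pd.Series(result) is indexed from 0. If we attach a new Series in each iteration, chunk still carries the index inherited from the whole file, so from the second chunk onward the chunk's index does not match the index of the new Series and the assigned column comes out empty.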
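The index mismatch described in the note can be reproduced on a tiny in-memory file. The sketch below is only an illustration: the column x, the doubling step, and the chunk size of 3 are made up for the demo. It compares a Series with the default index against one that reuses the chunk's own index.

```python
import io

import pandas as pd

# Six data rows read three at a time, so the second chunk is indexed 3..5.
csv_text = "x\n" + "\n".join(str(i) for i in range(6)) + "\n"
reader = pd.read_csv(io.StringIO(csv_text), chunksize=3)

misaligned_nan_counts = []
aligned_nan_counts = []
for chunk in reader:
    result = [v * 2 for v in chunk["x"]]  # made-up per-row computation

    # Default index 0..2: pandas aligns it by label against the chunk's index.
    misaligned = chunk.assign(new_series=pd.Series(result))
    # Same values, but labelled with the chunk's own index.
    aligned = chunk.assign(new_series=pd.Series(result, index=chunk.index))

    misaligned_nan_counts.append(int(misaligned["new_series"].isna().sum()))
    aligned_nan_counts.append(int(aligned["new_series"].isna().sum()))
    print(list(chunk.index), misaligned["new_series"].tolist())
```

In this sketch the first chunk shows no missing values either way, while the second chunk is entirely NaN in the misaligned version, which matches the symptom in the question.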

@Alexander's solution works, but result can grow very large and therefore consume too much memory.

The new solution here resets each chunk's index with new_chunk = chunk.reset_index(), and result is reset inside each loop iteration. This saves a lot of memory.
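For reference, an end-to-end version of this approach might look like the sketch below. It departs from the code above in two places, both of which are my assumptions rather than part of the original answer: reset_index(drop=True) is used so the old index is not written out as an extra column, and the header is written only for the first chunk. The names process_in_chunks and complicated_process and the column x are hypothetical.

```python
import os
import tempfile

import pandas as pd


def complicated_process(row):
    # Hypothetical placeholder for the real per-row computation.
    return row.x * 2


def process_in_chunks(in_file, out_file, chunksize=1000):
    reader = pd.read_csv(in_file, chunksize=chunksize)
    for i, chunk in enumerate(reader):
        # result is rebuilt for every chunk, so memory use stays bounded.
        result = [complicated_process(row) for row in chunk.itertuples(index=False)]
        new_chunk = chunk.reset_index(drop=True).assign(new_series=result)
        # Overwrite on the first chunk, append afterwards; write the header once.
        new_chunk.to_csv(
            out_file,
            index=False,
            mode="w" if i == 0 else "a",
            header=(i == 0),
        )


# Small demonstration with temporary files.
tmp_dir = tempfile.mkdtemp()
in_path = os.path.join(tmp_dir, "in.csv")
out_path = os.path.join(tmp_dir, "out.csv")
pd.DataFrame({"x": range(5)}).to_csv(in_path, index=False)

process_in_chunks(in_path, out_path, chunksize=2)
out = pd.read_csv(out_path)
print(out)
```

Without header=(i == 0), every appended chunk would repeat the column names in the middle of the output file, which is easy to miss when only the first chunk is inspected.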

pandas.read_csv function with the chunksize option
