How to deal with a pandas MemoryError when using to_csv?

Date: 2019-06-29 20:18:00

Tags: python pandas csv dataframe

I'm currently running a script on a Linux system. The script reads a csv of roughly 6000 rows into a dataframe. The script's job is to take a dataframe such as:

@PostMapping("/player")
public void setPlayersList(@RequestBody String[] players) {
    for(int i = 0; i<players.length; i++) {
        playersList.add(players[i]);
    }
    System.out.println(Arrays.toString(playersList.toArray()));
}

to:

name       children                 childName
Bob        [Jeremy, Nancy, Laura]   Jeremy
Bob        [Jeremy, Nancy, Laura]   Nancy
Bob        [Jeremy, Nancy, Laura]   Laura
Jennifer   [Kevin, Aaron]           Kevin
Jennifer   [Kevin, Aaron]           Aaron

and write it to another file (the original csv remains untouched).

Basically, add a new column and create one row per item in the list. Note that I'm actually dealing with a dataframe of 7 columns, but for demonstration purposes I'm using this smaller example. The columns in the actual csv are all strings, except two that hold lists.
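(As a minimal sketch of this unfold on the small example, assuming the children column already holds Python lists rather than strings, pandas' DataFrame.explode, available from pandas 0.25 onwards, produces exactly this shape:)

import pandas as pd

# minimal sketch of the desired unfold with DataFrame.explode (pandas >= 0.25);
# assumes the children column holds Python lists, not their string form
df = pd.DataFrame({
    "name": ["Bob", "Jennifer"],
    "children": [["Jeremy", "Nancy", "Laura"], ["Kevin", "Aaron"]],
})
df["childName"] = df["children"]
print(df.explode("childName"))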

This is my code:

import ast
import os
import pandas as pd

cwd = os.path.abspath(__file__+"/..")
data = pd.read_csv(cwd+"/folded_data.csv", sep='\t', encoding="latin1")
output_path = cwd+"/unfolded_data.csv"

out_header = ["name", "children", "childName"]
count = len(data)
for idx, e in data.iterrows():
    print("Row ", idx, " out of ", count)
    entry = e.values.tolist()
    c_lst = ast.literal_eval(entry[1])

    for c in c_lst:
        n_entry = entry + [c]
        # the output file is re-read and re-written for every single child
        if os.path.exists(output_path):
            output = pd.read_csv(output_path, sep='\t', encoding="latin1")
        else:
            output = pd.DataFrame(columns=out_header)

        output.loc[len(output)] = n_entry
        output.to_csv(output_path, sep='\t', index=False)

But I'm getting the following error:


Traceback (most recent call last):
  File "fileUnfold.py", line 31, in <module>
    output.to_csv(output_path, sep='\t', index=False)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 3020, in to_csv
    formatter.save()
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/formats/csvs.py", line 172, in save
    self._save()
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/formats/csvs.py", line 288, in _save
    self._save_chunk(start_i, end_i)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/formats/csvs.py", line 315, in _save_chunk
    self.cols, self.writer)
  File "pandas/_libs/writers.pyx", line 75, in pandas._libs.writers.write_csv_rows
MemoryError

Is there another way to do what I want without running into this error?

EDIT: here is the csv file, if you want to take a look: https://media.githubusercontent.com/media/lucas0/Annotator/master/annotator/data/folded_snopes.csv

EDIT2: I'm currently using

with open(output_path, 'w+') as f:
    output.to_csv(f, index=False, header=True, sep='\t')

Around row 98, the program starts to slow down noticeably. I'm fairly sure that's because I read the file over and over as it grows. How do I just append a row to the file without reading it first?
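(A minimal sketch of such an append, assuming a hypothetical output_path and the small example's columns: mode='a' adds rows without ever reading the existing file back.)

import os
import pandas as pd

output_path = "unfolded_data.csv"  # hypothetical path, for illustration only
row = pd.DataFrame([["Bob", "[Jeremy, Nancy, Laura]", "Jeremy"]],
                   columns=["name", "children", "childName"])
# mode='a' appends; write the header only if the file does not exist yet
row.to_csv(output_path, sep='\t', index=False, mode='a',
           header=not os.path.exists(output_path))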

EDIT3: this is the actual code I'm using to deal with the data linked in the first edit. That might make it easier to answer.


2 Answers:

Answer 0 (score 0):

Try using open and keeping the data in memory; maybe that solves it.

"How do I just append a row to the file without reading it first?"

import csv
from pathlib import Path

# path1 is the file to write, path2 the file to read (names as in the answer)
path1 = Path("/yourfolder/output.csv")
path2 = Path("/yourfolder/input.csv")
with open(path1, 'w', newline='') as f1, open(path2, 'r') as f2:
    file1 = csv.writer(f1)
    file2 = csv.reader(f2)
    i = 0
    for row in file2:
        # output is assumed to be a list holding one new value per input row
        row.insert(1, output[i])
        file1.writerow(row)
        i += 1
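The idea here is that the csv module streams one row at a time, so memory use stays flat regardless of how large the files get, instead of re-loading the whole output into a dataframe on every append.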

Answer 1 (score 0):

I stopped reading the output file and stopped writing it once per source. Instead, for each row of the input data I build a dataframe with the new rows and append it to samples.csv.

Code:

import ast
import os
import pandas as pd

cwd = os.path.abspath(__file__+"/..")
snopes = pd.read_csv(cwd+"/folded_snopes.csv", sep='\t', encoding="latin1")
output_path = cwd+"/samples.csv"

out_header = ["page", "claim", "verdict", "tags", "date", "author", "source_list", "source_url"]
count = len(snopes)
is_first = True

for idx, e in snopes.iterrows():
    print("Row ", idx, " out of ", count)
    entry = e.values.tolist()
    src_lst = ast.literal_eval(entry[6])  # parse the stringified source_list column
    output = pd.DataFrame(columns=out_header)
    for src in src_lst:
        n_entry = entry + [src]
        output.loc[len(output)] = n_entry

    # append this row's unfolded entries; write the header only on the first batch
    output.to_csv(output_path, sep='\t', header=is_first, index=False, mode='a')
    is_first = False
    is_first = False
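Appending with mode='a' means samples.csv is never read back in, so the cost per input row stays constant instead of growing with the size of the output; header=is_first ensures the column names are written only once, at the top of the file.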