I'm currently running a script on Linux. The script reads a CSV of about 6000 rows as a dataframe. Its job is to turn a dataframe such as:
name      children
Bob       [Jeremy, Nancy, Laura]
Jennifer  [Kevin, Aaron]
into:
name      children                childName
Bob       [Jeremy, Nancy, Laura]  Jeremy
Bob       [Jeremy, Nancy, Laura]  Nancy
Bob       [Jeremy, Nancy, Laura]  Laura
Jennifer  [Kevin, Aaron]          Kevin
Jennifer  [Kevin, Aaron]          Aaron
and write it to another file (the original CSV stays unchanged).
Basically it adds a new column and creates one row for each item in the list. Note that I'm working with a dataframe that has 7 columns, but I'm using a smaller example for demonstration purposes. The columns in the actual CSV are all strings, but two of them are lists.
This is my code:
import ast
import os
import pandas as pd

cwd = os.path.abspath(__file__+"/..")
data= pd.read_csv(cwd+"/folded_data.csv", sep='\t', encoding="latin1")
output_path = cwd+"/unfolded_data.csv"

out_header = ["name", "children", "childName"]
count = len(data)
for idx, e in data.iterrows():
    print("Row ",idx," out of ",count)
    entry = e.values.tolist()
    c_lst = ast.literal_eval(entry[1])

    for c in c_lst :
        n_entry = entry + [c]
        if os.path.exists(output_path):
            output = pd.read_csv(output_path, sep='\t', encoding="latin1")
        else:
            output = pd.DataFrame(columns=out_header)

        output.loc[len(output)] = n_entry
        output.to_csv(output_path, sep='\t', index=False)
But I get the following error:
Traceback (most recent call last):
  File "fileUnfold.py", line 31, in <module>
    output.to_csv(output_path, sep='\t', index=False)
  File "/usr/local/lib/python3.5/dist-packages/pandas/core/generic.py", line 3020, in to_csv
    formatter.save()
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/formats/csvs.py", line 172, in save
    self._save()
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/formats/csvs.py", line 288, in _save
    self._save_chunk(start_i, end_i)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/formats/csvs.py", line 315, in _save_chunk
    self.cols, self.writer)
  File "pandas/_libs/writers.pyx", line 75, in pandas._libs.writers.write_csv_rows
MemoryError
Is there another way to do what I want without running into this error?
EDIT: the CSV file, if you want to have a look: https://media.githubusercontent.com/media/lucas0/Annotator/master/annotator/data/folded_snopes.csv
EDIT2: I'm currently using
with open(output_path, 'w+') as f:
    output.to_csv(f, index=False, header=True, sep='\t')
Around row 98 the program starts to slow down noticeably. I'm fairly sure this is because I read the file over and over again as it keeps growing. How can I append a row to the file without reading it?
EDIT3: This is the actual code I use to process the data linked in the first edit. It might make the question easier to answer.
Answer 0 (score: 0)
Try using open() to keep it in memory; that might solve it.
How can I append a row to a file without reading it?
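import csv
from pathlib import Path

# Placeholder folder and file names; replace them with your own.
output_path = Path("/yourfolder/path")
path1 = output_path / "output.csv"  # file to write
path2 = output_path / "input.csv"   # file to read

# `output` is assumed to hold one new value per row of the input file.
with open(path1, 'w', newline='') as f1, open(path2, 'r', newline='') as f2:
    file1 = csv.writer(f1)
    file2 = csv.reader(f2)
    # Insert the i-th new value as the second column of the i-th row.
    for i, row in enumerate(file2):
        row.insert(1, output[i])
        file1.writerow(row)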
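As a minimal sketch of appending without reading, assuming the output is tab-separated like the files in the question: opening the file in append mode ('a') places every write at the end, so the existing contents are never loaded. The helper name append_rows and the demo file name are made up for illustration.

```python
import csv
import os
import tempfile


def append_rows(path, rows, header=None, sep="\t"):
    """Append rows to a delimited file without reading its current contents."""
    # Write the header only when the file is missing or still empty.
    need_header = header is not None and (
        not os.path.exists(path) or os.path.getsize(path) == 0
    )
    # Mode "a" positions every write at the end of the file.
    with open(path, "a", newline="") as f:
        writer = csv.writer(f, delimiter=sep)
        if need_header:
            writer.writerow(header)
        writer.writerows(rows)


# Small demonstration in a temporary directory (hypothetical file name).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "unfolded_demo.csv")
header = ["name", "children", "childName"]
append_rows(demo_path, [["Bob", "['Jeremy', 'Nancy']", "Jeremy"]], header=header)
append_rows(demo_path, [["Bob", "['Jeremy', 'Nancy']", "Nancy"]], header=header)

# Read the file back only to inspect the result of the demo.
with open(demo_path, newline="") as f:
    demo_lines = f.read().splitlines()
```

Each call costs roughly the same no matter how large the file already is, which is the property the read-then-rewrite loop in the question lacks.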
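Answer 1 (score: 0)
I stopped reading the output file, and I stopped writing the file for every single source. Instead, for each row of the input data I build a dataframe containing the new rows and then append it to samples.csv.
Code: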
import ast
import os
import pandas as pd

cwd = os.path.abspath(__file__+"/..")
snopes = pd.read_csv(cwd+"/folded_snopes.csv", sep='\t', encoding="latin1")
output_path = cwd+"/samples.csv"

out_header = ["page", "claim", "verdict", "tags", "date", "author","source_list","source_url"]
count = len(snopes)
is_first = True  # write the header only with the first chunk

for idx, e in snopes.iterrows():
    print("Row ",idx," out of ",count)
    entry = e.values.tolist()
    # Column 6 (source_list) holds a list stored as a string
    src_lst = ast.literal_eval(entry[6])
    # Small dataframe with one row per source of the current input row
    output = pd.DataFrame(columns=out_header)
    for src in src_lst:
        n_entry = entry + [src]
        output.loc[len(output)] = n_entry

    # mode='a' appends to samples.csv without reading it back
    output.to_csv(output_path, sep='\t', header=is_first, index=False, mode='a')
    is_first = False
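The same unfolding can probably be done without an explicit row loop. The sketch below is not the code from this answer; it assumes a pandas version that provides DataFrame.explode (0.25 or later) and uses a toy two-column table in place of the real file. The in-memory buffers are only there to keep the example self-contained.

```python
import ast
import io

import pandas as pd

# Toy stand-in for the tab-separated input; the list column is stored as text.
raw = (
    "name\tchildren\n"
    "Bob\t['Jeremy', 'Nancy', 'Laura']\n"
    "Jennifer\t['Kevin', 'Aaron']\n"
)
data = pd.read_csv(io.StringIO(raw), sep="\t")

# Parse the text into real Python lists in a new column.
data["childName"] = data["children"].apply(ast.literal_eval)

# explode() emits one row per list element and repeats the other columns.
unfolded = data.explode("childName").reset_index(drop=True)

# A single write at the end replaces the per-row read/append cycle.
buffer = io.StringIO()
unfolded.to_csv(buffer, sep="\t", index=False)
```

With the real data, the buffer would be replaced by the output path, and the column passed to literal_eval and explode would be source_list.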