I have a process that continually writes to and appends rows to a CSV file.
I want a Python script that tails the CSV and aggregates the data with pandas. I would probably aggregate every 100 rows and send each aggregated batch somewhere.
Is there a pandas function for this? And is there a way to track which row number the script has processed, so that if I stop it or it crashes, I can start it again and it will pick up where it left off?
Answer 0 (score: 0)
As mentioned, there is no simple built-in way to do this. However, you can combine a simple follow function (see How can I tail a log file in Python?) with pandas to aggregate the data into DataFrames.
We use the follow function to tail the file, appending each line to a list; once the list reaches the specified number of rows, it is converted into a pandas DataFrame. The list is then reset and we keep following the file. As another commenter mentioned, you can write the current line number to disk and read that same file on startup to resume where you left off. An example is below.
import pandas as pd
from io import StringIO
import time
import os


def follow(thefile):
    with open(thefile, 'r') as mfile:
        while True:
            pos = mfile.tell()
            line = mfile.readline()
            if not line or not line.endswith('\n'):
                # Partial or empty read: rewind so the half-written line
                # is re-read in full once the writer finishes it.
                mfile.seek(pos)
                time.sleep(0.1)
                continue
            yield line


if __name__ == "__main__":
    # set the file we want to log the current line to
    log_file = "./current_line"
    # check if the last line processed has been saved
    if os.path.exists(log_file):
        with open(log_file, 'r') as ifile:
            # resume from the first unprocessed line
            start_line = int(ifile.read())
    else:
        # start at the first data row (not the header). If there is no header then set to 0
        start_line = 1
    # set the file we are reading
    myfile = 'test.csv'
    # grab the header row; remove this line if you don't need the header
    header = pd.read_csv(myfile, nrows=0)
    # initialize the list to store the lines in
    lines = []
    # loop through each line in the file
    for nline, line in enumerate(follow(myfile)):
        # skip lines that were already processed in a previous run
        if nline < start_line:
            continue
        # append to the lines list
        lines.append(line)
        # check if we have hit the number of lines we want to handle
        if len(lines) == 100:
            # read the csv from the lines we have processed
            df = pd.read_csv(StringIO(''.join(lines)), header=None)
            # update the header. Delete this row if there is no header
            df.columns = header.columns
            # do something with df
            print(df)
            # reset the lines list
            lines = []
            # record progress only after the batch has actually been handled;
            # store nline + 1 so the next run resumes at the first unprocessed line
            with open(log_file, 'w') as lfile:
                lfile.write(str(nline + 1))
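To see the batch-aggregation step in isolation (without the tailing loop), here is a minimal, self-contained sketch of the same idea: buffer raw CSV lines, then parse and sum them with pandas. The `aggregate_batch` helper, the temp file, and the sample columns `a`/`b` are assumptions for illustration, not part of the answer's code.

```python
import os
import tempfile
from io import StringIO

import pandas as pd


# Hypothetical helper: parse a batch of raw CSV lines (no header) into a
# DataFrame and return its per-column sums.
def aggregate_batch(lines, columns):
    df = pd.read_csv(StringIO(''.join(lines)), header=None)
    df.columns = columns
    return df.sum(numeric_only=True)


# Simulate the producer: write a header and a few data rows to a temp CSV.
path = os.path.join(tempfile.mkdtemp(), 'test.csv')
with open(path, 'w') as f:
    f.write('a,b\n')
    for i in range(5):
        f.write(f'{i},{i * 2}\n')

# Read the header once, then treat the remaining raw lines as one batch,
# exactly as the tailing script does with its `lines` buffer.
header = pd.read_csv(path, nrows=0)
with open(path) as f:
    next(f)                      # skip the header row
    batch = f.readlines()

totals = aggregate_batch(batch, header.columns)
print(totals['a'], totals['b'])  # a: 0+1+2+3+4 = 10, b: twice that = 20
```

In the full script, `aggregate_batch` would be called in place of the `print(df)` placeholder each time 100 lines have accumulated.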