Tailing a live CSV with pandas

Time: 2020-01-16 18:05:57

Tags: python pandas

I have a process that continuously writes and appends rows to a CSV file.

I would like a Python script that tails this CSV and aggregates the data with pandas. I would probably aggregate every 100 rows, and after each batch of 100 send the aggregated data somewhere.

Is there a pandas feature for this? Is there also a way to keep track of which line number the script has processed, so that if I stop it or it crashes, I can start it again and it will pick up where it left off?

1 answer:

Answer 0: (score: 0)

As mentioned, there is no simple built-in way to do this. However, you can combine a simple follow function (see How can I tail a log file in Python?) with pandas to aggregate the data into a DataFrame.

We use the follow function to tail the file, appending each line to a list; once the specified number of lines is reached, the list is converted into a pandas DataFrame. The list is then reset, and we continue following the file. As another commenter mentioned, you can write the current line number to disk, and read that same file back to resume where you left off. An example is below.

import pandas as pd
from io import StringIO
import time
import os

def follow(thefile):
    """Generator that yields complete lines as they are appended to thefile."""
    with open(thefile, 'r') as mfile:
        while True:
            where = mfile.tell()
            line = mfile.readline()
            if not line or not line.endswith('\n'):
                # partial line: rewind so we re-read it once it is complete
                mfile.seek(where)
                time.sleep(0.1)
                continue
            yield line

if __name__ == "__main__":
    # set the file we want to log the current line to
    log_file = "./current_line"

    # check if the last line processed has been saved
    if os.path.exists(log_file):
        with open(log_file, 'r') as ifile:
            # get the line number to resume from
            start_line = int(ifile.read())
    else:
        # set the last line processed to be the first data row (not the header).  If there is no header then set to 0
        start_line = 1

    # set the file we are reading
    myfile = 'test.csv'

    # remove this line if you don't need the header
    header = pd.read_csv(myfile, nrows=0)

    # initialize the list to store the lines in
    lines = []

    # loop through each line in the file
    for nline, line in enumerate(follow(myfile)):
        # if we have already processed this line
        if nline < start_line:
            continue
        # append to the lines list
        lines.append(line)

        # check if we have hit the number of lines we want to handle
        if len(lines) == 100:
            # read the csv from the lines we have processed
            df = pd.read_csv(StringIO(''.join(lines)), header=None)
            # update the header.  Delete this row if there is no header
            df.columns = header.columns

            # do something with df
            print(df)

            # reset the lines list
            lines = []  

            # open the log file and note the next line to process, so a restart
            # resumes on the first unprocessed line rather than repeating one
            with open(log_file, 'w') as lfile:
                lfile.write(str(nline + 1))  # only checkpoint once the batch has actually been handled
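If the CSV already contains a large backlog when the script starts, replaying it line by line through `follow` works but is slow. A complementary sketch: pandas' own `chunksize` argument to `read_csv` can batch the existing rows 100 at a time before you switch over to tailing. (The in-memory CSV, column names, and batch size below are assumptions for illustration, not part of the original answer.)

```python
import io

import pandas as pd

# Stand-in for the existing contents of test.csv (assumed header + 250 data rows)
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(250)) + "\n"

# chunksize makes read_csv yield DataFrames of up to 100 data rows each,
# so the backlog can be aggregated with the same per-batch logic as the tailer
batches = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=100):
    batches.append(chunk)

print([len(b) for b in batches])  # → [100, 100, 50]
```

Each `chunk` is an ordinary DataFrame with the header already applied, so the "do something with df" step from the example above can be reused unchanged for both the backlog and the live tail.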