In Python

Time: 2018-02-15 11:21:37

Tags: python-3.x

I am trying to process a pipe-delimited text file with the following format:

18511|1|2587198|2004-03-31|0|100000|0|1.97|0.49988|100000||||
18511|2|2587198|2004-06-30|0|160000|0|3.2|0.79669|60000|60|||
18511|3|2587198|2004-09-30|0|160000|0|2.17|0.79279|0|0|||
18511|4|2587198|2004-09-30|0|160000|0|1.72|0.79118|0|0|||
18511|5|2587198|2005-03-31|0|0|0|0|0|-160000|-100|||19
18511|1|2587940|2004-03-31|0|240000|0|0.78|0.27327|240000||||
18511|2|2587940|2004-06-30|0|560000|0|1.59|0.63576|320000|133.33||24|
18511|3|2587940|2004-09-30|0|560000|0|1.13|0.50704|0|0|||
18511|4|2587940|2004-09-30|0|560000|0|0.96|0.50704|0|0|||
18511|5|2587940|2005-03-31|0|0|0|0|0|-560000|-100|||14

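For each line, I want to isolate the second field and write the line to a file that has that field as part of its name, e.g. issue1.txt, issue2.txt, where the number is the second field in the file excerpt above. This number can range from 1 to 56. My code looks like this: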

with open('d:\\tmp\issueholding.txt') as f, open('d:\\tmp\issue1.txt', 'w') as out_f1,\
open('d:\\tmp\issue2.txt', 'w') as out_f2,open('d:\\tmp\issue3.txt', 'w') as out_f3,\
open('d:\\tmp\issue4.txt', 'w') as out_f4,open('d:\\tmp\issue5.txt', 'w') as out_f5,\
open('d:\\tmp\issue6.txt', 'w') as out_f6,open('d:\\tmp\issue7.txt', 'w') as out_f7,\
open('d:\\tmp\issue8.txt', 'w') as out_f8,open('d:\\tmp\issue9.txt', 'w') as out_f9,\
open('d:\\tmp\issue10.txt', 'w') as out_f10,open('d:\\tmp\issue11.txt', 'w') as out_f11,\
open('d:\\tmp\issue12.txt', 'w') as out_f12,open('d:\\tmp\issue13.txt', 'w') as out_f13,\
open('d:\\tmp\issue14.txt', 'w') as out_f14,open('d:\\tmp\issue15.txt', 'w') as out_f15,\
open('d:\\tmp\issue16.txt', 'w') as out_f16,open('d:\\tmp\issue17.txt', 'w') as out_f17,\
open('d:\\tmp\issue18.txt', 'w') as out_f18,open('d:\\tmp\issue19.txt', 'w') as out_f19,\
open('d:\\tmp\issue20.txt', 'w') as out_f20,open('d:\\tmp\issue21.txt', 'w') as out_f21,\
open('d:\\tmp\issue22.txt', 'w') as out_f22,open('d:\\tmp\issue23.txt', 'w') as out_f23,\
open('d:\\tmp\issue24.txt', 'w') as out_f24,open('d:\\tmp\issue25.txt', 'w') as out_f25,\
open('d:\\tmp\issue32.txt', 'w') as out_f32,open('d:\\tmp\issue33.txt', 'w') as out_f33,\
open('d:\\tmp\issue34.txt', 'w') as out_f34,open('d:\\tmp\issue35.txt', 'w') as out_f35,\
open('d:\\tmp\issue36.txt', 'w') as out_f36,open('d:\\tmp\issue37.txt', 'w') as out_f37,\
open('d:\\tmp\issue38.txt', 'w') as out_f38,open('d:\\tmp\issue39.txt', 'w') as out_f39,\
open('d:\\tmp\issue40.txt', 'w') as out_f40,open('d:\\tmp\issue41.txt', 'w') as out_f41,\
open('d:\\tmp\issue42.txt', 'w') as out_f42,open('d:\\tmp\issue43.txt', 'w') as out_f43,\
open('d:\\tmp\issue44.txt', 'w') as out_f44,open('d:\\tmp\issue45.txt', 'w') as out_f45,\
open('d:\\tmp\issue46.txt', 'w') as out_f46,open('d:\\tmp\issue47.txt', 'w') as out_f47,\
open('d:\\tmp\issue48.txt', 'w') as out_f48,open('d:\\tmp\issue49.txt', 'w') as out_f49,\
open('d:\\tmp\issue50.txt', 'w') as out_f50,open('d:\\tmp\issue51.txt', 'w') as out_f51,\
open('d:\\tmp\issue52.txt', 'w') as out_f52,open('d:\\tmp\issue53.txt', 'w') as out_f53,\
open('d:\\tmp\issue54.txt', 'w') as out_f54,open('d:\\tmp\issue55.txt', 'w') as out_f55,\
open('d:\\tmp\issue56.txt', 'w') as out_f56:
    for line in f:
        field1_end = line.find('|') +1
        field2_end = line.find('|',field1_end)
        f2=line[field1_end:field2_end]
        out_f56.write(line)

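My two questions are:

1) When I try to run the above, I get the following error message:

File "", line unknown SyntaxError: too many statically nested blocks

2) How do I change the line out_f56.write(line) so that I can use the variable f2 as part of the file descriptor instead of hard-coding it?

I am running this in a Jupyter notebook with Python 3 on Windows. To be clear, the input file has about 235 million records, so performance is critical.

Any help or suggestions are appreciated.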

1 Answer:

Answer 0: (score: 1)

Try something like this (see the comments in the code for an explanation):

with open(R"d:\tmp\issueholding.txt") as f:
    for line in f:
        # splitting line into list of strings at '|' character
        fields = line.split('|')

        # defining output file name according to issue code in second field
        # NB: list-indexes are zero-based, therefore use 1
        out_name = R"d:\tmp\issue%s.txt" % fields[1]

        # opening output file and writing current line to it
        # NB: make sure you use the 'a+' mode to append to existing file
        with open(out_name, 'a+') as ff:
            ff.write(line)

To avoid repeatedly opening files inside the read loop, you can do the following:

from collections import defaultdict

with open(R"D:\tmp\issueholding.txt") as f:

    # setting up dictionary to hold lines grouped by issue code
    # using a defaultdict here to automatically create a list when inserting
    # the first item
    collected_issues = defaultdict(list)

    for line in f:
        # splitting line into list of strings at '|' character and retrieving
        # current issue code from second token
        issue_code = line.split('|')[1]
        # appending current line to list of collected lines associated with
        # current issue code
        collected_issues[issue_code].append(line)
    # writing the collected lines once the whole input has been read
    for issue_code in collected_issues:
        # defining output file name according to issue code
        out_name = R"D:\tmp\issue%s.txt" % issue_code
        # opening output file and writing collected lines to it
        with open(out_name, 'a+') as ff:
            ff.write("".join(collected_issues[issue_code]))

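This will of course build an in-memory dictionary containing every line read from the input file. Given your specifications (about 235 million records), that is most likely not feasible on your machine. An alternative is to split the input and process it chunk by chunk. In code, this can be done with a helper function that reads a fixed number of lines (here: 1000) from the input file at a time. A possible final solution could look like this: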

from itertools import islice
from collections import defaultdict


def get_chunk_of_lines(file, N):
    """
    Retrieves N lines from specified opened file.
    """
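    # keeping the trailing newline of each line so that the exported lines
    # stay separated when they are joined and written out
    return list(islice(file, N))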


def collect_issues(lines):
    """
    Collects and groups issues from specified lines.
    """
    collected_issues = defaultdict(list)

    for line in lines:
        # splitting line into list of strings at '|' character and retrieving
        # current issue code from second token
        issue_code = line.split('|')[1]
        # appending current line to list of collected lines associated with
        # current issue code
        collected_issues[issue_code].append(line)

    return collected_issues


def export_grouped_issues(issues):
    """
    Exports collected and grouped issues.
    """
    for issue_code in issues:
        # defining output file name according to issue code
        out_name = R"D:\tmp\issue%s.txt" % issue_code
        # opening output file and writing collected lines to it
        with open(out_name, 'a+') as f:
            f.write("".join(issues[issue_code]))


with open(R"D:\tmp\issueholding.txt") as issue_src:

    chunk_cnt = 0

    while True:
        # retrieving 1000 input lines at a time
        line_chunk = get_chunk_of_lines(issue_src, 1000)

        # exiting while loop if no more chunk is left
        if not line_chunk:
            break

        chunk_cnt += 1
        print("+ Working on chunk %d" % chunk_cnt)

        # collecting, grouping and exporting issues
        issues = collect_issues(line_chunk)
        export_grouped_issues(issues)
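
As a side note on the two original questions: as far as I know, the SyntaxError comes from CPython limiting statically nested blocks to 20, and every context manager in a single with statement counts as one nested block. A variable such as f2 also cannot be spliced into a name like out_f56, but it can be used as a dictionary key that maps to an open file. The following is only a sketch of that idea, not a tested solution for the full 235 million records. The function name split_by_issue, its parameters, the 'w' mode and the decision to skip lines with fewer than two delimiters are my assumptions; contextlib.ExitStack is the standard-library tool for holding a variable number of open files.

```python
from contextlib import ExitStack
from pathlib import Path


def split_by_issue(src_path, out_dir, prefix="issue"):
    """Write each line of src_path to <out_dir>/<prefix><code>.txt.

    <code> is the second pipe-delimited field of the line.
    Returns the number of lines written per issue code.
    """
    out_dir = Path(out_dir)
    handles = {}  # issue code -> open output file
    counts = {}   # issue code -> number of lines written

    with ExitStack() as stack, open(src_path) as src:
        for line in src:
            # maxsplit=2 stops splitting once the second field is isolated
            parts = line.split("|", 2)
            if len(parts) < 3:
                # assumption: skip blank lines and lines with fewer than
                # two delimiters instead of raising an error
                continue
            code = parts[1]
            out = handles.get(code)
            if out is None:
                # first time this code is seen: open its file once and let
                # the ExitStack close it when the with block ends
                # (assumption: 'w' mode, so existing output is overwritten)
                out = stack.enter_context(
                    open(out_dir / f"{prefix}{code}.txt", "w")
                )
                handles[code] = out
                counts[code] = 0
            out.write(line)
            counts[code] += 1

    return counts
```

With the paths from the question this would be called as split_by_issue(R"d:\tmp\issueholding.txt", R"d:\tmp"). Each output file is opened exactly once and the input is streamed line by line, so at most 56 file handles and one input line are held at a time.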