Question

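I want to process a text file line by line. An (initially unknown) number of consecutive lines belong to the same entity, i.e. they carry the same identifier. For example: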

line1: stuff, stuff2, stuff3, ID1, stuff4, stuff5
line2: stuff, stuff2, stuff3, ID1, stuff4, stuff5    
line3: stuff, stuff2, stuff3, ID1, stuff4, stuff5
line4: stuff, stuff2, stuff3, ID2, stuff4, stuff5
line5: stuff, stuff2, stuff3, ID2, stuff4, stuff5
...

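In this dummy example, lines 1-3 belong to entity ID1 and lines 4-5 to ID2. I want to read each of these lines as a dictionary and then nest them in a dictionary that contains all the dictionaries for IDX (e.g. a dictionary ID1 holding the 3 nested dictionaries for lines 1-3).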

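More specifically, I would like to define a function that:

opens the file
reads all (but only) the lines of entity ID1 into individual dictionaries
returns the dictionary that contains the nested dictionaries for the lines of ID1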

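I want to be able to call the function again later to read all lines of the following identifier (ID2) into the next dictionary, and later ID3, and so on. One problem I have is that I need to test on every line whether the current line still carries the ID of interest or already a new one. If it is a new one, I can certainly stop and return the dictionary, but in the next round (say, ID2) the first line of ID2 has already been read, so I seem to lose that line.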

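In other words: once a line with a new ID is encountered, I would like to somehow reset the counter in the function so that, in the next iteration, the first line with the new ID is not lost.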

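This seems like a straightforward task, but I cannot find a way to do it elegantly. I currently pass some "memory" (flags/variables) between functions in order to keep track of whether the first line of a new ID has already been read in a previous iteration. This is quite bulky and error-prone.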

Thanks for reading... any ideas/hints are highly appreciated. If something is unclear, please ask.

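Here is my "solution". It seems to work in the sense that it prints the dictionaries correctly (although I am sure there is a more elegant way to do this). I also forgot to mention that the text file is very large, so I want to process it ID by ID rather than read the whole file into memory.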

with open(infile, "r") as f:
    newIDLine = None
    for line in f:
        if not line:
            break
        # the following function returns the ID
        ID = get_ID_from_line(line)
        counter = 1
        ID_Dic = dict()
        # if first line is completely new (i.e. first line in infile)
        if newIDLine is None:
            currID = ID
            # the following function returns the line as a dic
            ID_Dic[counter] = process_line(line)
        # if first line of new ID was already read in
        # the previous "while" iteration (see below).
        if newIDLine is not None:
            # if the current "line" has the same ID as the
            # previous one: put previous and current line in
            # the same dic and start the while loop.
            if ID == oldID:
                ID_Dic[counter] = process_line(newIDLine)
                counter += 1
                ID_Dic[counter] = process_line(line)
                currID = ID
        # iterate over the following lines until file end or
        # new ID starts. In the latter case: keep the info in
        # objects newIDline and oldID
        while True:
            # default of None avoids an uncaught StopIteration at end of file
            newLine = next(f, None)
            if not newLine:
                break
            ID = get_ID_from_line(newLine)
            if ID == currID:
                counter += 1
                ID_Dic[counter] = process_line(newLine)
            # new ID; save line for the upcoming ID dic
            if not ID == currID:
                newIDLine = newLine
                oldID = ID
                break
        # at this point it would be great to return the dictionary of
        # the current ID to the calling function and, on the next call,
        # continue where I left off.
        print(ID_Dic)

Answer 1

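You can use a dictionary to keep track of all the IDX columns and simply append each line to the list stored under its IDX key, something like: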

from collections import defaultdict
import csv

all_lines_dict = defaultdict(list)

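with open('your_file') as f:
    # skipinitialspace drops the blank that follows each comma in the sample data
    csv_reader = csv.reader(f, skipinitialspace=True)
    for line_list in csv_reader:
        # column 3 holds the ID in the sample layout
        all_lines_dict[line_list[3]].append(line_list)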

The csv reader is part of the Python standard library and makes it easy to read CSV files. It reads each line as a list of columns.

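This differs from what you asked for in that each key holds not a dictionary of dictionaries but a list of the lines that share that IDX key.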
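If the nested-dictionary shape from the question is still wanted, the grouped lists can be converted afterwards. A minimal sketch, assuming a small made-up sample in place of the real file and 1-based counters as keys (as in the question's ID_Dic); the rows stay lists of columns here, and everything is still held in memory:

```python
from collections import defaultdict
import csv
import io

# Hypothetical sample standing in for the real file.
sample = io.StringIO(
    "a, b, c, ID1, d, e\n"
    "f, g, h, ID1, i, j\n"
    "k, l, m, ID2, n, o\n"
)

all_lines_dict = defaultdict(list)
for row in csv.reader(sample, skipinitialspace=True):
    all_lines_dict[row[3]].append(row)

# Turn each list of rows into {1: row, 2: row, ...}.
nested = {
    entity_id: dict(enumerate(rows, start=1))
    for entity_id, rows in all_lines_dict.items()
}
```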

Answer 2

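If you want this function to lazily return a dict for each id, you should make it a generator function by using yield instead of return. At the end of each id, yield the dict for that id. Then you can iterate over that generator.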
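A minimal sketch of such a generator function, assuming simple stand-ins for the question's two helpers (the ID is taken from the fourth comma-separated field, and a line becomes a dict of its fields). It is not the approach developed in the rest of this answer: it avoids losing the first line of a new ID simply by starting the next dictionary with it.

```python
import io

def get_id_from_line(line):
    # Stand-in: the ID is the fourth comma-separated field.
    return line.split(",")[3].strip()

def process_line(line):
    # Stand-in: map field position to field value.
    return dict(enumerate(field.strip() for field in line.split(",")))

def dicts_by_id(lines):
    """Lazily yield one {counter: line_dict} dictionary per run of equal IDs."""
    current_id = None
    id_dict = {}
    for line in lines:
        line_id = get_id_from_line(line)
        if id_dict and line_id != current_id:
            yield id_dict      # the finished group goes to the caller
            id_dict = {}       # the current line opens the next group
        current_id = line_id
        id_dict[len(id_dict) + 1] = process_line(line)
    if id_dict:
        yield id_dict          # last group at end of input

# Hypothetical sample standing in for the real file.
sample = io.StringIO(
    "a, b, c, ID1, d, e\n"
    "f, g, h, ID1, i, j\n"
    "k, l, m, ID2, n, o\n"
)
groups = list(dicts_by_id(sample))
```

With a real file, the open file object would be passed instead of the sample, and the generator would read only as far as the caller iterates.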

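To handle the file, write a generator function that iterates over a source unless you send it a value, in which case it returns that value next and then goes back to iterating. (For example, here is a module I wrote for myself that does this: politer.py.)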
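The real politer.py is not reproduced here. Purely as an illustration of the idea, here is a sketch of a hypothetical wrapper; the name PushbackIterator and its internals are assumptions, and it uses a small class rather than a generator function for simplicity:

```python
class PushbackIterator:
    """Wrap an iterator; values passed to send() are returned again first."""

    def __init__(self, iterable):
        self._source = iter(iterable)
        self._pending = []

    def __iter__(self):
        return self

    def __next__(self):
        if self._pending:
            return self._pending.pop()
        return next(self._source)

    def send(self, value):
        # Remember the value so that the next call to next() returns it.
        self._pending.append(value)

it = PushbackIterator(["a", "b", "c"])
first = next(it)     # consumes "a"
second = next(it)    # consumes "b"
it.send(second)      # hand "b" back
rest = list(it)      # "b" comes out again, followed by "c"
```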

You can then solve this problem easily by sending the value "back" whenever you don't want it:

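def read_by_id(infile):
    # read_by_id is an illustrative name; politer comes from the politer.py
    # module mentioned above (a value passed to send() is handed out again next).
    with open(infile, 'r') as f:
        polite_f = politer(f)
        current_id = None
        while True:
            id_dict = {}
            for i, line in enumerate(polite_f):
                line_id = get_id_from_line(line)
                if line_id != current_id:
                    # push the line back so the next group starts with it
                    polite_f.send(line)
                    break
                id_dict[i] = process_line(line)
            else:
                # source exhausted: emit the last group (if any) and stop
                if id_dict:
                    yield id_dict
                return
            if current_id is not None:
                yield id_dict
            current_id = line_id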

Note that this takes the state handling and abstracts it away into the generator, where it belongs.
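The standard library offers a similar abstraction: itertools.groupby groups consecutive items that share a key and keeps the one-item lookahead internally. This is a different technique from the one used in this answer; a sketch, again assuming simple stand-ins for the question's helpers and a made-up sample in place of the file:

```python
import io
from itertools import groupby

def get_id_from_line(line):
    # Stand-in: the ID is the fourth comma-separated field.
    return line.split(",")[3].strip()

def process_line(line):
    # Stand-in: map field position to field value.
    return dict(enumerate(field.strip() for field in line.split(",")))

def dicts_by_id(lines):
    """Lazily yield (id, {counter: line_dict}) for each run of equal IDs."""
    for entity_id, run in groupby(lines, key=get_id_from_line):
        yield entity_id, {
            n: process_line(line) for n, line in enumerate(run, start=1)
        }

# Hypothetical sample standing in for the real file.
sample = io.StringIO(
    "a, b, c, ID1, d, e\n"
    "f, g, h, ID1, i, j\n"
    "k, l, m, ID2, n, o\n"
)
result = list(dicts_by_id(sample))
```

Because groupby only compares neighbouring lines, this relies on all lines of one ID being consecutive, as in the question.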

Read an initially unknown number N of lines from a file into a nested dictionary and start the next iteration at line N+1
