Question

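I am trying to write an annotator that loops over a list of names and tags a separate document wherever those names occur. The names can consist of one or two words.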

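The buffer part of the program works: it recognises whether it needs to look at one or two lines of the file for tagging, and it tags them when the name that appears exactly matches the candidate.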

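However, instead of looping over all the names in the list for each candidate, it takes only the name for that particular turn of the loop. If that name does not match the candidate, it writes the line out and moves on to the next line (and to the next name in the list). This obviously leaves many names in the file untagged when they should have been tagged.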
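To make the symptom concrete, here is a toy sketch of the behaviour described above. It is not the real tagger: the lemma list, the name list and the function `tag_lockstep` are invented for illustration, and the buffer is reduced to a position in a list. It only mimics the fact that each name check is followed by one step forward in the file.

```python
# Toy illustration only: the lemmas and names below are invented sample data,
# and tag_lockstep is a hypothetical stand-in for the real buffer logic.
lemmas = ["homo-n", "sapiens-n", "walk-v", "canis-n", "lupus-n"]
names = ["canis lupus", "homo sapiens"]


def tag_lockstep(lemmas, names):
    """Mimic the current loop: every name check is followed by one advance."""
    tagged = []
    pos = 0
    while pos < len(lemmas):
        for name in names:
            if pos >= len(lemmas):
                break
            target = [part + "-n" for part in name.split()]
            if lemmas[pos:pos + len(target)] == target:
                tagged.append((pos, name))
            pos += 1  # the position moves on after every single name
    return tagged


print(tag_lockstep(lemmas, names))
```

With this sample data neither name is found, because each position is only ever compared with whichever name happens to be current on that turn.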

Here is my code:

import json
from tagging import import_names


def split_line(line):
    """Split a line into four parts, word, pos, lemma and tag."""
    # TODO: Speak to Diana about the spaces in the vert file - do they mean
    # anything?
    line = line.strip().split()
    if len(line) == 1:
        word = line[0]
        pos, lemma, tag = None, None, None
    elif len(line) == 3:
        word, pos, lemma = line
        tag = ''
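    elif len(line) == 4:
        word, pos, lemma, tag = line
    else:
        # any other field count would otherwise leave the names unbound
        raise ValueError("unexpected number of fields: %r" % (line,))
    return [word, pos, lemma, tag]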


class MWUTagger(object):
    """Contains a buffer of lines split into word, pos, lemma, tag items."""
    def __init__(self, f_in, f_out, n, gnrd_file, indicators=None):
        """Populate the buffer."""
        # read the input vert file
        self.f_in = open(f_in, 'r')
        # populate the buffer (first n lines of the vert file)
        self.buffer = []
        for i in range(n):
            self.buffer.append(split_line(self.f_in.readline()))
        # read in list of names or save
        self.names = import_names(gnrd_file)
        # create the output vert file
        self.f_out = f_out

    def __iter__(self):
        return self

    def write_line(self):
        """Write out the oldest line in the buffer, and add a new line to the buffer."""
        # write the oldest line from the buffer
        tagged_line = self.buffer.pop(0)
        tagged_line = [i for i in tagged_line if i]
        with open(self.f_out, 'a') as f:
            if tagged_line[0].startswith('<') and tagged_line[-1].endswith('>'):
                f.write(' '.join(tagged_line) + '\n')
            else:
                f.write('\t'.join(tagged_line) + '\n')

    def __next__(self):
        """write out the oldest line in the buffer and add a new line to the buffer"""
        #write the oldest line from the buffer
        self.write_line()
        # add a new line to the buffer (found an example here https://bufferoverflow.com/a/14797993/1706564)
        line = self.f_in.readline()
        if line:
            self.buffer.append(split_line(line))
        else:
            self.f_in.close()
            self.flush()
            raise StopIteration

    def flush(self):
        """Write all remaining lines from buffer file to the output file"""
        while self.buffer:
            self.write_line()

    def check_for_name(self, name):
        """Depending on length of name, check if the first n items in the buffer
        match name."""
        # check if tagged
        if self.buffer[0][-1] == 'SCI':
            return
        name = name.strip().split()
        name = [n + '-n' for n in name]
        n = len(name)
        # check if they match
        candidate = [line[2] for line in self.buffer[:n]]
        if name == candidate:
            # edit the tags in the first n items in the buffer if they do
            for i in range(n):
                self.buffer[i][-1] += "SCI%i" % (i + 1)
        # check if other names in the dictionary match 

def main():
    mwutagger = MWUTagger('zenodo_test_untag.vert', 'zenodomwutagged.vert', 2, 'JSON_file_test.json')
    while True:
        try:
            for name in mwutagger.names:
                mwutagger.check_for_name(name)
                mwutagger.__next__()
        except StopIteration:
            break

if __name__ == '__main__':
    main()

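I am not sure whether I need to add something to the check_for_name method to say that if candidate != name, it should move on to the next name until the end of the list and only then write the line out, or whether this is simply not being handled adequately in the main function.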

Can anyone advise on this?
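One possible direction, sketched under assumptions rather than tested against the real files: check every name against the current position first, and only then move on by one line. The function `tag_all_names` and the sample data are hypothetical. The real class works on a file-backed buffer, while this sketch uses a plain list of lemmas to show the loop order.

```python
# Hypothetical sketch: invented sample data, plain lists instead of the buffer.
lemmas = ["homo-n", "sapiens-n", "walk-v", "canis-n", "lupus-n"]
names = ["canis lupus", "homo sapiens"]


def tag_all_names(lemmas, names):
    """Try every name at each position before advancing to the next one."""
    targets = [[part + "-n" for part in name.split()] for name in names]
    tags = [""] * len(lemmas)
    for pos in range(len(lemmas)):
        if tags[pos]:
            continue  # already tagged as part of an earlier match
        for target in targets:
            n = len(target)
            if lemmas[pos:pos + n] == target:
                for i in range(n):
                    tags[pos + i] = "SCI%i" % (i + 1)
                break  # stop at the first matching name for this position
    return tags


print(tag_all_names(lemmas, names))
# ['SCI1', 'SCI2', '', 'SCI1', 'SCI2']
```

In terms of the code in the question, this would probably mean running the whole `for name in mwutagger.names` loop with only `check_for_name(name)` inside it, then calling `next(mwutagger)` once after that loop finishes. It may also be worth checking the `== 'SCI'` guard in `check_for_name`, since the tags written are `SCI1` and `SCI2` and would never equal `'SCI'` exactly.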

Python for loop not iterating as expected

0 answers: