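I am trying to write an annotator that loops over a list of names and tags a separate document wherever those names occur. The names can be one or two words long.
The program works on a buffer, so it recognizes whether it needs to look at one or two lines of the file for tagging, and it tags them when the names that occur exactly match the candidate.
However, instead of looping over all the names in the list for each candidate, it takes only the name for that particular turn of the loop, and if that name does not match the candidate, it writes the line out and moves on to the next line (and the next name in the list). This obviously leaves many names in the file untagged when they should be tagged.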
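To show the symptom in isolation, here is a stripped-down toy version that uses made-up lemmas instead of my real vert file. The helper names (`matches`, `tag_advancing_per_name`) and the data are hypothetical and only stand in for the real program; the sketch assumes the buffer advances by one line after every single name check, which is what I seem to be getting.

```python
# Toy reproduction with made-up data; none of these names exist in my program.
lemmas = ["quercus-n", "robur-n", "grow-v", "fagus-n"]
names = ["fagus", "quercus robur"]


def matches(name, window):
    """Return True if the name's lemmas equal the start of the window."""
    target = [part + "-n" for part in name.split()]
    return window[:len(target)] == target


def tag_advancing_per_name(lemmas, names):
    """Move on to the next line after every single name check."""
    tagged = []
    pos = 0
    while pos < len(lemmas):
        for name in names:
            if pos >= len(lemmas):
                break
            # look at a window of at most two lines, like a buffer of size 2
            if matches(name, lemmas[pos:pos + 2]):
                tagged.append((pos, name))
            pos += 1  # the line is "written out" even though other names are untested
    return tagged


print(tag_advancing_per_name(lemmas, names))  # []
```

Both names occur in the toy data, but neither is found, because each line is only ever compared with whichever name happens to be next in the list.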
Here is my code:
import json
from tagging import import_names


def split_line(line):
    """Split a line into four parts, word, pos, lemma and tag."""
    # TODO: Speak to Diana about the spaces in the vert file - do they mean
    # anything?
    line = line.strip().split()
    if len(line) == 1:
        word = line[0]
        pos, lemma, tag = None, None, None
    elif len(line) == 3:
        word, pos, lemma = line
        tag = ''
    elif len(line) == 4:
        word, pos, lemma, tag = line
    else:
        # any other field count would otherwise leave the names below unbound
        raise ValueError('unexpected number of fields: %r' % (line,))
    return [word, pos, lemma, tag]


class MWUTagger(object):
    """Contains a buffer of lines split into word, pos, lemma, tag items."""

    def __init__(self, f_in, f_out, n, gnrd_file, indicators=None):
        """Populate the buffer."""
        # read the input vert file
        self.f_in = open(f_in, 'r')
        # populate the buffer (first n lines of the vert file)
        self.buffer = []
        for i in range(n):
            self.buffer.append(split_line(self.f_in.readline()))
        # read in list of names or save
        self.names = import_names(gnrd_file)
        # create the output vert file
        self.f_out = f_out

    def __iter__(self):
        return self

    def write_line(self):
        """Write out the oldest line in the buffer, and add a new line to the buffer."""
        # write the oldest line from the buffer
        tagged_line = self.buffer.pop(0)
        tagged_line = [i for i in tagged_line if i]
        with open(self.f_out, 'a') as f:
            if tagged_line[0].startswith('<') and tagged_line[-1].endswith('>'):
                f.write(' '.join(tagged_line) + '\n')
            else:
                f.write('\t'.join(tagged_line) + '\n')

    def __next__(self):
        """write out the oldest line in the buffer and add a new line to the buffer"""
        # write the oldest line from the buffer
        self.write_line()
        # add a new line to the buffer (found an example here https://bufferoverflow.com/a/14797993/1706564)
        line = self.f_in.readline()
        if line:
            self.buffer.append(split_line(line))
        else:
            self.f_in.close()
            self.flush()
            raise StopIteration

    def flush(self):
        """Write all remaining lines from buffer file to the output file"""
        while self.buffer:
            self.write_line()

    def check_for_name(self, name):
        """Depending on length of name, check if the first n items in the buffer
        match name."""
        # check if tagged
        if self.buffer[0][-1] == 'SCI':
            return
        name = name.strip().split()
        name = [n + '-n' for n in name]
        n = len(name)
        # check if they match
        candidate = [line[2] for line in self.buffer[:n]]
        if name == candidate:
            # edit the tags in the first n items in the buffer if they do
            for i in range(n):
                self.buffer[i][-1] += "SCI%i" % (i + 1)
        # check if other names in the dictionary match


def main():
    mwutagger = MWUTagger('zenodo_test_untag.vert', 'zenodomwutagged.vert', 2, 'JSON_file_test.json')
    while True:
        try:
            for name in mwutagger.names:
                mwutagger.check_for_name(name)
                mwutagger.__next__()
        except StopIteration:
            break


if __name__ == '__main__':
    main()

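I am not sure whether I need to add something to the check_for_name method to say "if candidate != name, move on to the next name, keep going until the end of the list, and only then write the line out", or whether the problem is that this is not being handled properly in the main function.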
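To make the behaviour I am after concrete, here is a minimal sketch on in-memory rows rather than files. It is not a tested fix for my program: `tag_rows` and the sample rows are hypothetical, and I am assuming the tag field is always a string. The idea is that every name is tried against the current buffer, and the oldest line is written out only after the whole list has been checked.

```python
def tag_rows(rows, names, n=2):
    """Try every name against the buffer before writing a line out.

    rows  -- list of [word, pos, lemma, tag] lists (hypothetical sample data)
    names -- iterable of one- or two-word names
    n     -- buffer size, i.e. the longest name measured in words
    """
    pending = list(rows)
    buffer = [pending.pop(0) for _ in range(min(n, len(pending)))]
    out = []
    while buffer:
        for name in names:
            if buffer[0][-1].startswith("SCI"):
                break  # already tagged as part of an earlier two-word match
            target = [part + "-n" for part in name.strip().split()]
            candidate = [row[2] for row in buffer[:len(target)]]
            if candidate == target:
                for i in range(len(target)):
                    buffer[i][-1] += "SCI%i" % (i + 1)
                break  # stop at the first name that matches
        # only now, after the names have been tried, emit the oldest line
        out.append(buffer.pop(0))
        if pending:
            buffer.append(pending.pop(0))
    return out


rows = [
    ["Quercus", "NN", "quercus-n", ""],
    ["robur", "NN", "robur-n", ""],
    ["grows", "VB", "grow-v", ""],
    ["Fagus", "NN", "fagus-n", ""],
]
names = ["fagus", "quercus robur"]
tagged = tag_rows(rows, names)
print([row[-1] for row in tagged])  # ['SCI1', 'SCI2', '', 'SCI1']
```

In this sketch the advance step sits outside the loop over names, so the order of the names in the list no longer decides which lines get compared with which name.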
Can anyone advise on this?