Question

I have a file like this, containing sentences that are delimited by BOS (beginning of sentence) and EOS (end of sentence) markers:

BOS 1
1 word \t\t word \t word \t\t word \t 123
1 word \t\t word \t word \t\t word \t 234
1 word \t\t word \t word \t\t word \t 567
EOS 1

BOS 2
2 word \t\t word \t word \t\t word \t 456
2 word \t\t word \t word \t\t word \t 789
EOS 2

And a second file, where the first number on each line is the sentence number:

1, 123, 567
2, 789

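I want to read both files and check whether the number at the end of each line in the first file appears in the second file. If it does, I want to change only the fourth word on that line of the first file. So the expected output is: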

BOS 1
1 word \t\t word \t word \t\t NEW_WORD \t 123
1 word \t\t word \t word \t\t word \t 234
1 word \t\t word \t word \t\t NEW_WORD \t 567
EOS 1

BOS 2
2 word \t\t word \t word \t\t word \t 456
2 word \t\t word \t word \t\t NEW_WORD \t 789
EOS 2

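First, I'm not sure how to read the two files, since they have different numbers of lines. Second, I don't know how to iterate over the lines of, say, the first sentence in the first file while also iterating over the values on the first line of the second file to compare them. This is what I have so far: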

def readText(filename1, filename2):
  data1 = open(filename1).readlines()   # the first file

  data2 = open(filename2).readlines() # the second one

  list2 = [] # a list to store the values of the second file

  for line1, line2 in itertools.izip(data1, data2):
    l1 = line1.split()

    l2 = line2.split(', ')

    find = re.findall(r'.*word\t\d\d\d', line1) # find the fourth word in a line, followed by a number

    for l in l2:
      list2.append(l)

    for match in find:
      m = match.split() # split the lines of the first file

      if (m[0] == list2[0]): # for the same sentence number in the two files 
        result = re.sub(r'(.*)word\t%s' %m[5], r'\1NEW_WORD\t%s' %m[5],line1) 

if len(sys.argv)==3: 
  lines = readText(sys.argv[1], sys.argv[2])
else:
  print("file.py inputfile1 inputfile2")

Thanks in advance for your help!

Answer 1

For reference, I named the first file source.txt, the second file control.txt, and the output result.txt.
Here is the skeleton of the program.

[modify_line(line) if line[0].isdigit() else line for line in source]

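This code either passes each line through unchanged or modifies it. If a line starts with a digit, it is passed to modify_line, which returns either a modified line or the original line, depending on the line it is given and on data obtained from control.txt. modify_line must get data from control.txt in order to check and change each line passed to it. The data consists of a starting number (the sentence number) and the end numbers, e.g. [1, (123, 567)]. If the starting number matches and one of the end numbers matches, the line is changed. If the starting number does not match, the next starting number is read from the control file; this works because modify_line is only ever passed lines that start with a number.
To keep state between calls, I used a closure here.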

import re

def create_line_modification_function(fp, replacement_word):

    def get_line_number_and_end_numbers():
        for line in fp:
            if line.strip():
                line_number, rest = line.split(',', 1)
                line_number = line_number.strip()
                ends = [end.strip() for end in rest.split(',')]
                yield line_number, ends

    generate_line_numbers_and_ends = get_line_number_and_end_numbers()
    # modify_line needs to change this. So this is in a list
    line_number_and_ends = list(next(generate_line_numbers_and_ends, (None, None)))
    # for safety check if we run out of line numbers in the control file
    if line_number_and_ends[0] is None:
        raise ValueError('{} reached EOF'.format(fp.name))
    # for optimization compile once here
    pattern = re.compile(r'(.*)word(.*\d{3}$)')


    def modify_line(line):
        while True:
            # for convenience unpack the list 
            line_number, ends = line_number_and_ends
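            # Compare whole fields, so sentence 1 cannot match a line of sentence 10
            # and end number 123 cannot match a line ending in 4123.
            fields = line.split()
            if fields[0] == line_number:
                if fields[-1] in ends:
                    return pattern.sub(r'\1{}\2'.format(replacement_word), line)
                return line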
            # If we are here the line numbers from control.txt and source.txt don't match.
            # So we have to read next line from control file
            line_number_and_ends[0], line_number_and_ends[1]  = next(generate_line_numbers_and_ends, (None, None))
            if line_number_and_ends[0] is None:
                raise ValueError('{} reached EOF'.format(fp.name))

    return modify_line

if __name__ == '__main__':

    with open('source.txt') as source, open('control.txt') as ctl, open('result.txt', 'w') as target:
        modify_line = create_line_modification_function(ctl, 'NEW_WORD')
        target.writelines(modify_line(line) if line[0].isdigit() else line for line in source)
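
If the control file is small enough to load into memory, the same check can probably be done without keeping the two files in step. The following is only a sketch of that alternative, not the approach used above: the names load_control, modify_lines and rewrite_file are made up for illustration, and it assumes the word to replace is the last occurrence of "word" on the line, as in the sample data.

```python
def load_control(lines):
    """Map each sentence number to the set of end numbers listed for it."""
    control = {}
    for line in lines:
        if line.strip():
            sentence, *ends = [field.strip() for field in line.split(',')]
            control.setdefault(sentence, set()).update(ends)
    return control


def modify_lines(source_lines, control, replacement_word, target_word='word'):
    """Yield every source line, replacing the last target_word on matching lines."""
    for line in source_lines:
        fields = line.split()
        # A data line starts with the sentence number and ends with the end number.
        if fields and fields[0].isdigit() and fields[-1] in control.get(fields[0], ()):
            head, sep, tail = line.rpartition(target_word)
            if sep:  # target_word was found on this line
                line = head + replacement_word + tail
        yield line


def rewrite_file(source_path, control_path, result_path, replacement_word='NEW_WORD'):
    """Hypothetical driver: read both files and write the modified copy."""
    with open(control_path) as ctl:
        control = load_control(ctl)
    with open(source_path) as source, open(result_path, 'w') as target:
        target.writelines(modify_lines(source, control, replacement_word))
```

Calling rewrite_file('source.txt', 'control.txt', 'result.txt') would then produce the output file. Because the lookup is keyed by sentence number, this version does not depend on the sentences appearing in the same order in both files, at the cost of holding the whole control file in memory.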
