I have a file like this, containing sentences delimited by BOS (beginning of sentence) and EOS (end of sentence):
BOS 1
1 word \t\t word \t word \t\t word \t 123
1 word \t\t word \t word \t\t word \t 234
1 word \t\t word \t word \t\t word \t 567
EOS 1
BOS 2
2 word \t\t word \t word \t\t word \t 456
2 word \t\t word \t word \t\t word \t 789
EOS 2
And a second file, in which the first number on each line is the sentence number:
1, 123, 567
2, 789
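I want to read both files and check whether the number at the end of each line in the first file appears in the second file. If it does, I want to change only the fourth word of that line in the first file. So the expected output is: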
BOS 1
1 word \t\t word \t word \t\t NEW_WORD \t 123
1 word \t\t word \t word \t\t word \t 234
1 word \t\t word \t word \t\t NEW_WORD \t 567
EOS 1
BOS 2
2 word \t\t word \t word \t\t word \t 456
2 word \t\t word \t word \t\t NEW_WORD \t 789
EOS 2
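First, I am not sure how to read the two files, since they have different numbers of lines. Second, I do not know how to iterate over the lines of, say, the first sentence in the first file while also iterating over the values in the first line of the second file to compare them. This is what I have so far: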
import itertools
import re
import sys


def readText(filename1, filename2):
    data1 = open(filename1).readlines()  # the first file
    data2 = open(filename2).readlines()  # the second one
    list2 = []  # a list to store the values of the second file
    # itertools.izip exists only in Python 2; the built-in zip is the Python 3 equivalent
    for line1, line2 in itertools.izip(data1, data2):
        l1 = line1.split()
        l2 = line2.split(', ')
        find = re.findall(r'.*word\t\d\d\d', line1)  # find the fourth word in a line, followed by a number
        for l in l2:
            list2.append(l)
        for match in find:
            m = match.split()  # split the lines of the first file
            if (m[0] == list2[0]):  # for the same sentence number in the two files
                result = re.sub(r'(.*)word\t%s' % m[5], r'\1NEW_WORD\t%s' % m[5], line1)


if len(sys.argv) == 3:
    lines = readText(sys.argv[1], sys.argv[2])
else:
    print("file.py inputfile1 inputfile2")
Thanks in advance for your help!
Answer 0 (score: 0)
For reference, I named the first file source.txt, the second file control.txt, and the output result.txt.
Here is the skeleton of the program.
[modify_line(line) if line[0].isdigit() else line for line in source]
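This code either passes each line through unchanged or modifies it. If a line starts with a digit, it is passed to modify_line, which returns either the modified line or the original line, depending on the line it receives and on data read from control.txt.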
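modify_line has to get data from control.txt in order to check, and possibly change, each line passed to it. That data is a start number plus a list of end numbers, for example [1, (123, 567)]. If the start number matches and one of the end numbers matches, the line is changed. If the start number does not match, the next start number is read from the control file; this works because modify_line is only ever passed lines that start with a number.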
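To make that data shape concrete, here is a minimal sketch of how one line of control.txt could be split into a start number and its end numbers, and how the match rule could be checked for a single source line. The helper names parse_control_line and should_modify are hypothetical and are not part of the code in this answer. The sketch also assumes that numbers are compared as whole whitespace-separated fields, which is stricter than a prefix or suffix test.

```python
def parse_control_line(line):
    """Split a control line such as '1, 123, 567' into ('1', ['123', '567'])."""
    number, rest = line.split(',', 1)
    ends = [end.strip() for end in rest.split(',')]
    return number.strip(), ends


def should_modify(source_line, number, ends):
    """True if the line's first field equals number and its last field is in ends."""
    fields = source_line.split()
    return bool(fields) and fields[0] == number and fields[-1] in ends


# Hypothetical sample data modelled on the files in the question.
number, ends = parse_control_line('1, 123, 567')
print(number, ends)
print(should_modify('1 word\t\tword\tword\t\tword\t123', number, ends))
print(should_modify('1 word\t\tword\tword\t\tword\t234', number, ends))
```

Comparing whole fields means a line ending in 1123 would not be treated as ending in 123, which a plain suffix comparison would accept.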
To keep state between calls, I used a closure here.
import re


def create_line_modification_function(fp, replacement_word):
    def get_line_number_and_end_numbers():
        for line in fp:
            if line.strip():
                line_number, rest = line.split(',', 1)
                line_number = line_number.strip()
                ends = [end.strip() for end in rest.split(',')]
                yield line_number, ends

    generate_line_numbers_and_ends = get_line_number_and_end_numbers()
    # modify_line needs to change this. So this is in a list
    line_number_and_ends = list(next(generate_line_numbers_and_ends, (None, None)))
    # for safety check if we run out of line numbers in the control file
    if line_number_and_ends[0] is None:
        raise ValueError('{} reached EOF'.format(fp.name))
    # for optimization compile once here
    pattern = re.compile(r'(.*)word(.*\d{3}$)')

    def modify_line(line):
        while True:
            # for convenience unpack the list
            line_number, ends = line_number_and_ends
            if line.startswith(line_number):
                for end in ends:
                    if line.rstrip().endswith(end):
                        return pattern.sub(r'\1{}\2'.format(replacement_word), line)
                return line
            # If we are here the line numbers from control.txt and source.txt don't match.
            # So we have to read next line from control file
            line_number_and_ends[0], line_number_and_ends[1] = next(generate_line_numbers_and_ends, (None, None))
            if line_number_and_ends[0] is None:
                raise ValueError('{} reached EOF'.format(fp.name))

    return modify_line


if __name__ == '__main__':
    with open('source.txt') as source, open('control.txt') as ctl, open('result.txt', 'w') as target:
        modify_line = create_line_modification_function(ctl, 'NEW_WORD')
        target.writelines(modify_line(line) if line[0].isdigit() else line for line in source)
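The closure above relies on control.txt listing sentences in the same order as source.txt, and it raises ValueError if the control file runs out first. As an alternative sketch (not the approach used in this answer), the control file could be loaded into a dictionary up front, so that ordering and missing sentences no longer matter. The names load_control, rewrite_lines and TAIL are hypothetical. The sketch assumes that fields are separated by tabs or spaces and that the word to replace is the last field before the trailing number.

```python
import io
import re


def load_control(lines):
    """Map each sentence number to the set of end numbers listed for it."""
    control = {}
    for line in lines:
        if line.strip():
            number, _, rest = line.partition(',')
            ends = {end.strip() for end in rest.split(',') if end.strip()}
            control.setdefault(number.strip(), set()).update(ends)
    return control


# Matches the last field before the trailing number; group 1 keeps the
# separator, the number, and any trailing whitespace so they are preserved.
TAIL = re.compile(r'\S+(\s+\d+\s*)$')


def rewrite_lines(source_lines, control, new_word):
    """Yield each line, replacing the word before the final number when listed."""
    for line in source_lines:
        fields = line.split()
        # BOS/EOS lines fall through: their first field is not a key in control.
        if len(fields) >= 3 and fields[-1] in control.get(fields[0], ()):
            line = TAIL.sub(lambda m: new_word + m.group(1), line, count=1)
        yield line


# In-memory stand-ins for source.txt and control.txt (sample data only).
source = io.StringIO(
    'BOS 1\n'
    '1 word\t\tword\tword\t\tword\t123\n'
    '1 word\t\tword\tword\t\tword\t234\n'
    'EOS 1\n'
)
control = load_control(io.StringIO('1, 123, 567\n2, 789\n'))
result = list(rewrite_lines(source, control, 'NEW_WORD'))
print(''.join(result), end='')
```

With real files, the two io.StringIO objects would be replaced by open file handles, and the generator could be passed straight to target.writelines as in the code above.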