I want to say that Napp Granade
serves in the spirit of a town in our dis-
trict of Georgia called Andersonville.
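I have thousands of text files containing data like the above, in which words have been wrapped across lines with a hyphen and a line break.
What I want to do is remove the hyphen and move the line break to the end of the word. I don't want to remove the hyphen from every hyphenated word, only from those at the end of a line, if possible.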
with open(filename, encoding="utf8") as f:
    file_str = f.read()
re.sub("\s*-\s*", "", file_str)
with open(filename, "w", encoding="utf8") as f:
    f.write(file_str)
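The code above does not work, and I have tried several different approaches.
I want to go through the whole text file and remove every hyphen that marks a line break, for example: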
I want to say that Napp Granade
serves in the spirit of a town in our district
of Georgia called Andersonville.
Any help would be greatly appreciated.
Answer 0 (score: 3)
You don't need a regular expression:
filename = 'test.txt'

# I want to say that Napp Granade
# serves in the spirit of a town in our dis-
# trict of Georgia called Anderson-
# ville.

with open(filename, encoding="utf8") as f:
    lines = [line.strip('\n') for line in f]

for num, line in enumerate(lines):
    # a hyphen on the last line or before a blank line has nothing to join
    if line.endswith('-') and num + 1 < len(lines) and lines[num+1].strip():
        # the end of the word is at the start of next line
        end = lines[num+1].split()[0]
        # we remove the - and append the end of the word
        lines[num] = line[:-1] + end
        # and remove the end of the word and possibly the
        # following space from the next line
        lines[num+1] = lines[num+1][len(end)+1:]

text = '\n'.join(lines)

with open(filename, "w", encoding="utf8") as f:
    f.write(text)

# I want to say that Napp Granade
# serves in the spirit of a town in our district
# of Georgia called Andersonville.
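To try the line-based approach without touching a file, the same logic can be wrapped in a function that takes a list of strings. This is only a sketch: the name join_hyphenated_lines is made up for illustration, and it assumes that every line ending in - is a wrapped word, so a genuine trailing hyphen or dash would be joined too.

```python
def join_hyphenated_lines(lines):
    """Return a copy of lines with words split by a trailing hyphen rejoined.

    Hypothetical helper for illustration, not part of the original answer.
    """
    result = list(lines)
    # Stop one short of the end: the last line has no following line to join
    for num in range(len(result) - 1):
        line = result[num]
        # First whitespace-separated token of the next line, plus the rest
        parts = result[num + 1].split(maxsplit=1)
        if line.endswith('-') and parts:
            # Drop the hyphen and pull the rest of the word up
            result[num] = line[:-1] + parts[0]
            # Leave whatever followed the moved fragment on the next line
            result[num + 1] = parts[1] if len(parts) > 1 else ''
    return result


sample = [
    "serves in the spirit of a town in our dis-",
    "trict of Georgia called Anderson-",
    "ville.",
]
print(join_hyphenated_lines(sample))
```

The last input line ends up empty because its only token moved up to the previous line; whether to drop such empty lines depends on how the files should look afterwards.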
But of course you can use a regular expression, and it is shorter:
import re

with open(filename, encoding="utf8") as f:
    text = f.read()

text = re.sub(r'-\n(\w+ *)', r'\1\n', text)

with open(filename, "w", encoding="utf8") as f:
    f.write(text)
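We look for a hyphen (-) followed by a newline (\n), then capture the following word, which is the end of the split word.
We replace all of that with the captured word followed by a newline.
Don't forget to use a raw string for the replacement, so that \1 is interpreted correctly.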
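A quick way to check the substitution is to run it on a string in memory. Note that re.sub returns a new string and leaves its argument untouched, so the result has to be assigned; the attempt in the question discards it, which is one reason the file comes out unchanged. The pattern below is a variation and not the answer's own: it captures \S+ instead of \w+, on the assumption that punctuation attached to the word fragment (as in ville.) should move up with it. The name dehyphenate is made up for this sketch.

```python
import re


def dehyphenate(text):
    # Hypothetical helper: join words split by "-\n" and put the line break
    # right after the rejoined word. \S+ also takes attached punctuation,
    # and the trailing " *" swallows the spaces that followed the fragment.
    return re.sub(r'-\n(\S+) *', r'\1\n', text)


wrapped = (
    "serves in the spirit of a town in our dis-\n"
    "trict of Georgia called Anderson-\n"
    "ville."
)
print(dehyphenate(wrapped))
```

Hyphens in the middle of a line, as in well-known, are not followed by a newline and so are left alone.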