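I am trying to compare these sentences with each other. For example, I want to check whether BEFORE is the same as BEFORE THE, which it obviously is not. The problem is that I want to iterate across the line breaks, so that BEFORE THE PARLIAMENT ON BRITAIN'S RELATIONS is treated as a single string. Below is a sample file.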
BEFORE

BEFORE THE

BEFORE THE PARLIAMENT

BEFORE THE PARLIAMENT ON

BEFORE THE PARLIAMENT ON
BRITAIN'S

BEFORE THE PARLIAMENT ON
BRITAIN'S RELATIONS

BEFORE THE PARLIAMENT ON
BRITAIN'S RELATIONS WITH
The way I do it now iterates over every line, so a sentence that spans more than one line gets split apart.
with open("test.txt") as f:
    data = f.readlines()
    data = [d.strip().split('\n') for d in data]
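How can I iterate over this file and get each sentence one at a time, instead of iterating over each line?
Answer 0 (score: 2)
Split on double newlines, for example: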
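with open("test.txt") as f:
    data = f.read()
# Split the whole text on blank lines; iterating over `data` directly
# would walk the string character by character.
data = [d.strip() for d in data.split('\n\n')]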
Answer 1 (score: 2)
with open("test.txt") as f:
    text = f.read()
for line in text.split("\n\n"):
    line = line.replace("\n", " ")
    print(line)
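I think this is what you want. Split on two consecutive newlines, then replace the remaining newlines inside each piece with spaces.
Output: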
BEFORE
BEFORE THE
BEFORE THE PARLIAMENT
BEFORE THE PARLIAMENT ON
BEFORE THE PARLIAMENT ON BRITAIN'S
BEFORE THE PARLIAMENT ON BRITAIN'S RELATIONS
BEFORE THE PARLIAMENT ON BRITAIN'S RELATIONS WITH
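If the file is too large to read in one call, the same grouping can be done line by line. The following is only a sketch and is not part of the answer above: `iter_sentences` is a hypothetical helper name, and it assumes that a line containing nothing but whitespace should count as a separator.

```python
def iter_sentences(lines):
    """Yield one sentence per run of consecutive non-blank lines."""
    parts = []
    for line in lines:
        stripped = line.strip()
        if stripped:
            # Still inside a sentence: collect this line.
            parts.append(stripped)
        elif parts:
            # Blank line after some text: the sentence is complete.
            yield ' '.join(parts)
            parts = []
    if parts:
        # The last sentence may not be followed by a blank line.
        yield ' '.join(parts)


# Any iterable of lines works, including an open file object.
sample_lines = [
    "BEFORE\n", "\n",
    "BEFORE THE PARLIAMENT ON\n", "BRITAIN'S\n", "  \n", "\n",
    "BEFORE THE\n",
]
print(list(iter_sentences(sample_lines)))
```

With a real file, the generator can be driven directly by the file object, for example `with open("test.txt") as f:` followed by `for sentence in iter_sentences(f):`, so only one sentence is held in memory at a time.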
Answer 2 (score: 1)
You can split on double newlines:
data = f.read().split('\n\n')
However, you must make sure the blank lines really are empty and do not contain any characters (such as spaces).
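One way to relax that requirement is to split with a regular expression rather than the literal '\n\n', so that blank lines containing stray spaces or tabs still act as separators. This is a minimal sketch, assuming the text has already been read into a string; the helper name `split_sentences` is hypothetical and does not come from the answers here.

```python
import re


def split_sentences(text):
    """Split text into sentences separated by one or more blank lines.

    A line holding only whitespace counts as blank. Whitespace inside a
    sentence, including line breaks, is collapsed to single spaces.
    """
    # \n\s*\n matches a newline, optional whitespace (which may include
    # further newlines), and then another newline.
    chunks = re.split(r'\n\s*\n', text.strip())
    return [' '.join(chunk.split()) for chunk in chunks if chunk.strip()]


sample = "BEFORE THE\n  \nBEFORE THE PARLIAMENT ON\nBRITAIN'S\n\n\nBEFORE\n"
print(split_sentences(sample))
```

Note that `' '.join(chunk.split())` also collapses runs of spaces within a line, which may or may not be desirable depending on the data.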
Answer 3 (score: 0)
A version using itertools.groupby. This works with any number of newlines between sentences:
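from itertools import groupby
from pprint import pprint

with open('file.txt', 'r') as f_in:
    txt = f_in.read()

out = []
# Group consecutive lines by whether they are non-empty;
# v is True for runs of text lines and False for runs of empty lines.
for v, g in groupby(txt.splitlines(), lambda k: k != ''):
    if v:
        out.append(' '.join(g))

pprint(out)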
Prints:
['BEFORE',
'BEFORE THE',
'BEFORE THE PARLIAMENT',
'BEFORE THE PARLIAMENT ON',
"BEFORE THE PARLIAMENT ON BRITAIN'S",
"BEFORE THE PARLIAMENT ON BRITAIN'S RELATIONS",
"BEFORE THE PARLIAMENT ON BRITAIN'S RELATIONS WITH"]