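I have a text file of a book that I want to read into my Python program, using open("book.txt").read().split(".")
to split it into sentences.
The problem is that the file contains newlines and runs of multiple spaces. I want the text to be just words separated by single spaces, with every newline turned into a single space.
My book.txt
currently looks like this (an excerpt):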
To Sherlock Holmes she is always the woman. I have seldom
heard him mention her under any other name. In his eyes she
eclipses and predominates the whole of her sex. It was not that
he felt any emotion akin to love for Irene Adler. All emotions,
and that one particularly, were abhorrent to his cold, precise but
admirably balanced mind. He was, I take it, the most perfect
reasoning and observing machine that the world has seen, but as
a lover he would have placed himself in a false position. He
never spoke of the softer passions, save with a gibe and a sneer.
Answer 0 (score: 1)
It sounds like you just want to get rid of the newlines and the leading/trailing whitespace...
Maybe something like...
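import re
# Turn each newline into a space (so words on adjacent lines are not glued together),
# then strip leading/trailing whitespace from every sentence.
sentences = [re.sub(r"^\s*|\s*$", "", re.sub(r"\n", " ", each)) for each in open("book.txt").read().split(".")]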
Or, if tabs are also a problem...
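# Collapse every run of whitespace (spaces, tabs, newlines) into a single space, then strip the ends.
sentences = [re.sub(r"^\s*|\s*$", "", re.sub(r"\s+", " ", each)) for each in open("book.txt").read().split(".")]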
To also split on ?, ! or ., use...
# Same cleanup, but split on any of ?, . or ! instead of only on periods.
sentences = [re.sub(r"^\s*|\s*$", "", re.sub(r"\s+", " ", each)) for each in re.split(r"[?.!]", open("book.txt").read())]
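
If the one-liners get hard to read, the same idea can be sketched as a small function that normalizes the whitespace once, before splitting. This is only a rough sketch: split_sentences is a hypothetical helper name, and the sample string is a shortened stand-in for the contents of book.txt. It relies on str.split() with no arguments, which splits on any run of whitespace.

```python
import re

def split_sentences(text):
    """Collapse whitespace runs to single spaces, then split on ., ? or !."""
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and ignores leading/trailing whitespace.
    normalized = " ".join(text.split())
    # Split on sentence-ending punctuation; the punctuation itself is dropped.
    parts = re.split(r"[.?!]", normalized)
    # Strip each piece and discard empty ones (e.g. after the final period).
    return [part.strip() for part in parts if part.strip()]

# Shortened stand-in for the file contents (assumption: the real text
# would come from reading book.txt).
sample = (
    "To Sherlock Holmes she is always the woman. I have seldom\n"
    "heard him mention her under any other name.  In his eyes she\n"
    "eclipses"
)
sentences = split_sentences(sample)
```

With the real file you would pass in the text read from book.txt instead of the sample string. Normalizing first means each sentence only needs a plain strip() afterwards, with no nested re.sub calls.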