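I grabbed a text corpus from nltk and now want to process it so that every line in the file ends with a punctuation mark. For example, this text: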
Her mother
had died too long ago for her to
remember her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection.
should become:
Her mother had died too long ago for her to remember her caresses;
and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.
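I tried using sed to match lines that have no punctuation at the end, but I can't figure out how to pull the next line up. Any help is much appreciated!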
Answer 0 (score: 5)
What about using paste and sed like this?
paste prints all of the text on a single line, with the original lines joined by spaces.
$ paste -s -d' ' file
Her mother had died too long ago for her to remember her caresses; and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.
sed then replaces the space after each . and ; with a newline (-r enables extended regular expressions in GNU sed).
$ paste -s -d' ' file | sed -r 's/(\.|\;) /\1\n/g'
Her mother had died too long ago for her to remember her caresses;
and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.
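For readers who would rather stay in Python, the same two steps can be sketched roughly as follows. This is only an approximation of the pipeline above, not a drop-in replacement: the function name reflow is made up for illustration, and unlike paste it also strips surrounding whitespace and skips blank lines.

```python
import re

def reflow(text: str) -> str:
    """Hypothetical helper mirroring the paste + sed pipeline."""
    # Step 1 (roughly `paste -s -d' '`): join all non-empty lines with single spaces.
    joined = " ".join(line.strip() for line in text.splitlines() if line.strip())
    # Step 2 (roughly the sed substitution): turn the space after '.' or ';' into a newline.
    return re.sub(r"([.;]) ", r"\1\n", joined)

sample = """Her mother
had died too long ago for her to
remember her caresses; and her place had been supplied
by an excellent woman as governess, who had fallen little short
of a mother in affection."""

print(reflow(sample))
```

Like the sed command, this only breaks after . and ; so other sentence-ending marks such as ? or ! would need to be added to the character class.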
Answer 1 (score: 3)
In Python:
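import string  # for string.punctuation

with open("path/to/file") as f:
    output = ""
    for line in f:
        sanitized = line.strip()
        if not sanitized:
            continue  # skip blank lines so sanitized[-1] cannot raise IndexError
        output += sanitized
        # end the output line after punctuation; otherwise join with a space
        output += "\n" if sanitized[-1] in string.punctuation else " "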
Once the with block exits, output holds the expected file contents. You can then use output to overwrite the file if needed.
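The overwrite step could look something like the sketch below. The helper names join_until_punctuation and rewrite_in_place are made up for illustration, and the sketch assumes the file is small enough to hold in memory. It also joins wrapped lines with a space, which is an assumption about the desired output.

```python
import string
from pathlib import Path

def join_until_punctuation(lines):
    """Hypothetical helper: merge lines until one ends in punctuation."""
    output = ""
    for line in lines:
        sanitized = line.strip()
        if not sanitized:
            continue  # ignore blank lines
        output += sanitized
        # Break the line after punctuation, otherwise keep joining with a space.
        output += "\n" if sanitized[-1] in string.punctuation else " "
    return output

def rewrite_in_place(path):
    """Hypothetical helper: read the file, reflow it, and write it back."""
    p = Path(path)
    with p.open() as f:
        output = join_until_punctuation(f)
    # The file is closed before it is overwritten.
    p.write_text(output)
```

Note that string.punctuation also contains characters such as , and -, so a line that happens to end in a comma is treated as complete. Checking against a narrower set such as ".;!?" may be closer to what the question asks for.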
Answer 2 (score: 0)
Using NLTK's sent_tokenize():
>>> from nltk import sent_tokenize
>>> text = """Her mother
... had died too long ago for her to
... remember her caresses; and her place had been supplied
... by an excellent woman as governess, who had fallen little short
... of a mother in affection."""
>>> sent_tokenize(text.replace("\n", " "))
['Her mother had died too long ago for her to remember her caresses; and her place had been supplied by an excellent woman as governess, who had fallen little short of a mother in affection.']
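As the output shows, the whole passage comes back as one sentence, because the semicolon is not treated as a sentence boundary. If a break after each ; is also wanted, as in the question's expected output, one possible post-processing step is sketched below. The function name split_on_semicolons is hypothetical, and the sketch takes an already tokenized list, so it does not depend on NLTK itself.

```python
import re

def split_on_semicolons(sentences):
    """Hypothetical helper: further split each sentence after every ';'."""
    pieces = []
    for sentence in sentences:
        # The lookbehind keeps the ';' attached to the preceding clause.
        pieces.extend(re.split(r"(?<=;)\s+", sentence))
    return pieces

# Stand-in for the list returned by sent_tokenize() above.
sentences = [
    "Her mother had died too long ago for her to remember her caresses; "
    "and her place had been supplied by an excellent woman as governess, "
    "who had fallen little short of a mother in affection."
]

for piece in split_on_semicolons(sentences):
    print(piece)
```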