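NLTK-Python: How to format raw text

Time: 2019-01-11 01:51:30

Tags: python nltk

Do you know whether I can format raw text (no punctuation, no capital letters, and no line breaks between paragraphs) using NLTK (or any other NLP library) and Python?

I have read the documentation, but I could not find anything that helps with this task.

Example:

Input: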

python is an interpreted high-level general-purpose programming language created by guido van rossum and first released in 1991 python has a design philosophy that emphasizes code readability notably using significant whitespace it provides constructs that enable clear programming on both small and large scales in July 2018, van rossum stepped down as the leader in the language community

Output:

Python is an interpreted, high-level, general-purpose programming language. Created by Guido van Rossum and first released in 1991, Python has a design philosophy that emphasizes code readability, notably using significant whitespace. It provides constructs that enable clear programming on both small and large scales. In July 2018, Van Rossum stepped down as the leader in the language community.

Thanks

1 answer:

Answer 0 (score: 1)

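Interesting question. As for inserting sentence boundaries, you can train NLTK's tokenizer (the Punkt sentence splitter); there is plenty of documentation if you search for it. One thing you can try is to take some text that is already split into sentences, remove the punctuation, then train on it and see what you get, along the lines of the code below. As mentioned, the algorithm probably relies heavily on punctuation, and in any case the code below does not work for your example sentences. However, if you train on other, larger, or different-domain text, it might be worth a try. I am not entirely sure whether this approach would also work for inserting commas and other punctuation that is not sentence-final or sentence-initial.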

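import re

from nltk.corpus import gutenberg
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Requires the Gutenberg corpus: run nltk.download("gutenberg") once beforehand.
# Concatenate the raw text of every Gutenberg file into one training string.
text = "".join(gutenberg.raw(file_id) for file_id in gutenberg.fileids())

# Remove sentence-final punctuation that sits directly before a line break.
# You will probably want to include some other potential sentence-final punctuation here.
text = re.sub(r"[.?!]\n", "\n", text)

# Train Punkt on the stripped text and build a tokenizer from the learned parameters.
trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True
trainer.train(text)
tokenizer = PunktSentenceTokenizer(trainer.get_params())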
sentences = "python is an interpreted high-level general-purpose programming language created by guido van rossum and first released in 1991 python has a design philosophy that emphasizes code readability notably using significant whitespace it provides constructs that enable clear programming on both small and large scales in July 2018, van rossum stepped down as the leader in the language community"
print(tokenizer.tokenize(sentences))
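
If a tokenizer does manage to split the text into sentences, the remaining step from the question (capital letters and final periods) could be handled with plain Python. The following is only a minimal sketch under that assumption: `restore_sentence_formatting` is a hypothetical helper, not part of NLTK, and the example list is split by hand rather than produced by the tokenizer above. It does not attempt commas or proper-noun capitalization such as "Guido van Rossum".

```python
def restore_sentence_formatting(sentences):
    """Capitalize the first character of each sentence and end it with a period."""
    formatted = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip empty fragments
        # Upper-case only the first character; leave the rest untouched.
        sentence = sentence[0].upper() + sentence[1:]
        # Add a period unless the sentence already ends with terminal punctuation.
        if sentence[-1] not in ".?!":
            sentence += "."
        formatted.append(sentence)
    return " ".join(formatted)


# Hypothetical tokenizer output, split by hand for illustration.
example = [
    "python has a design philosophy that emphasizes code readability",
    "it provides constructs that enable clear programming on both small and large scales",
]
print(restore_sentence_formatting(example))
```

In practice the list passed in would be whatever `tokenizer.tokenize(...)` returns, so the quality of the result depends entirely on how well the boundaries were detected.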