How to parse a file sentence by sentence in Python

Time: 2018-02-22 03:47:34

Tags: python nltk tokenize

I need to read a large number of large text files.

For each file, I need to open it and read the text sentence by sentence.

Most of the approaches I have found read files line by line.

How can I do this in Python?

2 Answers:

Answer 0 (score: 3):

If you want sentence tokenization, nltk is probably the fastest way. http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize.punkt will get you quite far.

For example, the code from the docs:

>>> import nltk.data
>>> text = '''
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries.  And sometimes sentences
... can start with non-capitalized words.  i is a good variable
... name.
... '''
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))


Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.
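
The docs example tokenizes an in-memory string. As a minimal sketch of applying the same tokenizer across many files (assuming the punkt model is already downloaded, that each file fits in memory, and a purely illustrative texts/*.txt glob pattern):

import glob
import nltk.data

sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

# Hypothetical input directory; adjust the pattern to your files.
for path in glob.glob('texts/*.txt'):
    with open(path, encoding='utf-8') as f:
        # Tokenize the whole file, then handle one sentence at a time
        for sentence in sent_detector.tokenize(f.read().strip()):
            print(sentence)  # replace with your own per-sentence processing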

Answer 1 (score: -2):

If the file contains a large number of lines, you can use the yield statement to create a generator:

def read(filename):
    # Iterate over the file lazily, line by line, instead of loading
    # everything into memory with readlines()
    with open(filename, "r") as file:
        for line in file:
            for word in line.split():
                yield word

for word in read("sample.txt"):
    print(word)

This yields every word on each line of the file. Note that it produces words, not sentences.
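
To get sentences rather than words, the same generator pattern can be combined with the tokenizer from the first answer. A minimal sketch, assuming the punkt model is available and each file fits in memory:

import nltk.data

def read_sentences(filename):
    # Hypothetical helper: load the file, then lazily yield one
    # sentence at a time from NLTK's punkt tokenizer.
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
    with open(filename, encoding='utf-8') as file:
        for sentence in sent_detector.tokenize(file.read().strip()):
            yield sentence

for sentence in read_sentences("sample.txt"):
    print(sentence)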