Only len(mycorpus.sents(fileid)) seems to fail (AssertionError), yet it works fine when I use it on Gutenberg: len(gutenberg.sents(fileid)). I am trying to get the sentence count of each text in my corpus so that I can compute the average sentence length per text. Is there a way to accomplish this with my own corpus?
from nltk.corpus import PlaintextCorpusReader

corpus_root = 'root'
mycorpus = PlaintextCorpusReader(corpus_root, r'.*\.txt', encoding='utf-8')
mycorpus.fileids()

for fileid in mycorpus.fileids():
    num_chars = len(mycorpus.raw(fileid))
    num_words = len(mycorpus.words(fileid))
    num_sents = len(mycorpus.sents(fileid))
    num_vocab = len(set(w.lower() for w in mycorpus.words(fileid)))
    print(num_chars / num_words, num_words / num_sents, num_words / num_vocab, fileid)
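For context, the quantity I am after is just total words divided by total sentences for each file. A minimal sketch of that computation, using hypothetical hand-tokenized sentences in place of the list-of-token-lists that mycorpus.sents(fileid) would return:

```python
def average_sentence_length(sents):
    """Mean number of word tokens per sentence.

    `sents` is a list of sentences, each a list of tokens --
    the same shape that NLTK's corpus readers return from .sents().
    """
    total_words = sum(len(sent) for sent in sents)
    return total_words / len(sents)

# Hypothetical sample data standing in for mycorpus.sents(fileid)
sample_sents = [
    ["This", "is", "a", "short", "sentence", "."],
    ["Another", "one", "."],
    ["Done", "."],
]

print(average_sentence_length(sample_sents))  # (6 + 3 + 2) / 3 sentences
```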