Only len(mycorpus.sents(fileid)) seems to fail (AssertionError), yet it works fine when I use it on Gutenberg: len(gutenberg.sents(fileid)). I am trying to get the sentence count of each text in my corpus so that I can compute the average sentence length per text. Is there a way to accomplish this with my own corpus?
from nltk.corpus import PlaintextCorpusReader

corpus_root = 'root'
mycorpus = PlaintextCorpusReader(corpus_root, r'.*\.txt', encoding='utf-8')
mycorpus.fileids()

for fileid in mycorpus.fileids():
    num_chars = len(mycorpus.raw(fileid))
    num_words = len(mycorpus.words(fileid))
    num_sents = len(mycorpus.sents(fileid))
    num_vocab = len(set(w.lower() for w in mycorpus.words(fileid)))
    print(num_chars / num_words, num_words / num_sents, num_words / num_vocab, fileid)
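For context, the quantity I am after is just total words divided by total sentences for each file. A minimal sketch of that computation, using hypothetical hand-tokenized sentences in place of the list-of-token-lists that mycorpus.sents(fileid) would return:

```python
def average_sentence_length(sents):
    """Mean number of word tokens per sentence.

    `sents` is a list of sentences, each a list of tokens --
    the same shape that NLTK's corpus readers return from .sents().
    """
    total_words = sum(len(sent) for sent in sents)
    return total_words / len(sents)

# Hypothetical sample data standing in for mycorpus.sents(fileid)
sample_sents = [
    ["This", "is", "a", "short", "sentence", "."],
    ["Another", "one", "."],
    ["Done", "."],
]

print(average_sentence_length(sample_sents))  # (6 + 3 + 2) / 3 sentences
```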