How to use nltk

Time: 2016-05-03 17:35:18

Tags: python python-3.x nltk word

from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
fileName = input("Enter file name: ")
f = open(fileName)
raw = f.read()


tokenizer = RegexpTokenizer(r'\w+')

This skips the punctuation and keeps only the words.

print(tokenizer.tokenize(raw))  # this will print all the words
print(sent_tokenize(raw))

print('number of sentences equals', len(sent_tokenize(raw)))
print('number of words equals', len(tokenizer.tokenize(raw)))


average = (len(tokenizer.tokenize(raw)) / len(sent_tokenize(raw)))
print('average words per sentence equals', average)
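The counting logic above can be sketched without file input, using `re` as a rough stand-in for `RegexpTokenizer(r'\w+')` and `sent_tokenize` (a minimal sketch; the sample text is made up):

```python
import re

raw = "NLTK is great. It tokenizes text, sentences, and words!"

# Split into sentences on terminal punctuation (a rough stand-in for
# nltk's sent_tokenize) and into words on runs of \w characters.
sentences = re.split(r'(?<=[.!?])\s+', raw.strip())
words = re.findall(r'\w+', raw)

average = len(words) / len(sentences)
print('number of sentences equals', len(sentences))
print('number of words equals', len(words))
print('average words per sentence equals', average)  # → 4.5
```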

2 Answers:

Answer 0: (score: 1)

As the answer from @bkm points out, you can use this:

long_words = [wrd for wrd in tokenizer.tokenize(raw) if len(wrd) >= 3]

However, if what you want is to remove words like "and", "the", "if", etc., you should filter them out with stopwords:

from nltk.corpus import stopwords

sw = stopwords.words('english')
# sw will contain a list of stopwords (and, the, unless, about, etc.)
# filter them out like this:
tokens = [t for t in tokenizer.tokenize(raw) if t not in sw]
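A quick way to see the filter in action without downloading the NLTK corpus is to use a tiny hand-picked list as a stand-in for `stopwords.words('english')` (illustrative only; the real list is much longer):

```python
# Tiny stand-in for nltk's English stopword list (illustrative only).
sw = {'and', 'the', 'if', 'a', 'is'}

tokens = ['the', 'cat', 'and', 'the', 'dog', 'if', 'they', 'meet']
filtered = [t for t in tokens if t not in sw]
print(filtered)  # → ['cat', 'dog', 'they', 'meet']
```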

Answer 1: (score: 0)

I suspect you are looking for something like:
long_words = [wrd for wrd in tokenizer.tokenize(raw) if len(wrd) > 2]

If you find a multi-line for loop easier to follow, the list comprehension above is equivalent to:

long_words = []
for wrd in tokenizer.tokenize(raw):
    if len(wrd) > 2:
        long_words.append(wrd)

If you are looking for 3 or more letters in the strict sense (i.e. not digits), the if clause can be:

len([chr for chr in wrd if chr.isalpha()]) > 2
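For example, the letters-only check distinguishes purely alphabetic words from digit strings and mixed tokens (a minimal sketch with made-up tokens):

```python
tokens = ['abc', '123', 'a1b2c3', 'hello', 'hi']

# Keep tokens containing at least 3 alphabetic characters.
long_words = [wrd for wrd in tokens
              if len([chr for chr in wrd if chr.isalpha()]) > 2]
print(long_words)  # → ['abc', 'a1b2c3', 'hello']
```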

Finally, if you only want to capture words containing 3 or more characters, you can change r'\w+' to r'\w{3,}'.
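The pattern change can be checked with plain `re.findall`, which matches on the same regular expressions that `RegexpTokenizer` uses (the sample text is made up):

```python
import re

raw = "a an the cat runs fast"
print(re.findall(r'\w+', raw))     # all words
print(re.findall(r'\w{3,}', raw))  # only words of 3+ characters
```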