from nltk.tokenize import sent_tokenize, word_tokenize, RegexpTokenizer
fileName = input("Enter file name: ")
f = open(fileName)
raw = f.read()
tokenizer = RegexpTokenizer(r'\w+')
# The RegexpTokenizer drops the punctuation and keeps only the words
print(tokenizer.tokenize(raw))  # prints all words
print(sent_tokenize(raw))       # prints all sentences
print('number of sentences equals', len(sent_tokenize(raw)))
print('number of words equals', len(tokenizer.tokenize(raw)))
average = len(tokenizer.tokenize(raw)) / len(sent_tokenize(raw))
print('average words per sentence equals', average)
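Note that sent_tokenize relies on NLTK's Punkt sentence tokenizer data; if it has not been downloaded yet, a one-time download is needed before the script above will run (a minimal sketch):

import nltk
nltk.download('punkt')  # one-time download of the Punkt models used by sent_tokenize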
Answer 0 (score: 1)
As @bkm's answer points out, you can use this:
long_words = [wrd for wrd in tokenizer.tokenize(raw) if len(wrd) >= 3]
However, if what you want is to remove words like "and", "the", "if", and so on, you should filter them out with stopwords:
from nltk.corpus import stopwords
sw = stopwords.words('english')
# sw will contain a list of stopwords (and, the, unless, about, etc.)
# filter them out like this:
tokens = [t for t in tokenizer.tokenize(raw) if t not in sw]
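The two filters can also be combined in a single comprehension, keeping only non-stopword tokens of three or more characters (a small sketch reusing the tokenizer and the sw list defined above):

long_words = [t for t in tokenizer.tokenize(raw)
              if t.lower() not in sw and len(t) >= 3]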
Answer 1 (score: 0)
I suspect you are looking for something like:

long_words = [wrd for wrd in tokenizer.tokenize(raw) if len(wrd) > 2]
If you find a multi-line for loop easier to read, the list comprehension above is equivalent to:
long_words = []
for wrd in tokenizer.tokenize(raw):
    if len(wrd) > 2:
        long_words.append(wrd)
If you are looking for three or more letters in the strict sense (i.e. not counting digits), the if clause can be:
len([ch for ch in wrd if ch.isalpha()]) > 2
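For context, that clause slots into the comprehension like this (a sketch under the same assumptions as above):

long_words = [wrd for wrd in tokenizer.tokenize(raw)
              if len([ch for ch in wrd if ch.isalpha()]) > 2]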
Finally, if you only want the tokenizer itself to capture words of three or more characters, you can change r'\w+' to r'\w{3,}'.
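A minimal sketch of that change, applied to the same raw text as above:

tokenizer = RegexpTokenizer(r'\w{3,}')  # only matches tokens of 3+ word characters
long_words = tokenizer.tokenize(raw)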