I am currently working on a project related to natural language processing and text mining, and I have written code that computes the frequency of each unique word in a text file.
Frequency of: trypanosomiasis --> 0.0029
Frequency of: deadly --> 0.0029
Frequency of: yellow --> 0.0029
Frequency of: humanassociated --> 0.0029
Frequency of: successful --> 0.0029
Frequency of: potential --> 0.0058
Frequency of: which --> 0.0029
Frequency of: cholera --> 0.01449
Frequency of: antimicrobial --> 0.0029
Frequency of: hostdirected --> 0.0029
Frequency of: cameroon --> 0.0029
Is there any library or method to remove common words such as adjectives and auxiliary verbs (e.g. "which", "potential", "this", etc.) from a text file, so that I can explore or count the scientific terms most likely to appear in the text data?
Answer (score: 2)
In text analysis you typically remove stop words - common words that carry little meaning in the text. You can remove these using nltk's stop word list (from https://pythonspot.com/en/nltk-stop-words/):
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords

data = "All work and no play makes jack dull boy. All work and no play makes jack a dull boy."
stopWords = set(stopwords.words('english'))
words = word_tokenize(data)

wordsFiltered = []
for w in words:
    if w not in stopWords:
        wordsFiltered.append(w)

print(wordsFiltered)
If you want to remove other words as well, just add them to the set stopWords.
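To connect this back to your frequency computation: after filtering, relative frequencies can be obtained with `collections.Counter`. The sketch below uses a tiny hand-picked stop list instead of nltk's (so it runs without downloading the nltk corpus); in practice you would pass in `set(stopwords.words('english'))`, possibly extended with domain-specific words.

```python
from collections import Counter

def word_frequencies(text, stop_words):
    # Lowercase and split on whitespace; nltk's word_tokenize could be
    # substituted here for better handling of punctuation.
    words = [w for w in text.lower().split() if w not in stop_words]
    counts = Counter(words)
    total = sum(counts.values())
    # Relative frequency of each remaining word.
    return {w: c / total for w, c in counts.items()}

# Tiny illustrative stop list; replace with nltk's English stop words.
stop_words = {"all", "and", "no", "a"}
data = ("All work and no play makes jack dull boy. "
        "All work and no play makes jack a dull boy.")
freqs = word_frequencies(data, stop_words)
```

Sorting `freqs` by value in descending order then surfaces the most frequent remaining terms, which in a scientific corpus should skew toward domain vocabulary once the stop list is broad enough.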