Question

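I am practicing on some text data, trying a few simple operations on it. Initially the word "data" had a frequency of 7, but then I found more "data"-related words in the same text, so I lowercased all of the text to pick up the ones I was missing. The final frequency of "data" turned out to be only 3. Can anyone help?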


data 7
project 5
people 4
systems 4
high 4

# First word-frequency calculation
# (tokens is assumed to be the already-tokenized text, a list of strings)

from collections import Counter
import string

from nltk.corpus import stopwords

stop_list = stopwords.words('english') + list(string.punctuation)

tokens_no_stop = [token for token in tokens if token not in stop_list]

word_frequency_no_stop = Counter(tokens_no_stop)

for word, freq in word_frequency_no_stop.most_common(20):
    print(word, freq)

data 2
project 2
management 2
skills 2

Can anyone explain why?

Answer 1

What is wrong with your code

all_tokens_lower = [t.lower() for t in word_frequency_no_stop]

In the line above, iterate over the token list (tokens_no_stop) instead of word_frequency_no_stop.

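You derived word_frequency_no_stop from

word_frequency_no_stop = Counter(tokens_no_stop)

which returns a dictionary-like Counter that holds each distinct word only once, as a key. Iterating over it yields those keys and ignores their counts.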

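In your case this gives a count of 2, because the Counter holds both a capitalized and a lowercase form of the word.

For example, word_frequency_no_stop = {'Project': 7, 'project': 2}

Lowercasing the two keys gives 'project' twice, so the new count is 2 rather than 9. Other words that occur in two case variants come out as 2 in the same way.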
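To see the effect on a small case, here is a minimal sketch with a made-up token list (the tokens are hypothetical, not the asker's data). It compares lowercasing the Counter's keys with lowercasing the token list itself.

```python
from collections import Counter

# Hypothetical tokens, for illustration only.
tokens = (["Project"] * 3 + ["project"]
          + ["Data"] * 2 + ["data", "DATA"]
          + ["skills"])

word_frequency = Counter(tokens)

# Buggy: iterating over a Counter visits each distinct key once,
# so the original counts are discarded and only case variants are counted.
keys_lowered = Counter(t.lower() for t in word_frequency)

# Correct: lowercase the token list itself, then count occurrences.
tokens_lowered = Counter(t.lower() for t in tokens)

print(keys_lowered["project"], tokens_lowered["project"])  # 2 4
print(keys_lowered["data"], tokens_lowered["data"])        # 3 4
```

The buggy version reports how many spellings of a word exist, not how often the word occurs, which is why the counts can go down after lowercasing.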

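Use the following code instead. Note that the stop-word check is done on the lowercased token, since the NLTK English stop words are all lowercase:

from collections import Counter
import string

from nltk.corpus import stopwords

stop_list = stopwords.words('english') + list(string.punctuation)

# Lowercase each token and compare the lowercased form against the stop list
tokens_no_stop = [token.lower() for token in tokens if token.lower() not in stop_list]

word_frequency_no_stop = Counter(tokens_no_stop)

for word, freq in word_frequency_no_stop.most_common(20):
    print(word, freq)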
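If only the Counter is at hand rather than the original token list, the counts can still be folded together by lowercased key. This is a minimal sketch of that alternative; merge_case is a hypothetical helper name, not part of collections or NLTK.

```python
from collections import Counter

def merge_case(counts):
    """Return a new Counter whose keys are lowercased, summing the counts
    of keys that differ only in case. (Hypothetical helper.)"""
    merged = Counter()
    for word, freq in counts.items():
        merged[word.lower()] += freq
    return merged

# Using the example from the answer:
merged = merge_case(Counter({'Project': 7, 'project': 2}))
print(merged)  # Counter({'project': 9})
```

Unlike lowercasing the keys and counting again, this keeps each key's frequency and adds the frequencies of case variants together.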

NLP - When I "lower" my text, some words have reduced frequency instead of increased
