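I figured out how to use the tf-idf scheme to capture how words are distributed across documents. However, I want to build lists of the most frequent and least frequent words for a list of sentences.
Here is part of the text preprocessing: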
print(df.shape) ->
(17298, 2)
print(df.columns) ->
Index(['screen_name', 'text'], dtype='object')
def process(txt):
    txt = re.sub(r"[^\w\s]","",txt)
    txt = re.sub(r"@([A-Z-a-z0-9_]+)", "", txt)
    tokens = nltk.word_tokenize(txt)
    token_lemmetized = [lemmatizer.lemmatize(token).lower() for token in tokens]
    return token_lemmetized

df['text'] = df['text'].apply(lambda x: process(x))
Then this is my second attempt:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from collections import Counter
from itertools import chain
import string
stop = set(stopwords.words('english'))
df['text'] = df['text'].apply(lambda x: [item for item in x if item not in stop])
all_words = list(chain.from_iterable(df['text']))
for i in all_words:
    x=Counter(df['text'][i])
    res= [word for word, count in x.items() if count == 1]
print(res)
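With the approach above I want to get the most frequent and least frequent words from a list of sentences, but my attempt doesn't produce that output. What should I do? Is there an elegant way to do this? Any ideas would be appreciated. Thanks.
Example data snippet:
This is the data I used; the file can be found here: example data
Example input and output: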
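inputList = {"RT @GOPconvention: #Oregon votes today. That means 62 days until the @GOPconvention!", "RT @DWStweets: The choice for 2016 is clear: We need another Democrat in the White House. #DemDebate #WeAreDemocrats ", "Trump's calling for trillion dollar tax cuts for Wall Street.", "From Chatham Town Council to Congress, @RepRobertHurt has made a strong mark on his community. Proud of our work together on behalf of VA!"}
Example token output: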
['rt', 'gopconvention', 'oregon', 'vote', 'today', 'that', 'mean', '62', 'day', 'until', 'gopconvention', 'http', 't', 'co', 'ooh9fvb7qs']
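Desired output:
I want to build a vocabulary of the most frequent and the least frequent words in the given data. Any ideas on how to do this? Thanks.
Answer 0 (score: 3)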
collections.Counter() can do this for you. I can't access your data link, but by copying and pasting the text you posted as an example, here is how it can be done:
>>> import collections
>>> s = ("in above approach I want to create most frequent and least frequent "
...      "words from list of sentences, but my attempt didn't produce that outuput? "
...      "what should I do? any elegant way to make this happen? any idea? can anyone "
...      "give me possible idea to make this happen? Thanks")
>>> c = dict(collections.Counter(s.split()))
>>> c
{'in': 1, 'above': 1, 'approach': 1, 'I': 2, 'want': 1, 'to': 3, 'create': 1,
'most': 1, 'frequent': 2, 'and': 1, 'least': 1, 'words': 1, 'from': 1,
'list': 1, 'of': 1, 'sentences,': 1, 'but': 1, 'my': 1, 'attempt': 1,
"didn't": 1, 'produce': 1, 'that': 1, 'outuput?': 1, 'what': 1, 'should': 1,
'do?': 1, 'any': 2, 'elegant': 1, 'way': 1, 'make': 2, 'this': 2, 'happen?':
2, 'idea?': 1, 'can': 1, 'anyone': 1, 'give': 1, 'me': 1, 'possible': 1,
'idea': 1, 'Thanks': 1}
>>> maxval = max(c.values())
>>> print([word for word in c if c[word] == maxval])
['to']
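You'll want to remove punctuation and the like first. Otherwise happen and happen?, for example, are counted as two different words (just as idea and idea? are in the output above). But you'll notice that c is a dictionary whose keys are the words and whose values are the number of times each word appears in the string.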
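The cleanup step mentioned above can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the helper name count_words is hypothetical, and it assumes that lowercasing and dropping every character that is not a word character or whitespace is acceptable for your data (so "didn't" becomes "didnt").

```python
import re
from collections import Counter

def count_words(text):
    """Lowercase the text, drop punctuation, and count the remaining words."""
    # Remove everything that is not a word character or whitespace.
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return Counter(cleaned.split())

counts = count_words("any idea? can anyone give me possible idea to make this happen?")
print(counts["idea"])         # 'idea?' and 'idea' now share a single key
print(counts.most_common(1))  # the most frequent word with its count
```

Keeping the result as a Counter instead of converting it to a dict leaves most_common() available, which returns the words sorted from most to least frequent.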
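Edit: the following works with a list of multiple tweets like the one you have. You can use a regular expression to first reduce each tweet to lowercase text with no punctuation, etc.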
from collections import Counter
import re
fakenews = ["RT @GOPconvention: #Oregon votes today. That means 62 days until the @GOPconvention!",
"RT @DWStweets: The choice for 2016 is clear: We need another Democrat in the White House. #DemDebate #WeAreDemocrats ",
"Trump's calling for trillion dollar tax cuts for Wall Street.",
"From Chatham Town Council to Congress, @RepRobertHurt has made a strong mark on his community. Proud of our work together on behalf of VA!"]
big_dict = {}
# Matches runs of characters that are neither alphanumeric nor whitespace (underscores included)
pattern = re.compile(r'([^\s\w]|_)+')
for tweet in fakenews:
    # Strip out any non-alphanumeric, non-whitespace characters
    tweet_simplified = pattern.sub('', tweet).lower()
    # Get the word count for this Tweet, then add it to the main dictionary
    word_count = dict(Counter(tweet_simplified.split()))
    for word in word_count:
        if word in big_dict:
            big_dict[word] += word_count[word]
        else:
            big_dict[word] = word_count[word]
# Start with the most frequently used words, and count down.
maxval = max(big_dict.values())
print("Word frequency:")
for i in range(maxval, 0, -1):
    words = [w for w in big_dict if big_dict[w] == i]
    print("%d - %s" % (i, ', '.join(words)))
Output:
Word frequency:
3 - the, for
2 - rt, gopconvention, on, of
1 - oregon, votes, today, that, means, 62, days, until, dwstweets, choice, 2016, is, clear, we, need, another, democrat, in, white, house, demdebate, wearedemocrats, trumps, calling, trillion, dollar, tax, cuts, wall, street, from, chatham, town, council, to, congress, reproberthurt, has, made, a, strong, mark, his, community, proud, our, work, together, behalf, va
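Since the question stores each tweet as a list of tokens in df['text'], the same counting idea can be applied directly to token lists. The sketch below is only an illustration under that assumption: the function name frequency_extremes and the tokenized sample are hypothetical, and "least frequent" is taken to mean the words that share the lowest count.

```python
from collections import Counter
from itertools import chain

def frequency_extremes(token_lists):
    """Return (most_frequent, least_frequent) words across all token lists."""
    # Flatten the per-tweet token lists and count every token once per occurrence.
    counts = Counter(chain.from_iterable(token_lists))
    if not counts:
        return [], []
    highest = max(counts.values())
    lowest = min(counts.values())
    most = [w for w, c in counts.items() if c == highest]
    least = [w for w, c in counts.items() if c == lowest]
    return most, least

# Hypothetical stand-in for df['text'] after tokenization.
tokenized = [
    ["rt", "gopconvention", "oregon", "vote", "today"],
    ["rt", "dwstweets", "choice", "clear"],
    ["vote", "rt"],
]
most, least = frequency_extremes(tokenized)
print(most)
print(least)
```

With a pandas column of token lists, calling frequency_extremes(df['text']) should work the same way, because iterating over a Series yields its values. The loop in the question fails for a different reason: it indexes df['text'] with a word rather than with a row label.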