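I've been trying to detect trends in words/bigrams across a set of texts. So far I have removed stop words, lowercased everything, computed word frequencies, and appended the 30 most common words of each text to a list,
e.g.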
[(u'seeing', 2), (u'said.', 2), (u'one', 2), (u'death', 2), (u'entertainment', 2), (u'it\u2019s', 2), (u'weiss', 2), (u'read', 2), (u'\u201cit', 1), (u'shot', 1), (u'show\u2019s', 1), (u'people', 1), (u'dead,\u201d', 1), (u'bloody', 1),...]
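I then flattened the lists above into one big list containing every word and its per-document frequency. What I need now is a sorted list with the counts of each word summed, i.e.: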
[(u'snow', 32), (u'said.', 12), (u'GoT', 10), (u'death', 8), (u'entertainment', 4)..]
Any ideas?
Code:
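from nltk import FreqDist

# texts: list of strings; stopwords: collection of stop words (both defined elsewhere)
fdists = []
for i in texts:
    # Lowercase each token and count it unless it is a stop word
    words = FreqDist(w.lower() for w in i.split() if w.lower() not in stopwords)
    fdists.append(words.most_common(30))
# Flatten the per-text lists into a single list of (word, count) pairs
all_in_one = [item for sublist in fdists for item in sublist]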
Answer 0 (score: 0)
If you just want to merge the counts and sort the list, you can use:
import operator
fdists = [(u'seeing', 2), (u'said.', 2), (u'one', 2), (u'death', 2), (u'entertainment', 2), (u'it\u2019s', 2), (u'weiss', 2), (u'read', 2), (u'\u201cit', 1), (u'shot', 1), (u'show\u2019s', 1), (u'people', 1), (u'dead,\u201d', 1), (u'bloody', 1)]
fdists2 = [(u'seeing', 3), (u'said.', 4), (u'one', 2), (u'death', 2), (u'entertainment', 2), (u'it\u2019s', 2), (u'weiss', 2), (u'read', 2)]
fdists += fdists2
# Sum the counts of words that appear more than once
fdict = {}
for i in fdists:
    if i[0] in fdict:
        fdict[i[0]] += i[1]
    else:
        fdict[i[0]] = i[1]
# Sort by total count, highest first
sorted_f = sorted(fdict.items(), key=operator.itemgetter(1), reverse=True)
print(sorted_f[:30])
[(u'said.', 6), (u'seeing', 5), (u'death', 4), (u'entertainment', 4), (u'read', 4), (u'it\u2019s', 4), (u'weiss', 4), (u'one', 4), (u'\u201cit', 1), (u'shot', 1), (u'show\u2019s', 1), (u'people', 1), (u'dead,\u201d', 1), (u'bloody', 1)]
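As a more compact alternative to the manual dictionary loop, the same aggregation can be sketched with collections.Counter from the standard library. The helper name merge_counts and the small sample list are illustrative assumptions, not part of the original question.

```python
from collections import Counter

def merge_counts(pairs):
    """Sum the counts of repeated words in a list of (word, count) pairs."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals

# Small made-up sample: per-document (word, count) pairs, already flattened
pairs = [('seeing', 2), ('said.', 2), ('shot', 1), ('seeing', 3), ('said.', 4)]
totals = merge_counts(pairs)
# most_common(n) returns up to n (word, total) pairs, largest total first
top = totals.most_common(30)
print(top)
```

Since FreqDist is itself a Counter subclass, the per-text FreqDist objects could probably be summed the same way before truncating to the top 30, which would avoid losing counts for words that miss the cut in some texts.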
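Another way to handle the duplicates is to use the pandas groupby() function and then sort by count and word (with sort() in old pandas versions, sort_values() since pandas 0.17):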
from pandas import DataFrame
fdists = [(u'seeing', 2), (u'said.', 2), (u'one', 2), (u'death', 2), (u'entertainment', 2), (u'it\u2019s', 2), (u'weiss', 2), (u'read', 2), (u'\u201cit', 1), (u'shot', 1), (u'show\u2019s', 1), (u'people', 1), (u'dead,\u201d', 1), (u'bloody', 1)]
fdists2 = [(u'seeing', 3), (u'said.', 4), (u'one', 2), (u'death', 2), (u'entertainment', 2), (u'it\u2019s', 2), (u'weiss', 2), (u'read', 2)]
fdists += fdists2
df = DataFrame(data=fdists, columns=['word', 'count'])
# Sum the counts of duplicate words
df = df.groupby('word', as_index=False)['count'].sum()
# sort_values() replaces DataFrame.sort(), which newer pandas versions no longer provide
Sorted = df.sort_values(['count', 'word'], ascending=[False, True])
print(Sorted[:30])
             word  count
8           said.      6
9          seeing      5
2           death      4
3   entertainment      4
4            it’s      4
5             one      4
7            read      4
12          weiss      4
0          bloody      1
1          dead,”      1
6          people      1
10           shot      1
11         show’s      1
13            “it      1
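The same ordering (count descending, then word ascending) can also be sketched without pandas by sorting on a tuple key. The function name sort_by_count_then_word is hypothetical, and the input is a small made-up dictionary of totals.

```python
def sort_by_count_then_word(totals):
    """Return (word, count) pairs: highest count first, ties broken alphabetically."""
    # Negating the count sorts it descending while the word stays ascending
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))

# Made-up totals, as produced by summing per-document counts
totals = {'death': 4, 'said.': 6, 'weiss': 4, 'bloody': 1, 'seeing': 5}
ranked = sort_by_count_then_word(totals)
print(ranked[:30])
```

Unlike sorting on the count alone, this makes the order of tied words deterministic.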