My data is in a list:
data = [['Biz_Innovations', '#socialmedia'],
['ChantalGrange', '#aws'],
['beyonddevops', '#aws'],
['beyonddevops', '#socialmedia'],
['IBMNetezza', '#ibm'],
['IBMNetezza', '#analytics'],
['SandraFeinsmith', '#ibm'],
['SandraFeinsmith', '#analytics'],
['fleejack', '#healhcare'],
['bigdataweek', '#socialmedia'],
['sabumjung', '#aws']]
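I want to count how often each word in the second column occurs (for example #socialmedia, #aws) and then select rows based on that frequency. If a word appears three or more times in the dataset, I want to keep the corresponding rows (and delete the others). So the result should look like this: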
data = [['Biz_Innovations', '#socialmedia'],
['ChantalGrange', '#aws'],
['beyonddevops', '#aws'],
['beyonddevops', '#socialmedia'],
['bigdataweek', '#socialmedia'],
['sabumjung', '#aws']]
Any suggestions?
Answer 0 (score: 2)
>>> import collections, operator
>>> words = collections.Counter(map(operator.itemgetter(1), data))
>>> populars = [p for p in data if words[p[1]] >= 3]
Answer 1 (score: 1)
In [16]: from collections import Counter
In [17]: keepers = [a[0] for a in Counter(d[1] for d in data).items() if a[1]>=3]
In [18]: [d for d in data if d[1] in keepers]
Out[18]:
[['Biz_Innovations', '#socialmedia'],
['ChantalGrange', '#aws'],
['beyonddevops', '#aws'],
['beyonddevops', '#socialmedia'],
['bigdataweek', '#socialmedia'],
['sabumjung', '#aws']]
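A possible variation on the snippet above, offered as a sketch rather than as part of the original answer: keeping the frequent tags in a set instead of a list makes each membership test constant time on average, which may matter for larger inputs. The names `tag_counts`, `frequent_tags`, and `filtered` are illustrative assumptions.

```python
from collections import Counter

data = [['Biz_Innovations', '#socialmedia'],
        ['ChantalGrange', '#aws'],
        ['beyonddevops', '#aws'],
        ['beyonddevops', '#socialmedia'],
        ['IBMNetezza', '#ibm'],
        ['IBMNetezza', '#analytics'],
        ['SandraFeinsmith', '#ibm'],
        ['SandraFeinsmith', '#analytics'],
        ['fleejack', '#healhcare'],
        ['bigdataweek', '#socialmedia'],
        ['sabumjung', '#aws']]

# Count how many times each tag (second column) occurs.
tag_counts = Counter(tag for _, tag in data)

# Collect the tags that occur three or more times in a set (illustrative name).
frequent_tags = {tag for tag, n in tag_counts.items() if n >= 3}

# Keep only the rows whose tag is frequent; the original row order is preserved.
filtered = [row for row in data if row[1] in frequent_tags]
print(filtered)
```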
Answer 2 (score: 1)
You can use collections.Counter:
import collections
counts = collections.Counter(tag for (_, tag) in data)
data = [[val, tag] for (val, tag) in data if counts[tag] >= 3]
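If the threshold needs to vary, the same idea could be wrapped in a small helper. This is a minimal sketch: the function name `filter_by_tag_frequency` and the `min_count` parameter are hypothetical, and it assumes every row has exactly two elements, as in the question.

```python
from collections import Counter

def filter_by_tag_frequency(rows, min_count=3):
    """Return the rows whose tag (second element) occurs at least min_count times.

    Hypothetical helper; assumes each row is a two-element [name, tag] sequence.
    """
    # First pass: count the occurrences of each tag.
    counts = Counter(tag for _, tag in rows)
    # Second pass: keep the rows whose tag meets the threshold, in original order.
    return [row for row in rows if counts[row[1]] >= min_count]

# Small made-up sample, not the data from the question.
sample = [['a', '#x'], ['b', '#x'], ['c', '#y'], ['d', '#x']]
print(filter_by_tag_frequency(sample))
```

Unlike reassigning `data` in place, this returns a new list and leaves the input untouched, so the original rows remain available for further checks.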