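I have some JSON Twitter data from the streaming API, and I want to use Counter to find the most popular hashtags in this dataset. The problem I'm running into is looping through tweets that have multiple hashtags, rather than just pulling out the first hashtag and ignoring any remaining ones.
Question: how do I loop through a nested list inside a dict to extract all of the hashtags in a tweet, not just the first one?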
In [1]: import json
In [2]: from collections import Counter
In [3]: data = []
In [4]: for line in open('DC.json'):
   ...:     try:
   ...:         data.append(json.loads(line))
   ...:     except:
   ...:         pass
   ...:
In [5]: hashtags = []
In [6]: for i in data:
   ...:     if 'entities' in i and len(i['entities']['hashtags']) > 0:
   ...:         hashtags.append(i['entities']['hashtags']['text'])
   ...:     else:
   ...:         pass
   ...:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-66d7538509f9> in <module>()
      1 for i in data:
      2     if 'entities' in i and len(i['entities']['hashtags']) > 0:
----> 3         hashtags.append(i['entities']['hashtags']['text'])
      4     else:
      5         pass

TypeError: list indices must be integers, not str
In [7]: Counter(hashtags).most_common()[:10]
Here is an example of i['entities']['hashtags']:
In [12]: i[0]['entities']['hashtags']
Out[12]:
[{u'indices': [28, 35], u'text': u'selfie'},
{u'indices': [82, 92], u'text': u'omg'},
{u'indices': [93, 104], u'text': u'Champ'},
{u'indices': [105, 117], u'text': u'FIRST'}]
Answer 0 (score: 4)
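You say that i['entities']['hashtags'] is a list of dicts, so the line:

    hashtags.append(i['entities']['hashtags']['text'])

is trying to index a list with a string. That makes no sense, and it causes the error. I think you would be best off splitting this into steps, first getting all of the 'hashtag' dictionaries:

    hashtags = []
    for i in data:
        if 'entities' in i:
            hashtags.extend(i['entities']['hashtags'])

then extracting the 'text' from each one:

    hashtags = [tag['text'] for tag in hashtags]

then dumping the result into Counter.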
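To make the last step concrete, here is a minimal sketch of the whole pipeline in one pass. It assumes Python 3 and tweets shaped like the sample in the question; the in-memory `data` list and the `count_hashtags` helper are made up for illustration and are not part of the original code, which reads the tweets from DC.json.

```python
from collections import Counter

# Hypothetical sample shaped like the tweets in the question; real data
# would come from parsing DC.json line by line.
data = [
    {'entities': {'hashtags': [{'indices': [28, 35], 'text': 'selfie'},
                               {'indices': [82, 92], 'text': 'omg'}]}},
    {'entities': {'hashtags': [{'indices': [0, 7], 'text': 'selfie'}]}},
    {'entities': {'hashtags': []}},   # tweet with no hashtags
    {'delete': {'status': {}}},       # stream message with no 'entities' key
]


def count_hashtags(tweets):
    """Count every hashtag text across all tweets, not just the first per tweet."""
    return Counter(
        tag['text']
        for tweet in tweets                                        # each tweet dict
        for tag in tweet.get('entities', {}).get('hashtags', [])  # each hashtag dict
    )


top_hashtags = count_hashtags(data).most_common(10)
print(top_hashtags)  # [('selfie', 2), ('omg', 1)]
```

The nested generator does the same job as the extend-then-extract steps above, and the `.get()` calls with defaults skip stream messages that carry no 'entities' key. Passing 10 to `most_common` returns the same entries as slicing with `[:10]`.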