I have a large pile of JSON data in the following format:
[
[{
"created_at": "2017-04-28T16:52:36Z",
"as_of": "2017-04-28T17:00:05Z",
"trends": [{
"url": "http://twitter.com/search?q=%23ChavezSigueCandanga",
"query": "%23ChavezSigueCandanga",
"tweet_volume": 44587,
"name": "#ChavezSigueCandanga",
"promoted_content": null
}, {
"url": "http://twitter.com/search?q=%2327Abr",
"query": "%2327Abr",
"tweet_volume": 79781,
"name": "#27Abr",
"promoted_content": null
}],
"locations": [{
"woeid": 395277,
"name": "Turmero"
}]
}],
[{
"created_at": "2017-04-28T16:57:35Z",
"as_of": "2017-04-28T17:00:03Z",
"trends": [{
"url": "http://twitter.com/search?q=%23fyrefestival",
"query": "%23fyrefestival",
"tweet_volume": 141385,
"name": "#fyrefestival",
"promoted_content": null
}, {
"url": "http://twitter.com/search?q=%23HotDocs17",
"query": "%23HotDocs17",
"tweet_volume": null,
"name": "#HotDocs17",
"promoted_content": null
}],
"locations": [{
"woeid": 9807,
"name": "Vancouver"
}]
}]
]...
I wrote a function that formats it into a pandas DataFrame of the following form:
+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+
| | name | promoted_content | query | tweet_volume | url | as_of | created_at | location_name | location_woeid |
+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+
| 47 | #BatesMotel | | %23BatesMotel | 59748 | http://twitter.com/search?q=%23BatesMotel | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg | 2972 |
| 48 | #AdviceForPeopleJoiningTwitter | | %23AdviceForPeopleJoiningTwitter | 51222 | http://twitter.com/search?q=%23AdviceForPeopleJoiningTwitter | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg | 2972 |
| 49 | #CADTHSymp | | %23CADTHSymp | | http://twitter.com/search?q=%23CADTHSymp | 2017-04-25T17:00:05Z | 2017-04-25T16:53:43Z | Winnipeg | 2972 |
| 0 | #WorldPenguinDay | | %23WorldPenguinDay | 79006 | http://twitter.com/search?q=%23WorldPenguinDay | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto | 4118 |
| 1 | #TravelTuesday | | %23TravelTuesday | | http://twitter.com/search?q=%23TravelTuesday | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto | 4118 |
| 2 | #DigitalLeap | | %23DigitalLeap | | http://twitter.com/search?q=%23DigitalLeap | 2017-04-25T17:00:05Z | 2017-04-25T16:58:22Z | Toronto | 4118 |
| … | … | … | … | … | … | … | … | … | … |
| 0 | #nusnc17 | | %23nusnc17 | | http://twitter.com/search?q=%23nusnc17 | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham | 12723 |
| 1 | #WorldPenguinDay | | %23WorldPenguinDay | 79006 | http://twitter.com/search?q=%23WorldPenguinDay | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham | 12723 |
| 2 | #littleboyblue | | %23littleboyblue | 20772 | http://twitter.com/search?q=%23littleboyblue | 2017-04-25T17:00:05Z | 2017-04-25T16:58:24Z | Birmingham | 12723 |
+----+--------------------------------+------------------+----------------------------------+--------------+--------------------------------------------------------------+----------------------+----------------------+---------------+----------------+
Here is the function that loads the JSON into a DataFrame:
import pandas as pd

def trends_to_dataframe(data):
    df = pd.DataFrame()
    for location in data:
        # Build one DataFrame per location, one row per trend
        temp_df = pd.DataFrame()
        for trend in location[0]['trends']:
            temp_df = temp_df.append(pd.Series(trend), ignore_index=True)

        # Attach the location-level metadata to every row
        temp_df['as_of'] = location[0]['as_of']
        temp_df['created_at'] = location[0]['created_at']
        temp_df['location_name'] = location[0]['locations'][0]['name']
        temp_df['location_woeid'] = location[0]['locations'][0]['woeid']

        df = df.append(temp_df)
    return df
Unfortunately, given the amount of data I have (and some simple timers I ran), this will take roughly 4 hours to finish. Any ideas on how to speed it up?
Answer 0 (score: 3)
You can speed this up by flattening the data with concurrent.futures and then loading it all into a DataFrame in one go with from_records.
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def get_trends(location):
    # Flatten one location block into a list of plain dicts,
    # copying the location-level metadata onto every trend.
    trends = []
    for trend in location[0]['trends']:
        trend['as_of'] = location[0]['as_of']
        trend['created_at'] = location[0]['created_at']
        trend['location_name'] = location[0]['locations'][0]['name']
        trend['location_woeid'] = location[0]['locations'][0]['woeid']
        trends.append(trend)
    return trends

flat_data = []
with ThreadPoolExecutor() as executor:
    # Flatten the location blocks concurrently and collect the results
    for trends in executor.map(get_trends, data):
        flat_data += trends

df = pd.DataFrame.from_records(flat_data)
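For what it's worth, most of the gain here likely comes from building a flat list of dicts and calling from_records once, rather than calling DataFrame.append inside a loop (which copies the growing frame on every call); the thread pool is secondary, since the flattening work is pure-Python and CPU-bound. A minimal alternative sketch, assuming the same nested `data` list as above and a pandas version recent enough to expose pd.json_normalize at the top level, lets json_normalize flatten each location and concatenates once at the end:

import pandas as pd

frames = []
for location in data:
    block = location[0]
    # One DataFrame per location: json_normalize flattens the trend dicts
    frame = pd.json_normalize(block['trends'])
    frame['as_of'] = block['as_of']
    frame['created_at'] = block['created_at']
    frame['location_name'] = block['locations'][0]['name']
    frame['location_woeid'] = block['locations'][0]['woeid']
    frames.append(frame)

# Concatenate once instead of appending row by row
df = pd.concat(frames, ignore_index=True)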