Question

This is part of the JSON file I got as output after running a Python script that uses the Telethon API.

[{"_": "Message", "id": 4589, "to_id": {"_": "PeerChannel", "channel_id": 1399858792}, "date": "2020-09-03T14:51:03+00:00", "message": "Looking for product managers / engineers who have worked in search engine / query understanding space. Please PM me if you can connect me to someone for the same", "out": false, "mentioned": false, "media_unread": false, "silent": false, "post": false, "from_scheduled": false, "legacy": false, "edit_hide": false, "from_id": 356886523, "fwd_from": null, "via_bot_id": null, "reply_to_msg_id": null, "media": null, "reply_markup": null, "entities": [], "views": null, "edit_date": null, "post_author": null, "grouped_id": null, "restriction_reason": []}, {"_": "MessageService", "id": 4588, "to_id": {"_": "PeerChannel", "channel_id": 1399858792}, "date": "2020-09-03T11:48:18+00:00", "action": {"_": "MessageActionChatJoinedByLink", "inviter_id": 310378430}, "out": false, "mentioned": false, "media_unread": false, "silent": false, "post": false, "legacy": false, "from_id": 1264437394, "reply_to_msg_id": null}

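As you can see, the Python script has scraped the chat history from a particular Telegram channel. What I need is to store the date and message parts of the JSON in a separate DataFrame, so that I can apply suitable filters and produce the appropriate output. Can anyone help me with this?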

Answer 1

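I think you should parse the JSON with json.loads and then use json_normalize to convert it into a DataFrame, with max_level controlling how deeply the nested dicts are flattened.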

from pandas import json_normalize
import json
d = '[{"_": "Message", "id": 4589, "to_id": {"_": "PeerChannel", "channel_id": 1399858792}, "date": "2020-09-03T14:51:03+00:00", "message": "Looking for product managers / engineers who have worked in search engine / query understanding space. Please PM me if you can connect me to someone for the same", "out": false, "mentioned": false, "media_unread": false, "silent": false, "post": false, "from_scheduled": false, "legacy": false, "edit_hide": false, "from_id": 356886523, "fwd_from": null, "via_bot_id": null, "reply_to_msg_id": null, "media": null, "reply_markup": null, "entities": [], "views": null, "edit_date": null, "post_author": null, "grouped_id": null, "restriction_reason": []}, {"_": "MessageService", "id": 4588, "to_id": {"_": "PeerChannel", "channel_id": 1399858792}, "date": "2020-09-03T11:48:18+00:00", "action": {"_": "MessageActionChatJoinedByLink", "inviter_id": 310378430}, "out": false, "mentioned": false, "media_unread": false, "silent": false, "post": false, "legacy": false, "from_id": 1264437394, "reply_to_msg_id": null}]'
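# parse the JSON string into a list of dicts
f = json.loads(d)
# flatten nested dicts (up to two levels deep) into DataFrame columns
print(json_normalize(f, max_level=2))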
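
Since the question only asks for the date and message parts, the flattened frame can be narrowed to those two columns. The following is a minimal sketch, assuming a shortened, hypothetical two-record sample with the same shape as the output in the question; the names `raw`, `records`, `flat` and `subset` are illustrative only.

```python
import json
from pandas import json_normalize

# Hypothetical, shortened sample shaped like the question's output
raw = '''[
  {"_": "Message", "id": 4589,
   "to_id": {"_": "PeerChannel", "channel_id": 1399858792},
   "date": "2020-09-03T14:51:03+00:00",
   "message": "Looking for product managers"},
  {"_": "MessageService", "id": 4588,
   "to_id": {"_": "PeerChannel", "channel_id": 1399858792},
   "date": "2020-09-03T11:48:18+00:00",
   "action": {"_": "MessageActionChatJoinedByLink", "inviter_id": 310378430}}
]'''

records = json.loads(raw)
flat = json_normalize(records, max_level=2)

# Nested keys become dotted column names such as "to_id.channel_id";
# records without a "message" key get a missing value in that column
subset = flat[["date", "message"]]
print(subset)
```

Service messages such as joins carry no message text, so their rows show a missing value in the message column.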

Answer 2

This assumes the object returned from the API is not a string (e.g. '[{...}, {...}]').
- If it is a string, first use data = json.loads(data).
The 'date' and the corresponding 'message' can be extracted from the list of dicts with a list comprehension.
Iterate through each dict in the list and use dict.get for each key. If the key does not exist, None is returned.

import pandas as pd

# where data is the list of dicts, unpack the desired keys and load into pandas
df = pd.DataFrame([{'date': i.get('date'), 'message': i.get('message')} for i in data])

# display(df)
                        date  message
0  2020-09-03T14:51:03+00:00  Looking for product managers / engineers who have worked in search engine / query understanding space. Please PM me if you can connect me to someone for the same
1  2020-09-03T11:48:18+00:00  None

Alternatively, if you wish to skip the entries where 'message' is None, filter them out in the list comprehension.
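
The variant that skips entries without a 'message' might look something like the sketch below. It is not the original answer's code: `data` here is a hypothetical, shortened stand-in for the list of dicts returned by the API, and the date parsing and cutoff timestamp are added assumptions to illustrate the kind of filter the question mentions.

```python
import pandas as pd

# Hypothetical sample: stands in for the list of dicts returned by the API
data = [
    {"_": "Message", "date": "2020-09-03T14:51:03+00:00",
     "message": "Looking for product managers"},
    {"_": "MessageService", "date": "2020-09-03T11:48:18+00:00"},
]

# Keep only the entries that carry a non-empty 'message'
rows = [
    {"date": i.get("date"), "message": i.get("message")}
    for i in data
    if i.get("message")
]
df = pd.DataFrame(rows)

# Parse the ISO 8601 strings into timezone-aware timestamps
df["date"] = pd.to_datetime(df["date"])

# Example filter (assumed cutoff): messages sent at or after 12:00 UTC
cutoff = pd.Timestamp("2020-09-03T12:00:00+00:00")
recent = df[df["date"] >= cutoff]
print(recent)
```

Note that `if i.get("message")` also drops empty strings, not only missing keys; use `if i.get("message") is not None` to keep empty messages.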

How to extract data from a list of dicts into a pandas DataFrame?
