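I have a CSV with more than 500 rows in which one column, "_source", is stored as JSON. I want to extract it into a pandas DataFrame, with each key becoming its own column.
I have a 1 MB JSON file of online social media data, and I need to convert the dictionaries and their key-value pairs into separate columns. The social media data comes from Facebook, Twitter, web scraping, etc.
There are about 528 separate rows of posts/tweets/text, and each row contains many dictionaries nested inside dictionaries.
I'll attach a few steps from my Jupyter notebook below to give a fuller picture. I need all key-value pairs of the nested dictionaries converted into columns of the DataFrame.
I tried to turn it into a DataFrame by doing this: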
source = pd.DataFrame.from_dict(source, orient='columns')
It returns something like this. I thought it might unpack the dictionaries, but it didn't.
source.head()
_source
0 {'sub_organization_id': 'default', 'uid': 'aba...
1 {'sub_organization_id': 'default', 'uid': 'ab0...
2 {'sub_organization_id': 'default', 'uid': 'ac0...
Here is the shape:
source.shape
(528, 1)
Below is a sample row of "_source". It has many dictionaries and key:value pairs, and every key needs to be its own column.
{
'sub_organization_id': 'default',
'uid': 'ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b',
'project_veid': 'default',
'campaign_id': 'default',
'organization_id': 'default',
'meta': {
'rule_matcher': [{
'atribs': {
'website': 'github.com/res',
'source': 'Explicit',
'version': '1.1',
'type': 'crawl'
},
'results': [{
'rule_type': 'hashtag',
'rule_tag': 'Far',
'description': None,
'project_veid': 'A7180EA-7078-0C7F-ED5D-86AD7',
'campaign_id': '2A6DA0C-365BB-67DD-B05830920',
'value': '#Far',
'organization_id': None,
'sub_organization_id': None,
'appid': 'ray',
'project_id': 'CDE2F42-5B87-C594-C900E578C',
'rule_id': '1838',
'node_id': None,
'metadata': {
'campaign_title': 'AF',
'project_title': 'AF '
}
}
]
}
],
'render': [{
'attribs': {
'website': 'github.com/res',
'version': '1.0',
'type': 'Page Render'
},
'results': [{
'render_status': 'success',
'path': 'https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg',
'image_hash': 'bb7674b8ea3fc05bfd027a19815f82c',
'url': 'https://discooprdapp.com/',
'load_time': 32
}
]
}
]
},
'norm_attribs': {
'website': 'github.com/res',
'version': '1.1',
'type': 'crawl'
},
'project_id': 'default',
'system_timestamp': '2019-02-22T19:04:53.569623',
'doc': {
'appid': 'subtter',
'links': [],
'response_url': 'https://discooprdapp.com',
'url': 'https://discooprdapp.com/',
'status_code': 200,
'status_msg': 'OK',
'encoding': 'utf-8',
'attrs': {
'uid': '2ab8f2651cb32261b911c990a8b'
},
'timestamp': '2019-02-22T19:04:53.963',
'crawlid': '7fd95-785-4dd259-fcc-8752f'
},
'type': 'crawl',
'norm': {
'body': '\n',
'domain': 'discordapp.com',
'author': 'crawl',
'url': 'https://discooprdapp.com',
'timestamp': '2019-02-22T19:04:53.961283+00:00',
'id': '7fc5-685-4dd9-cc-8762f'
}
}
Answer 0 (score: 1)
Create a list from all of the rows in the _source column:
_source_list = df._source.tolist()
Then flatten each of the dicts in _source_list with the following function:
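def flatten_json(nested_json: dict, exclude: tuple = ('',)) -> dict:
    """
    Flatten one nested dict into a single-level dict.

    Nested keys and list indices are joined with '_'; any key listed in
    `exclude` is skipped at every level.
    """
    out = {}

    def flatten(x, name: str = ''):
        if isinstance(x, dict):
            for key, value in x.items():
                if key not in exclude:
                    flatten(value, f'{name}{key}_')
        elif isinstance(x, list):
            # List items are keyed by their position: 0, 1, 2, ...
            for i, item in enumerate(x):
                flatten(item, f'{name}{i}_')
        else:
            # Leaf value: drop the trailing '_' from the accumulated name.
            out[name[:-1]] = x

    flatten(nested_json)
    return out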
Apply flatten_json to every dict and build the DataFrame from the results:
df_source = pd.DataFrame([flatten_json(x) for x in _source_list])
sub_organization_id uid project_veid campaign_id organization_id meta_rule_matcher_0_atribs_website meta_rule_matcher_0_atribs_source meta_rule_matcher_0_atribs_version meta_rule_matcher_0_atribs_type meta_rule_matcher_0_results_0_rule_type meta_rule_matcher_0_results_0_rule_tag meta_rule_matcher_0_results_0_description meta_rule_matcher_0_results_0_project_veid meta_rule_matcher_0_results_0_campaign_id meta_rule_matcher_0_results_0_value meta_rule_matcher_0_results_0_organization_id meta_rule_matcher_0_results_0_sub_organization_id meta_rule_matcher_0_results_0_appid meta_rule_matcher_0_results_0_project_id meta_rule_matcher_0_results_0_rule_id meta_rule_matcher_0_results_0_node_id meta_rule_matcher_0_results_0_metadata_campaign_title meta_rule_matcher_0_results_0_metadata_project_title meta_render_0_attribs_website meta_render_0_attribs_version meta_render_0_attribs_type meta_render_0_results_0_render_status meta_render_0_results_0_path meta_render_0_results_0_image_hash meta_render_0_results_0_url meta_render_0_results_0_load_time norm_attribs_website norm_attribs_version norm_attribs_type project_id system_timestamp doc_appid doc_response_url doc_url doc_status_code doc_status_msg doc_encoding doc_attrs_uid doc_timestamp doc_crawlid type norm_body norm_domain norm_author norm_url norm_timestamp norm_id
default ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b default default default github.com/res Explicit 1.1 crawl hashtag Far None A7180EA-7078-0C7F-ED5D-86AD7 2A6DA0C-365BB-67DD-B05830920 #Far None None ray CDE2F42-5B87-C594-C900E578C 1838 None AF AF github.com/res 1.0 Page Render success https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg bb7674b8ea3fc05bfd027a19815f82c https://discooprdapp.com/ 32 github.com/res 1.1 crawl default 2019-02-22T19:04:53.569623 subtter https://discooprdapp.com https://discooprdapp.com/ 200 OK utf-8 2ab8f2651cb32261b911c990a8b 2019-02-22T19:04:53.963 7fd95-785-4dd259-fcc-8752f crawl \n discordapp.com crawl https://discooprdapp.com 2019-02-22T19:04:53.961283+00:00 7fc5-685-4dd9-cc-8762f
default ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b default default default github.com/res Explicit 1.1 crawl hashtag Far None A7180EA-7078-0C7F-ED5D-86AD7 2A6DA0C-365BB-67DD-B05830920 #Far None None ray CDE2F42-5B87-C594-C900E578C 1838 None AF AF github.com/res 1.0 Page Render success https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg bb7674b8ea3fc05bfd027a19815f82c https://discooprdapp.com/ 32 github.com/res 1.1 crawl default 2019-02-22T19:04:53.569623 subtter https://discooprdapp.com https://discooprdapp.com/ 200 OK utf-8 2ab8f2651cb32261b911c990a8b 2019-02-22T19:04:53.963 7fd95-785-4dd259-fcc-8752f crawl \n discordapp.com crawl https://discooprdapp.com 2019-02-22T19:04:53.961283+00:00 7fc5-685-4dd9-cc-8762f
default ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b default default default github.com/res Explicit 1.1 crawl hashtag Far None A7180EA-7078-0C7F-ED5D-86AD7 2A6DA0C-365BB-67DD-B05830920 #Far None None ray CDE2F42-5B87-C594-C900E578C 1838 None AF AF github.com/res 1.0 Page Render success https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg bb7674b8ea3fc05bfd027a19815f82c https://discooprdapp.com/ 32 github.com/res 1.1 crawl default 2019-02-22T19:04:53.569623 subtter https://discooprdapp.com https://discooprdapp.com/ 200 OK utf-8 2ab8f2651cb32261b911c990a8b 2019-02-22T19:04:53.963 7fd95-785-4dd259-fcc-8752f crawl \n discordapp.com crawl https://discooprdapp.com 2019-02-22T19:04:53.961283+00:00 7fc5-685-4dd9-cc-8762f
default ac0fafe9ba98327f2d0c72ddc365ffb76336czsa13280b default default default github.com/res Explicit 1.1 crawl hashtag Far None A7180EA-7078-0C7F-ED5D-86AD7 2A6DA0C-365BB-67DD-B05830920 #Far None None ray CDE2F42-5B87-C594-C900E578C 1838 None AF AF github.com/res 1.0 Page Render success https://east.amanaws.com/rays-ime-store/renders/b/b/70f7dffb8b276f2977f8a13415f82c.jpeg bb7674b8ea3fc05bfd027a19815f82c https://discooprdapp.com/ 32 github.com/res 1.1 crawl default 2019-02-22T19:04:53.569623 subtter https://discooprdapp.com https://discooprdapp.com/ 200 OK utf-8 2ab8f2651cb32261b911c990a8b 2019-02-22T19:04:53.963 7fd95-785-4dd259-fcc-8752f crawl \n discordapp.com crawl https://discooprdapp.com 2019-02-22T19:04:53.961283+00:00 7fc5-685-4dd9-cc-8762f
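The column names in the output above follow from how nested keys and list indices are joined with underscores. As a small self-contained illustration, here is a compact variant of the same idea (the flatten helper and the two toy records are made up for this sketch, not taken from the data). It also shows two side effects worth knowing about: rows with shorter lists get NaN in the extra columns, and empty lists produce no column at all.

```python
import pandas as pd


def flatten(value, prefix=""):
    """Return a flat dict; nested keys and list indices are joined with '_'."""
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        # Leaf value: strip the trailing '_' from the accumulated prefix.
        return {prefix[:-1]: value}
    flat = {}
    for key, child in items:
        flat.update(flatten(child, f"{prefix}{key}_"))
    return flat


# Toy records: the second one has a shorter 'tags' list than the first.
records = [
    {"uid": "a1", "doc": {"status_code": 200, "links": []},
     "tags": [{"v": "x"}, {"v": "y"}]},
    {"uid": "b2", "doc": {"status_code": 404, "links": []},
     "tags": [{"v": "z"}]},
]

df_small = pd.DataFrame([flatten(r) for r in records])
print(df_small.columns.tolist())
# ['uid', 'doc_status_code', 'tags_0_v', 'tags_1_v']
```

Because every list position becomes its own column, data with long or variable-length lists can end up with a very wide, sparse frame.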
Answer 1 (score: 0)
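# pandas >= 1.0: pd.json_normalize replaces the deprecated pd.io.json.json_normalize; columnName is the column holding the JSON strings
pd.json_normalize(source.columnName.apply(json.loads))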
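For a concrete picture of what this approach produces, here is a minimal sketch assuming pandas 1.0 or later and cells stored as Python-literal strings (single quotes, as in the question's sample), which is why it uses ast.literal_eval rather than json.loads. The two-row source frame is invented for illustration. Note that json_normalize expands nested dicts into columns but leaves lists as list objects inside a single cell:

```python
import ast

import pandas as pd

# Invented two-row stand-in for the real 528-row frame.
source = pd.DataFrame({
    "_source": [
        "{'uid': 'a1', 'doc': {'status_code': 200}, 'links': []}",
        "{'uid': 'b2', 'doc': {'status_code': 404}, 'links': ['x']}",
    ]
})

# Turn each string into a dict, then let pandas expand the nested dicts.
parsed = source["_source"].apply(ast.literal_eval)
flat = pd.json_normalize(parsed.tolist(), sep="_")

print(sorted(flat.columns))
# ['doc_status_code', 'links', 'uid']
```

The sep="_" argument makes the column names match the underscore style of the first answer; the default separator is a dot.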