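I am trying to create a DataFrame from a MongoDB collection dump.
I referred to this question to normalize my data, but it doesn't help. The output doesn't contain the file name and ID.
I want the file name and ID to appear in the DataFrame.
Here is a sample of my JSON: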
[
{'FileName': '32252652D.article.0018038745057751440210.tmp',
'_id': {'$oid': '5ced0669acd01707cbf2ew33'},
'section_details': [{'content': 'Efficient Algorithms for Non-convex Isotonic '
'Regression through Submodular Optimization ',
'heading': 'title'},
{'content': 'We consider the minimization of submodular '
'functions subject to ordering constraints. We show that '
'this potentially non-convex optimization problem can '
'be cast as a convex optimization problem on a space of '
'uni-dimensional measures',
'heading': 'abstract'},
{'content': '', 'heading': 'subject'},
{'content': ' Introduction to convex optimization'
'with mean ',
'heading': 'Content'}]},
{'FileName': '32252652D.article.0018038745057751440210.tmp',
'_id': {'$oid': '5ced0669acd01707cbf2ew11'},
'section_details': [{'content': 'Text-Adaptive Generative Adversarial Networks: '
'Manipulating Images with Natural Language ',
'heading': 'title'},
{'content': 'This paper addresses the problem of manipulating '
'images using natural language description. Our '
'task aims to semantically modify visual '
'attributes of an object in an image according '
'to the text describing the new visual',
'heading': 'abstract'},
{'content': '', 'heading': 'subject'},
{'content': ' Introduction to Text-Adaptive Generative Adversarial Networks',
'heading': 'Content'}]}
]
Expected output
Answer 0 (score: 0)
Please let me know whether you would like the output to look like this:
>>> import pandas as pd
>>> import json
>>> j = [
... {'FileName': '32252652D.article.0018038745057751440210.tmp',
... '_id': {'$oid': '5ced0669acd01707cbf2ew33'},
... 'section_details': [{'content': 'Efficient Algorithms for Non-convex Isotonic '
... 'Regression through Submodular Optimization ',
... 'heading': 'title'},
... {'content': 'We consider the minimization of submodular '
... 'functions subject to ordering constraints. We show that '
... 'this potentially non-convex optimization problem can '
... 'be cast as a convex optimization problem on a space of '
... 'uni-dimensional measures',
... 'heading': 'abstract'},
... {'content': '', 'heading': 'subject'},
... {'content': ' Introduction to convex optimization'
... 'with mean ',
... 'heading': 'Content'}]},
... {'FileName': '32252652D.article.0018038745057751440210.tmp',
... '_id': {'$oid': '5ced0669acd01707cbf2ew11'},
... 'section_details': [{'content': 'Text-Adaptive Generative Adversarial Networks: '
... 'Manipulating Images with Natural Language ',
... 'heading': 'title'},
... {'content': 'This paper addresses the problem of manipulating '
... 'images using natural language description. Our '
... 'task aims to semantically modify visual '
... 'attributes of an object in an image according '
... 'to the text describing the new visual',
... 'heading': 'abstract'},
... {'content': '', 'heading': 'subject'},
... {'content': ' Introduction to Text-Adaptive Generative Adversarial Networks',
... 'heading': 'Content'}]}
... ]
>>> pd.DataFrame(j)
FileName _id section_details
0 32252652D.article.0018038745057751440210.tmp {'$oid': '5ced0669acd01707cbf2ew33'} [{'content': 'Efficient Algorithms for Non-con...
1 32252652D.article.0018038745057751440210.tmp {'$oid': '5ced0669acd01707cbf2ew11'} [{'content': 'Text-Adaptive Generative Adversa...
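Answer 1 (score: 0)
The json_normalize method can be passed a list of metadata fields that are added to each record.
Here, assuming js contains the data from the original JSON, you can use: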
df = json_normalize(js, 'section_details',['FileName', '_id'])
You will get:
FileName _id content heading
0 32252652D.article.0018038745057751440210.tmp {'$oid': '5ced0669acd01707cbf2ew33'} Efficient Algorithms for Non-convex Isotonic R... title
1 32252652D.article.0018038745057751440210.tmp {'$oid': '5ced0669acd01707cbf2ew33'} We consider the minimization of submodular fu... abstract
2 32252652D.article.0018038745057751440210.tmp {'$oid': '5ced0669acd01707cbf2ew33'} subject
3 32252652D.article.0018038745057751440210.tmp {'$oid': '5ced0669acd01707cbf2ew33'} Introduction to convex optimizationwith mean Content
4 32252652D.article.0018038745057751440210.tmp {'$oid': '5ced0669acd01707cbf2ew11'} Text-Adaptive Generative Adversarial Networks:... title
5 32252652D.article.0018038745057751440210.tmp {'$oid': '5ced0669acd01707cbf2ew11'} This paper addresses the problem of manipulati... abstract
6 32252652D.article.0018038745057751440210.tmp {'$oid': '5ced0669acd01707cbf2ew11'} subject
7 32252652D.article.0018038745057751440210.tmp {'$oid': '5ced0669acd01707cbf2ew11'} Introduction to Text-Adaptive Generative Adve... Content
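A possible variant of the call above, shown as a minimal sketch: json_normalize also accepts nested paths in its meta argument, so the inner '$oid' value can be pulled out during normalization. The sample below is an abbreviated stand-in for the question's data (the file names, IDs, and contents are made up for illustration), and it assumes pandas 1.0 or later, where pd.json_normalize is available at the top level.

```python
import pandas as pd

# Abbreviated, hypothetical stand-in for the question's data.
js = [
    {'FileName': 'a.tmp',
     '_id': {'$oid': 'id-1'},
     'section_details': [{'content': 'Title one', 'heading': 'title'},
                         {'content': 'Abstract one', 'heading': 'abstract'}]},
    {'FileName': 'b.tmp',
     '_id': {'$oid': 'id-2'},
     'section_details': [{'content': 'Title two', 'heading': 'title'},
                         {'content': 'Abstract two', 'heading': 'abstract'}]},
]

# The nested meta path ['_id', '$oid'] extracts the inner string directly.
# The resulting column is named '_id.$oid' (path parts joined with '.').
df = pd.json_normalize(js, record_path='section_details',
                       meta=['FileName', ['_id', '$oid']])

# Rename to the plain '_id' name used in the rest of the answer.
df = df.rename(columns={'_id.$oid': '_id'})
print(df)
```

With the call as originally written, the _id column still holds a dict such as {'$oid': ...} in every row.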
After that, you still have to fix the _id column and pivot the DataFrame. In the end, you can use:
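# json_normalize is a top-level pandas function since pandas 1.0
from pandas import json_normalize

# extract the relevant fields
df = json_normalize(js, 'section_details', ['FileName', '_id'])
# fix the _id column: keep only the '$oid' string
df['_id'] = df['_id'].apply(lambda x: x['$oid'])
# pivot to get back the expected columns (pandas >= 2.0 requires keyword arguments here)
resul = df.groupby('FileName').apply(lambda x: x.pivot(
    index='_id', columns='heading', values='content')).reset_index().rename_axis('', axis=1)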
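The pivot step above can also be sketched without the groupby, assuming pandas 1.1 or later, where DataFrame.pivot accepts a list of index columns. The long-format frame below is a hypothetical, shortened stand-in for the normalized data, with _id already reduced to a string.

```python
import pandas as pd

# Hypothetical long-format data: one row per (document, heading) pair.
df = pd.DataFrame({
    'FileName': ['a.tmp', 'a.tmp', 'b.tmp', 'b.tmp'],
    '_id': ['id-1', 'id-1', 'id-2', 'id-2'],
    'heading': ['title', 'abstract', 'title', 'abstract'],
    'content': ['Title one', 'Abstract one', 'Title two', 'Abstract two'],
})

# Pivot on both identifying columns at once, so each document becomes one row
# and each heading becomes a column.
wide = (df.pivot(index=['FileName', '_id'], columns='heading', values='content')
          .reset_index()
          .rename_axis(None, axis=1))
print(wide)
```

Note that pivot raises a ValueError if the same heading occurs twice within one document, so this sketch assumes headings are unique per _id.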
Alternatively, you can build the DataFrame rows by hand, directly from each record of the original JSON:
resul = pd.DataFrame([dict([('FileName', j['FileName']), ('_id', j['_id']['$oid'])]
                           + list({sd['heading']: sd['content'] for sd in j['section_details']
                                   }.items())) for j in js]).reindex(columns=['FileName',
                           '_id', 'title', 'abstract', 'subject', 'Content'])
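The same manual construction may be easier to read as an explicit loop. This is only a sketch of the idea above, run against a hypothetical, shortened sample. The column list is the one from the answer, and any heading missing from a record simply becomes NaN.

```python
import pandas as pd

# Hypothetical, shortened sample with the same shape as the question's data.
js = [
    {'FileName': 'a.tmp',
     '_id': {'$oid': 'id-1'},
     'section_details': [{'content': 'Title one', 'heading': 'title'},
                         {'content': 'Abstract one', 'heading': 'abstract'},
                         {'content': '', 'heading': 'subject'},
                         {'content': 'Body one', 'heading': 'Content'}]},
]

rows = []
for doc in js:
    # Start each row with the two identifying fields.
    row = {'FileName': doc['FileName'], '_id': doc['_id']['$oid']}
    # Turn every section into a column named after its heading.
    for sd in doc['section_details']:
        row[sd['heading']] = sd['content']
    rows.append(row)

# Passing columns= fixes the column order.
resul = pd.DataFrame(rows, columns=['FileName', '_id', 'title',
                                    'abstract', 'subject', 'Content'])
print(resul)
```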