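I'm fairly new to JSON and Python, and for the past two days I have been struggling to flatten JSON.
I read the example at http://pandas.pydata.org/pandas-docs/version/0.19/generated/pandas.io.json.json_normalize.html, but I don't understand how to unnest some of the nested elements. I also read several threads ("Flatten JSON based on an attribute - python" and "How to normalize complex nested json in python?") as well as https://towardsdatascience.com/flattening-json-objects-in-python-f5343c794b10. I tried all of them with no luck.
Here is the first record of my JSON file: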
d = {'city': {'url': 'link',
'name': ['San Francisco']},
'rank': 1,
'resident': [
{'link': ['bit.ly/0842/'], 'name': ['John A']},
{'link': ['bit.ly/5835/'], 'name': ['Tedd B']},
{'link': ['bit.ly/2011/'], 'name': ['Cobb C']},
{'link': ['bit.ly/0855/'], 'name': ['Jack N']},
{'link': ['bit.ly/1430/'], 'name': ['Jack K']},
{'link': ['bit.ly/3081/'], 'name': ['Edward']},
{'link': ['bit.ly/2001/'], 'name': ['Jack W']},
{'link': ['bit.ly/0020/'], 'name': ['Henry F']},
{'link': ['bit.ly/2137/'], 'name': ['Joseph S']},
{'link': ['bit.ly/3225/'], 'name': ['Ed B']},
{'link': ['bit.ly/3667/'], 'name': ['George Vvec']},
{'link': ['bit.ly/6434/'], 'name': ['Robert W']},
{'link': ['bit.ly/4036/'], 'name': ['Rudy B']},
{'link': ['bit.ly/6450/'], 'name': ['James K']},
{'link': ['bit.ly/5180/'], 'name': ['Billy N']},
{'link': ['bit.ly/7847/'], 'name': ['John S']}]
}
This is the expected output:
city_url city_name rank resident_link resident_name
link San Francisco 1 'bit.ly/0842/' 'John A'
link San Francisco 1 'bit.ly/5835/' 'Tedd B'
link San Francisco 1 'bit.ly/2011/' 'Cobb C'
link San Francisco 1 'bit.ly/0855/' 'Jack N'
link San Francisco 1 'bit.ly/1430/' 'Jack K'
link San Francisco 1 'bit.ly/3081/' 'Edward'
link San Francisco 1 'bit.ly/2001/' 'Jack W'
link San Francisco 1 'bit.ly/0020/' 'Henry F'
link San Francisco 1 'bit.ly/2137/' 'Joseph S'
link San Francisco 1 'bit.ly/3225/' 'Ed B'
link San Francisco 1 'bit.ly/3667/' 'George Vvec'
link San Francisco 1 'bit.ly/6434/' 'Robert W'
link San Francisco 1 'bit.ly/4036/' 'Rudy B'
link San Francisco 1 'bit.ly/6450/' 'James K'
link San Francisco 1 'bit.ly/5180/' 'Billy N'
link San Francisco 1 'bit.ly/7847/' 'John S'
The flatten_json() function (from the Medium.com article linked above) destroys the hierarchy. Here are the first few lines of its output:
{'city_url': 'link',
'city_name_0': 'San Francisco',
'rank': 1,
'resident_0_link_0': 'bit.ly/0842/',
'resident_0_name_0': 'John A', ...
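Can someone help me with how to think about transforming datasets like this? Unfortunately, the pandas documentation doesn't offer guidance for beginners. This is what I have been playing with. Nothing worked.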
from pandas.io.json import json_normalize
json_normalize(d,['city',['name','rank']])
json_normalize(d,['city','name','rank'])
json_normalize(d,['city','name'])
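I would appreciate it if someone could walk me through how to do this kind of transformation and the thought process behind it.
Also, because the original dataset is large, I'm looking for vectorized or O(N) operations rather than O(N^2) operations. Anything slower than O(N) won't work.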
Answer 0 (score: 1)
If you know the structure of the JSON blob, this will do:
import pandas

# Each resident's 'link' and 'name' is a one-element list; take its first item.
resident_link = [k['link'][0] for k in d['resident']]
resident_name = [k['name'][0] for k in d['resident']]
n = len(d['resident'])
# Repeat the record-level fields once per resident.
city_url = n * [d['city']['url']]
city_name = n * [d['city']['name'][0]]
rank = n * [d['rank']]
df = pandas.DataFrame({
    'resident_name' : resident_name,
    'resident_link' : resident_link,
    'city_url' : city_url,
    'city_name' : city_name,
    'rank' : rank
})
which produces
city_name city_url rank resident_link resident_name
0 San Francisco link 1 bit.ly/0842/ John A
1 San Francisco link 1 bit.ly/5835/ Tedd B
2 San Francisco link 1 bit.ly/2011/ Cobb C
3 San Francisco link 1 bit.ly/0855/ Jack N
4 San Francisco link 1 bit.ly/1430/ Jack K
5 San Francisco link 1 bit.ly/3081/ Edward
6 San Francisco link 1 bit.ly/2001/ Jack W
7 San Francisco link 1 bit.ly/0020/ Henry F
8 San Francisco link 1 bit.ly/2137/ Joseph S
9 San Francisco link 1 bit.ly/3225/ Ed B
10 San Francisco link 1 bit.ly/3667/ George Vvec
11 San Francisco link 1 bit.ly/6434/ Robert W
12 San Francisco link 1 bit.ly/4036/ Rudy B
13 San Francisco link 1 bit.ly/6450/ James K
14 San Francisco link 1 bit.ly/5180/ Billy N
15 San Francisco link 1 bit.ly/7847/ John S
EDIT
As the OP said in the comments, imagine there are many such records, each with the same structure:
nrecords = 10
dd = {k : d for k in range(nrecords)}
dd now holds 10 copies of the original JSON blob. This is how the code should be updated.
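A minimal sketch of one way to make that update, not necessarily the answerer's original code. It assumes dd maps record keys to blobs shaped like d, and that every 'link' and 'name' value is a one-element list. The helper name flatten_records, the COLUMNS list, and the shortened two-resident sample record are made up for illustration. Rows are built in a single pass, so the work grows linearly with the total number of residents.

```python
# Shortened stand-in for the record `d` from the question (two residents only).
d = {'city': {'url': 'link', 'name': ['San Francisco']},
     'rank': 1,
     'resident': [
         {'link': ['bit.ly/0842/'], 'name': ['John A']},
         {'link': ['bit.ly/5835/'], 'name': ['Tedd B']}]}

nrecords = 10
dd = {k: d for k in range(nrecords)}

# Hypothetical column order, matching the expected output in the question.
COLUMNS = ['city_url', 'city_name', 'rank', 'resident_link', 'resident_name']


def flatten_records(records):
    """Return one row tuple per resident, visiting each resident once."""
    rows = []
    for rec in records:
        # Record-level fields are looked up once, then repeated per resident.
        city_url = rec['city']['url']
        city_name = rec['city']['name'][0]
        rank = rec['rank']
        for res in rec['resident']:
            rows.append((city_url, city_name, rank,
                         res['link'][0], res['name'][0]))
    return rows


rows = flatten_records(dd.values())
print(len(rows))  # 10 records x 2 residents each
print(rows[0])
```

With pandas installed, pandas.DataFrame(rows, columns=COLUMNS) turns the rows into a table with the same columns as the single-record result.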
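Below is information on the estimated run time as a function of the number of records. Based on this, it would take about 1 hour to process 1.5 million records.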
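A rough sketch of how such a run-time estimate could be measured, assuming records with the same structure as d. The helpers make_record, flatten_records and time_flatten are hypothetical, the records are synthetic, and the absolute numbers will depend on the machine, so none are quoted here.

```python
import time


def make_record(n_residents=16):
    """Build a synthetic record with the same shape as `d` in the question."""
    return {'city': {'url': 'link', 'name': ['San Francisco']},
            'rank': 1,
            'resident': [{'link': ['bit.ly/%04d/' % i],
                          'name': ['Name %d' % i]}
                         for i in range(n_residents)]}


def flatten_records(records):
    """Return one row tuple per resident, visiting each resident once."""
    rows = []
    for rec in records:
        city_url = rec['city']['url']
        city_name = rec['city']['name'][0]
        rank = rec['rank']
        for res in rec['resident']:
            rows.append((city_url, city_name, rank,
                         res['link'][0], res['name'][0]))
    return rows


def time_flatten(nrecords):
    """Return (row count, elapsed seconds) for flattening `nrecords` records."""
    records = [make_record() for _ in range(nrecords)]
    start = time.perf_counter()
    rows = flatten_records(records)
    return len(rows), time.perf_counter() - start


# Time a few input sizes to see how the cost scales with the record count.
for nrecords in (100, 1000, 10000):
    nrows, seconds = time_flatten(nrecords)
    print(nrecords, nrows, round(seconds, 4))
```

If the elapsed time grows roughly in proportion to nrecords, the measurements can be scaled up to estimate the cost for 1.5 million records. This times only the row building, not the construction of the DataFrame.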