I have a table in PostgreSQL, a_table, in which one column, previous_names, is stored as an array of JSON: CREATE TABLE a_table (..., previous_names JSON[], ...).
I load the table into a pandas DataFrame with this snippet:
import pandas as pd
import psycopg2

DBNAME = "dname"
USER = "uame"
conn = psycopg2.connect("dbname={} user={}".format(DBNAME, USER))
cur = conn.cursor()
cur.execute("SET search_path TO schema_name")
conn.commit()
sql = "select * from a_table"
data = pd.read_sql_query(sql, conn)
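When I download the CSV and load it into a pandas DataFrame, the column in question contains arrays of JSON objects (of variable length).
So one record would be: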
[
{
"effective_from": "2006-08-02",
"ceased_on": "2006-08-16",
"name": "SUPERSTAY LIMITED"
}
]
Another would be:
[
{
"effective_from": "2006-09-19",
"ceased_on": "2012-01-31",
"name": "MCM SYSTEMS (PIB) LIMITED"
},
{
"ceased_on": "2006-09-19",
"effective_from": "2006-07-24",
"name": "MCM SYSTEMS (FDT) LIMITED"
}
]
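The key:value pairs in this column are not always the same, and a record can also be NaN.
What is the best way to flatten this column in pandas?
I tried the following, but it did not work: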
json_normalize(data=data[data.previous_company_names != None])
AttributeError: 'str' object has no attribute 'values'
Ideally, I would end up with a DataFrame in which that column is flattened.
Before:
col_id | col_name | previous_names
-------+------------+-----------------
1 | 'Corp.' | [{"effective_from": "2006-08-02","ceased_on": "2006-08-16","name": "SUPERSTAY LIMITED"}]
2 | 'Company' | [{"effective_from": "2006-09-19","ceased_on": "2012-01-31","name": "MCM SYSTEMS (PIB) LIMITED"}, {"ceased_on": "2006-09-19","effective_from": "2006-07-24","name": "MCM SYSTEMS (FDT) LIMITED"}]
3 | 'Entr' | None
After:
col_id | col_name   | effective_from | ceased_on    | name
-------+------------+----------------+--------------+----------------------------
1      | 'Corp.'    | '2006-08-02'   | '2006-08-16' | 'SUPERSTAY LIMITED'
2      | 'Company'  | '2006-09-19'   | '2012-01-31' | 'MCM SYSTEMS (PIB) LIMITED'
2      | 'Company'  | '2006-07-24'   | '2006-09-19' | 'MCM SYSTEMS (FDT) LIMITED'
3      | 'Entr'     | None           | None         | None
Maybe this is too complex for pandas and should be done in PostgreSQL instead?
Answer 0: (score: 0)
If you have a DataFrame like the one in your example:
import pandas as pd

dd = [
    {
        'col_id': 0,
        'col_name': 'Corp.',
        'previous_names': [
            {
                "effective_from": "2006-08-02",
                "ceased_on": "2006-08-16",
                "name": "SUPERSTAY LIMITED"
            }
        ]
    },
    {
        'col_id': 1,
        'col_name': 'Company',
        'previous_names': [
            {
                "effective_from": "2006-09-19",
                "ceased_on": "2012-01-31",
                "name": "MCM SYSTEMS (PIB) LIMITED"
            },
            {
                "ceased_on": "2006-09-19",
                "effective_from": "2006-07-24",
                "name": "MCM SYSTEMS (FDT) LIMITED"
            }
        ]
    },
    {
        'col_id': 2,
        'col_name': 'Entr',
        'previous_names': None
    }
]
ddf = pd.DataFrame(dd)
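One caveat before flattening: if the column was read back from a CSV, each cell is probably text rather than a list of dicts. A minimal sketch of parsing it first, assuming the cell text is valid JSON like the records shown in the question; the helper name parse_json_cell and the small raw frame are made up for illustration:

```python
import json

import numpy as np
import pandas as pd


def parse_json_cell(cell):
    """Decode a JSON string into Python objects; pass other values through."""
    if isinstance(cell, str):
        return json.loads(cell)
    return cell  # already a list, or a missing value such as None/NaN


# Hypothetical frame mimicking a CSV round trip: cells are text or NaN.
raw = pd.DataFrame({
    "col_id": [0, 1],
    "previous_names": ['[{"name": "SUPERSTAY LIMITED"}]', np.nan],
})
raw["previous_names"] = raw["previous_names"].apply(parse_json_cell)
```

After this step the column holds lists of dicts (or missing values), which is the shape the flattening code expects.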
You can use iterrows():
import numpy as np

col_name, col_id, effective_from, ceased_on, name = [], [], [], [], []
for i in ddf.iterrows():
    # i is an (index, row) tuple, so i[1] is the row as a Series
    if i[1].previous_names:
        # emit one output row per previous name
        for x in i[1].previous_names:
            col_id.append(i[1]['col_id'])
            col_name.append(i[1]['col_name'])
            effective_from.append(x['effective_from'])
            ceased_on.append(x['ceased_on'])
            name.append(x['name'])
    else:
        # no previous names: keep the record and fill the new columns with NaN
        col_id.append(i[1]['col_id'])
        col_name.append(i[1]['col_name'])
        effective_from.append(np.nan)
        ceased_on.append(np.nan)
        name.append(np.nan)
pd.DataFrame({'col_id': col_id, 'col_name': col_name, 'effective_from': effective_from, 'ceased_on': ceased_on, 'name': name })
and get what you want:
col_id col_name effective_from ceased_on name
0 0 Corp. 2006-08-02 2006-08-16 SUPERSTAY LIMITED
1 1 Company 2006-09-19 2012-01-31 MCM SYSTEMS (PIB) LIMITED
2 1 Company 2006-07-24 2006-09-19 MCM SYSTEMS (FDT) LIMITED
3 2 Entr NaN NaN NaN
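The same result can probably be reached without an explicit row loop. The following is only a sketch, not part of the original answer: it assumes a pandas version that provides DataFrame.explode (0.25 or newer), rebuilds the example frame so that it runs on its own, and uses the made-up names exploded, records, details and flat. Because the dicts are expanded by the DataFrame constructor, keys that are absent from some records simply come out as NaN.

```python
import pandas as pd

ddf = pd.DataFrame([
    {"col_id": 0, "col_name": "Corp.",
     "previous_names": [{"effective_from": "2006-08-02",
                         "ceased_on": "2006-08-16",
                         "name": "SUPERSTAY LIMITED"}]},
    {"col_id": 1, "col_name": "Company",
     "previous_names": [{"effective_from": "2006-09-19",
                         "ceased_on": "2012-01-31",
                         "name": "MCM SYSTEMS (PIB) LIMITED"},
                        {"ceased_on": "2006-09-19",
                         "effective_from": "2006-07-24",
                         "name": "MCM SYSTEMS (FDT) LIMITED"}]},
    {"col_id": 2, "col_name": "Entr", "previous_names": None},
])

# One row per list element; rows without a list are kept as a single row.
exploded = ddf.explode("previous_names").reset_index(drop=True)

# Expand each dict into columns; a non-dict cell (None/NaN) becomes an
# empty dict, which produces a row of NaN values.
records = [d if isinstance(d, dict) else {} for d in exploded["previous_names"]]
details = pd.DataFrame(records, index=exploded.index)

# Put the original columns and the expanded columns side by side.
flat = pd.concat([exploded.drop(columns="previous_names"), details], axis=1)
print(flat)
```

For columns that arrive as JSON text rather than Python lists, the cells would need to be decoded before explode is called.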