Flattening a column that contains JSON arrays in Pandas

Time: 2019-06-10 12:09:30

Tags: python arrays json pandas postgresql

I have a table in PostgreSQL, a_table, in which one column, previous_names, is stored as an array of JSON values: CREATE TABLE a_table (..., previous_names JSON[], ...)

I load the table into a pandas df with this snippet:

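import pandas as pd
import psycopg2

DBNAME = "dname"
USER = "uame"
conn = psycopg2.connect("dbname={} user={}".format(DBNAME, USER))
cur = conn.cursor()

# Point the session at the schema that holds a_table
cur.execute("SET search_path TO schema_name")
conn.commit()

# Read the whole table into a DataFrame
sql = "select * from a_table"
data = pd.read_sql_query(sql, conn)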

When I download the CSV and load it into a pandas df, the column in question contains arrays of JSON objects (of variable length).

So one record would be:

[
 {
  "effective_from": "2006-08-02",
  "ceased_on": "2006-08-16",
  "name": "SUPERSTAY LIMITED"
 }
]

Another would be:

[
  {
    "effective_from": "2006-09-19",
    "ceased_on": "2012-01-31",
    "name": "MCM SYSTEMS (PIB) LIMITED"
  },
  {
    "ceased_on": "2006-09-19",
    "effective_from": "2006-07-24",
    "name": "MCM SYSTEMS (FDT) LIMITED"
  }
]

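The key:value pairs in this column are not always the same, and a record can also be NaN.

What is the best way to flatten this column in pandas?

I tried this, but it did not work: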

json_normalize(data=data[data.previous_company_names != None])

AttributeError: 'str' object has no attribute 'values'

Ideally, I would end up with a df in which that column is flattened.

Before

col_id | col_name   | previous_names
-------+------------+-----------------
1      | 'Corp.'    | [{"effective_from": "2006-08-02","ceased_on": "2006-08-16","name": "SUPERSTAY LIMITED"}]
2      | 'Company'  | [{"effective_from": "2006-09-19","ceased_on": "2012-01-31","name": "MCM SYSTEMS (PIB) LIMITED"}, {"ceased_on": "2006-09-19","effective_from": "2006-07-24","name": "MCM SYSTEMS (FDT) LIMITED"}]
3      | 'Entr'     | None

After

col_id | col_name   | effective_from   |  ceased_on   | name
-------+------------+------------------+--------------+------------------------------
1      | 'Corp.'    | '2006-08-02'     | '2006-08-16' | 'SUPERSTAY LIMITED'
2      | 'Company'  | '2006-09-19'     | '2012-01-31' | 'MCM SYSTEMS (PIB) LIMITED'
2      | 'Company'  | '2006-07-24'     | '2006-09-19' | 'MCM SYSTEMS (FDT) LIMITED'
3      | 'Entr'     | None             | None         | None

Maybe this is too complex for pandas and should be done in PostgreSQL instead?

1 Answer:

Answer 0: (score: 0)

If you have a DataFrame like the one in your example:

import numpy as np
import pandas as pd

dd = [
    {
        'col_id': 0,
        'col_name': 'Corp.',
        'previous_names': [
            {
                "effective_from": "2006-08-02",
                "ceased_on": "2006-08-16",
                "name": "SUPERSTAY LIMITED"
            }
        ]
    },
    {
        'col_id': 1,
        'col_name': 'Company',
        'previous_names': [
            {
                "effective_from": "2006-09-19",
                "ceased_on": "2012-01-31",
                "name": "MCM SYSTEMS (PIB) LIMITED"
            },
            {
                "ceased_on": "2006-09-19",
                "effective_from": "2006-07-24",
                "name": "MCM SYSTEMS (FDT) LIMITED"
            }
        ]
    },
    {
        'col_id': 2,
        'col_name': 'Entr',
        'previous_names': None
    }
]
ddf = pd.DataFrame(dd)
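
The example above assumes each cell already holds a Python list of dicts. The AttributeError in the question ('str' object has no attribute 'values') suggests that, at least after the CSV round trip, the cells are strings instead. A minimal sketch of converting them first, assuming every non-null cell is a valid JSON array string; parse_json_cell is a hypothetical helper name and the two-row frame is only illustrative:

```python
import json

import pandas as pd


def parse_json_cell(value):
    """Hypothetical helper: turn a JSON array string into a list of dicts.

    Lists pass through unchanged; None, NaN, unparsable strings, and
    JSON values that are not arrays all come back as None.
    """
    if isinstance(value, list):
        return value
    if isinstance(value, str):
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return None
        return parsed if isinstance(parsed, list) else None
    return None


# Illustrative data only: one cell as a JSON string, one missing.
raw = pd.DataFrame({
    "col_id": [1, 3],
    "col_name": ["Corp.", "Entr"],
    "previous_names": [
        '[{"effective_from": "2006-08-02", "ceased_on": "2006-08-16", '
        '"name": "SUPERSTAY LIMITED"}]',
        None,
    ],
})
raw["previous_names"] = raw["previous_names"].apply(parse_json_cell)
```

After this step the column has the same shape as previous_names in ddf, so the loop in this answer can be applied to it. If the CSV holds a PostgreSQL array literal rather than a JSON array, json.loads will not parse it and the cell will come back as None.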

You can use iterrows():

col_name, col_id, effective_from, ceased_on, name = [], [], [], [], []
for _, row in ddf.iterrows():
    names = row['previous_names']
    # None, NaN, or an empty list: emit a single row of NaN values
    if not isinstance(names, list) or not names:
        names = [{}]
    for x in names:
        col_id.append(row['col_id'])
        col_name.append(row['col_name'])
        # .get() falls back to NaN when an entry lacks a key
        effective_from.append(x.get('effective_from', np.nan))
        ceased_on.append(x.get('ceased_on', np.nan))
        name.append(x.get('name', np.nan))

pd.DataFrame({'col_id': col_id, 'col_name': col_name, 'effective_from': effective_from, 'ceased_on': ceased_on, 'name': name})

and get what you want:

    col_id  col_name    effective_from  ceased_on   name
0   0          Corp.    2006-08-02     2006-08-16   SUPERSTAY LIMITED
1   1          Company  2006-09-19     2012-01-31   MCM SYSTEMS (PIB) LIMITED
2   1          Company  2006-07-24     2006-09-19   MCM SYSTEMS (FDT) LIMITED
3   2          Entr     NaN            NaN          NaN
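
The same result can probably be reached without an explicit Python loop over rows. The sketch below is an alternative to the iterrows() approach, not part of the original answer: it relies on DataFrame.explode (available from pandas 0.25) and assumes each cell is a list of dicts, None, or NaN. flatten_list_column is a hypothetical helper name.

```python
import pandas as pd

# Sample data mirroring the example in this answer.
ddf = pd.DataFrame([
    {"col_id": 0, "col_name": "Corp.", "previous_names": [
        {"effective_from": "2006-08-02", "ceased_on": "2006-08-16",
         "name": "SUPERSTAY LIMITED"},
    ]},
    {"col_id": 1, "col_name": "Company", "previous_names": [
        {"effective_from": "2006-09-19", "ceased_on": "2012-01-31",
         "name": "MCM SYSTEMS (PIB) LIMITED"},
        {"ceased_on": "2006-09-19", "effective_from": "2006-07-24",
         "name": "MCM SYSTEMS (FDT) LIMITED"},
    ]},
    {"col_id": 2, "col_name": "Entr", "previous_names": None},
])


def flatten_list_column(df, column):
    """Hypothetical helper: one output row per dict found in df[column]."""
    out = df.copy()
    # None, NaN, or [] become [{}] so the row survives explode()
    # and ends up with NaN in every expanded column.
    out[column] = out[column].apply(
        lambda v: v if isinstance(v, list) and v else [{}]
    )
    out = out.explode(column).reset_index(drop=True)
    # Expand each dict into columns; keys absent from an entry become NaN.
    details = pd.DataFrame(out[column].tolist())
    return pd.concat([out.drop(columns=[column]), details], axis=1)


flat = flatten_list_column(ddf, "previous_names")
print(flat)
```

Because the expanded columns come from whatever keys appear in the dicts, this version does not need the key names listed in advance. It does assume that none of those keys collide with an existing column name such as col_id.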