Question

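I have a .csv file with a mix of columns, some of which contain entries in (nested) JSON syntax. I want to extract the relevant data from these columns to get a richer DataFrame for further analysis. I have gone through this tutorial on Kaggle, but could not get the result I need.

To better explain my problem, I have prepared a dummy version of the data below.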

import pandas as pd
from pandas import json_normalize  # pandas >= 1.0

raw = {"team":["Team_1","Team_2"],
       "who":[[{"name":"Andy", "age":22},{"name":"Rick", "age":30}],[{"name":"Oli", "age":19},{"name":"Joe", "age":21}]]}

df = pd.DataFrame(raw)

I would like to produce the following columns (or an equivalent):

team      name_1   name_2   age_1    age_2
Team_1    Andy     Rick     22       30
Team_2    Oli      Joe      19       21

I tried the following approaches.

Attempt 1:

test_norm = json_normalize(data=df)
AttributeError: 'str' object has no attribute 'values'

Attempt 2:

test_norm = json_normalize(data=df, record_path='who')
TypeError: string indices must be integers

Attempt 3:

test_norm = json_normalize(data=df, record_path='who', meta=['team'])
TypeError: string indices must be integers

Is there an efficient way to do this? I have looked for solutions in other Stack Overflow threads, but could not find one that works with json_normalize.

Answer 1

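I also had trouble using json_normalize on the lists of dicts contained in the who column. My workaround was to reformat each row into a single dict with unique keys for each team member's name and age (name_1, age_1, name_2, and so on). After that, creating a DataFrame with the structure you want is simple.

Here are my steps. Starting from your example: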

raw = {"team":["Team_1","Team_2"],
       "who":[[{"name":"Andy", "age":22},{"name":"Rick", "age":30}],[{"name":"Oli", "age":19},{"name":"Joe", "age":21}]]}

df = pd.DataFrame(raw)
df

    team    who
0   Team_1  [{'name': 'Andy', 'age': 22}, {'name': 'Rick',...
1   Team_2  [{'name': 'Oli', 'age': 19}, {'name': 'Joe', '...

Write a function that reformats a list into a dict, and apply it to every row of the who column:

def reformat(x):
    res = {}
    for i, item in enumerate(x):
        res['name_' + str(i+1)] = item['name']
        res['age_' + str(i+1)] = item['age']
    return res

df['who'] = df['who'].apply(reformat)
df

    team    who
0   Team_1  {'name_1': 'Andy', 'age_1': 22, 'name_2': 'Ric...
1   Team_2  {'name_1': 'Oli', 'age_1': 19, 'name_2': 'Joe'...

Use json_normalize on the who column. Then make sure the columns of the normalized DataFrame appear in the desired order:

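import pandas as pd
from pandas import json_normalize  # pandas >= 1.0; older versions: from pandas.io.json import json_normalize

# 'team' is not a key inside the who dicts, so no meta argument is needed here
n = json_normalize(df['who'].tolist())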
n = n.reindex(sorted(n.columns, reverse=True, key=len), axis=1)
n

    name_1  name_2  age_1   age_2
0   Andy    Rick       22      30
1   Oli     Joe        19      21

Join the DataFrame created by json_normalize to the original df, and drop the who column:

df = df.join(n).drop('who', axis=1)
df

    team    name_1  name_2  age_1   age_2
0   Team_1  Andy    Rick       22      30
1   Team_2  Oli     Joe        19      21

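If your real .csv file has a very large number of rows, my solution may be a bit too expensive (note that it iterates over every row, and then over every entry in the list that row contains). If it does not (hopefully), my approach may be good enough.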
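One more point for the real file: after read_csv, each who cell is a string rather than a list of dicts, which is presumably where the "string indices must be integers" errors in the question come from. A minimal sketch of parsing the column first, assuming the cells hold valid JSON text; the csv_text content and the df_csv name are made up for illustration:

```python
import io
import json

import pandas as pd

# Hypothetical CSV content standing in for the real file;
# the "who" cells hold JSON text (inner quotes doubled, as CSV requires).
csv_text = '''team,who
Team_1,"[{""name"": ""Andy"", ""age"": 22}, {""name"": ""Rick"", ""age"": 30}]"
Team_2,"[{""name"": ""Oli"", ""age"": 19}, {""name"": ""Joe"", ""age"": 21}]"
'''

df_csv = pd.read_csv(io.StringIO(csv_text))

# At this point every cell in "who" is a plain str.
# json.loads turns each one into a list of dicts.
df_csv["who"] = df_csv["who"].apply(json.loads)

print(type(df_csv.loc[0, "who"]))
print(df_csv.loc[0, "who"][1]["name"])
```

If the cells were written out as Python literals (single quotes) rather than JSON, ast.literal_eval would be the tool to try instead of json.loads.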

Answer 2

One option is to unpack the dictionaries yourself. Like this:

import pandas as pd
from pandas import json_normalize  # pandas >= 1.0

raw = {"team":["Team_1","Team_2"],
       "who":[[{"name":"Andy", "age":22},{"name":"Rick", "age":30}],[{"name":"Oli", "age":19},{"name":"Joe", "age":21}]]}


# add the corresponding team to the dictionary containing the person information
for idx, list_of_people in enumerate(raw['who']):
    for person in list_of_people:
        person['team'] = raw['team'][idx]

# flatten the dictionary
list_of_dicts = [dct for list_of_people in raw['who'] for dct in list_of_people]

# normalize to dataframe
json_normalize(list_of_dicts)

# due to unpacking of dict, this results in the same as doing
pd.DataFrame(list_of_dicts)

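This output is somewhat different from what you asked for, but in my experience this long format is usually more convenient for further analysis.

Output: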

age name    team
22  Andy    Team_1
30  Rick    Team_1
19  Oli     Team_2
21  Joe     Team_2
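The same long format can probably be obtained from json_normalize directly, provided it is given a list of records rather than the DataFrame itself. A minimal sketch, assuming pandas >= 1.0 (where json_normalize is available as pd.json_normalize); the names records and long_df are only illustrative:

```python
import pandas as pd

raw = {"team": ["Team_1", "Team_2"],
       "who": [[{"name": "Andy", "age": 22}, {"name": "Rick", "age": 30}],
               [{"name": "Oli", "age": 19}, {"name": "Joe", "age": 21}]]}

# json_normalize expects a dict or a list of dicts, so convert the rows first:
# [{"team": "Team_1", "who": [...]}, {"team": "Team_2", "who": [...]}]
records = pd.DataFrame(raw).to_dict("records")

# record_path points at the nested list; meta carries the team along to each person
long_df = pd.json_normalize(records, record_path="who", meta=["team"])
print(long_df)
```

This does not mutate the original dicts, unlike the loop above that adds a team key to each person.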

Answer 3

You can process each element of raw['who'] separately, but when you do, the resulting DataFrame puts the two people on separate rows.

Example:

json_normalize(raw['who'][0])

Output:

age     name
22      Andy
30      Rick

You can flatten each of these into a single row, then concatenate all the rows to get the final output.

def flatten(df_temp):
    df_temp.index = df_temp.index.astype(str)
    flattened_df = df_temp.unstack().to_frame().sort_index(level=1).T
    flattened_df.columns = flattened_df.columns.map('_'.join)
    return flattened_df

# json_normalize already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed
df = pd.concat([flatten(json_normalize(x)) for x in raw['who']], ignore_index=True)
df['team'] = raw['team']

Output:

age_0   name_0  age_1   name_1  team
22      Andy    30      Rick    Team_1
19      Oli     21      Joe     Team_2
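An alternative way to reach a wide layout is to build the long table first and then pivot it. This is only a sketch under assumptions: it needs pandas >= 1.1 (for a list passed to the values argument of pivot), and the helper column n and the names long_df and wide are made up for illustration:

```python
import pandas as pd

raw = {"team": ["Team_1", "Team_2"],
       "who": [[{"name": "Andy", "age": 22}, {"name": "Rick", "age": 30}],
               [{"name": "Oli", "age": 19}, {"name": "Joe", "age": 21}]]}

# Long format: one row per person, with the team attached as meta
records = pd.DataFrame(raw).to_dict("records")
long_df = pd.json_normalize(records, record_path="who", meta=["team"])

# Number the people within each team: 1, 2, ...
long_df["n"] = long_df.groupby("team").cumcount() + 1

# Pivot to one row per team; columns become (field, n) pairs
wide = long_df.pivot(index="team", columns="n", values=["name", "age"])

# Collapse the two-level column labels into name_1, name_2, age_1, age_2
wide.columns = [f"{field}_{n}" for field, n in wide.columns]
wide = wide.reset_index()
print(wide)
```

Teams with fewer members than the largest team would get missing values in the extra columns, which may or may not be what you want.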

How do I convert a pandas JSON column into a DataFrame?