Iterating over a Pandas dataframe containing nested json dicts and arrays

Asked: 2018-12-19 13:29:12

Tags: python pandas performance dataframe

I have a list of more-or-less homogeneous json dictionaries that I load into a Pandas dataframe. Any given dict can contain an arbitrary number of nesting levels made up only of other dicts or arrays, for example:

[
    {"id": [0], "options": [{"name": "dhl", "price": 10}]},
    {"id": [0, 1], "options": [{"name": "dhl", "price": 50}, {"name": "fedex", "price": "100"}]},
]
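Loaded as-is, the nested values simply become Python objects inside object-dtype columns, which is why vectorized operations are not directly available on them (a minimal illustration):

```python
import pandas as pd

data = [
    {"id": [0], "options": [{"name": "dhl", "price": 10}]},
    {"id": [0, 1], "options": [{"name": "dhl", "price": 50}, {"name": "fedex", "price": "100"}]},
]
df = pd.DataFrame(data)
# Both columns are object dtype: each cell holds a raw list/dict,
# so str.contains and friends cannot reach the nested fields yet
```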

Now I want to be able to efficiently check specific fields - match them against a regex, compare whole columns between two dataframes, and so on. The fields in this example are id, options.name, and options.price.

One approach I found is to flatten the dataframe once, which lets us use vectorized operations such as str.contains.

Here is my recursive solution:

import uuid

import pandas as pd

def flatten_df(df, i=0, columns_map=None):
    if columns_map is None:
        columns_map = {}

    for c in df.columns[i:]:
        flattened_columns = expand_column(df, c)
        if flattened_columns.empty:
            i += 1
            continue

        def name_column(x):
            # Name the new column after its parent; fall back to a random
            # suffix if that name is already taken
            new_name = f"{c}_{x}"
            if new_name in df.columns:
                new_name = f"{c}_{uuid.uuid1().hex[:5]}"

            # Track which original column each expanded column came from
            if c in columns_map:
                columns_map[new_name] = columns_map[c]
            else:
                columns_map[new_name] = c
            return new_name

        flattened_columns = flattened_columns.rename(columns=name_column)
        df = pd.concat([df, flattened_columns], axis=1).drop(c, axis=1)
        columns_map.pop(c, None)
        return flatten_df(df, i, columns_map)
    return df, columns_map

def expand_column(df, column):
    # Expand only the cells that actually hold a list or a dict
    mask = df[column].map(lambda x: isinstance(x, (list, dict)))
    collection_column = df[mask][column]
    return collection_column.apply(pd.Series)

This is the output:

   id_0  id_1 options_0_name options_0_price options_1_name options_1_price
0   0.0   NaN            dhl              10            NaN             NaN
1   0.0   1.0            dhl              50          fedex             100

Now I can apply vectorized methods and, when needed, map the expanded columns back to the original ones.
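For illustration, a minimal sketch of the kind of vectorized check this enables. The flattened frame and a simplified columns_map are rebuilt inline here; flatten_df's actual map may differ for collision-renamed columns:

```python
import pandas as pd

# The flattened frame from the output above, rebuilt directly for illustration
flat = pd.DataFrame({
    "id_0": [0.0, 0.0],
    "id_1": [float("nan"), 1.0],
    "options_0_name": ["dhl", "dhl"],
    "options_0_price": [10, 50],
    "options_1_name": [float("nan"), "fedex"],
    "options_1_price": [float("nan"), 100],
})
# Simplified stand-in for the columns_map returned by flatten_df:
# expanded column name -> original column name
columns_map = {c: c.split("_")[0] for c in flat.columns}

# Vectorized regex check across every "name" column derived from "options"
name_cols = [c for c, orig in columns_map.items()
             if orig == "options" and c.endswith("name")]
mask = pd.concat(
    [flat[c].str.contains("fed", na=False) for c in name_cols], axis=1
).any(axis=1)
# mask now selects the rows where any carrier name matches "fed"
```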

However, since the list can be large - up to millions of dictionaries - this solution's performance degrades significantly as the number of nested fields (and therefore the recursion depth) grows.

I have used pandas.io.json.json_normalize before, but it only expands dictionaries.
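For context, a sketch of that limitation (written with pd.json_normalize, the modern spelling of pandas.io.json.json_normalize): nested dicts would be expanded into dotted columns, but list-valued fields pass through untouched, and record_path can only explode a single list into rows:

```python
import pandas as pd

data = [
    {"id": [0], "options": [{"name": "dhl", "price": 10}]},
    {"id": [0, 1], "options": [{"name": "dhl", "price": 50}, {"name": "fedex", "price": "100"}]},
]

# Top-level lists survive as-is: no id_0/id_1 columns appear
flat = pd.json_normalize(data)

# record_path explodes ONE list of dicts into rows (long format),
# not several independent list fields into columns
opts = pd.json_normalize(data, record_path="options")
```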

Are there other efficient approaches? The data can vary, but the number of operations performed on it is bounded.

Update - performance stats:

These are the %prun numbers for an array of 200k items with a relatively small number of nested fields:

         101001482 function calls (100789761 primitive calls) in 79.717 seconds

   Ordered by: internal time
   List reduced from 478 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
 22800000   10.062    0.000   16.327    0.000 <ipython-input-8-786bcc78e0b9>:56(<lambda>)
 53689789    9.168    0.000   10.769    0.000 {built-in method builtins.isinstance}
      139    6.827    0.049   44.534    0.320 {pandas._libs.lib.map_infer}
       25    4.134    0.165    6.469    0.259 internals.py:5074(_merge_blocks)
     26/1    3.525    0.136   79.574   79.574 <ipython-input-8-786bcc78e0b9>:1(flatten_df)
       28    2.958    0.106    2.958    0.106 {pandas._libs.algos.take_2d_axis0_object_object}
      217    2.416    0.011    2.416    0.011 {method 'copy' of 'numpy.ndarray' objects}
      100    2.355    0.024    2.355    0.024 {built-in method numpy.core.multiarray.concatenate}
   102236    2.223    0.000    2.784    0.000 generic.py:4378(__setattr__)
    66259    2.022    0.000    2.022    0.000 {pandas._libs.lib.maybe_convert_objects}
    66261    1.606    0.000    2.670    0.000 {method 'get_indexer' of 'pandas._libs.index.IndexEngine' objects}
    66510    1.413    0.000    3.235    0.000 cast.py:971(maybe_cast_to_datetime)
   133454    1.257    0.000    1.257    0.000 {built-in method numpy.core.multiarray.empty}
69329/34771    1.232    0.000    5.796    0.000 base.py:255(__new__)
101377/66756    1.178    0.000   21.435    0.000 series.py:166(__init__)
   468050    1.102    0.000    4.105    0.000 common.py:1688(is_extension_array_dtype)
  1850890    1.089    0.000    1.089    0.000 {built-in method builtins.hasattr}
   872564    1.044    0.000    2.070    0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
    66400    1.005    0.000    8.859    0.000 algorithms.py:1548(take_nd)
464282/464168    0.940    0.000    0.942    0.000 {built-in method numpy.core.multiarray.array}

I can see that a significant amount of time is spent checking data types.
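One direction that sidesteps the repeated dtype inference entirely (a sketch, not from the post; flatten_item is a hypothetical helper): flatten each raw dict in plain Python before pandas ever sees it, then build the frame once:

```python
import pandas as pd

def flatten_item(obj, prefix=""):
    """Recursively flatten one record's dicts/lists into {flat_key: scalar}."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten_item(value, f"{prefix}{key}_"))
    elif isinstance(obj, list):
        for idx, value in enumerate(obj):
            flat.update(flatten_item(value, f"{prefix}{idx}_"))
    else:
        flat[prefix[:-1]] = obj  # drop the trailing "_"
    return flat

data = [
    {"id": [0], "options": [{"name": "dhl", "price": 10}]},
    {"id": [0, 1], "options": [{"name": "dhl", "price": 50}, {"name": "fedex", "price": "100"}]},
]
# One DataFrame construction instead of repeated concat/drop cycles
df = pd.DataFrame([flatten_item(record) for record in data])
```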

1 Answer:

Answer 0 (score: 0):

I arrived at a vectorized solution using apply(pd.Series), although I had to write some additional code to make it work as expected.

flatten_list_cols - for columns that contain lists of primitive elements
flatten_list_of_dict_cols - for columns that contain lists of dictionary elements

Here is the solution:

import pandas as pd

df = pd.DataFrame([
    {"id": [0], "options": [{"name": "dhl", "price": 10}]},
    {"id": [0, 1], "options": [{"name": "dhl", "price": 50}, {"name": "fedex", "price": 100}]},
])

def flatten_list_cols(df, columns):
    for col in columns:
        # Flatten list of elements into individual columns (e.g. id: [0, 1] to columns id_0 and id_1)
        df = pd.concat([df, df[col].apply(pd.Series).add_prefix(f'{col}_')], axis=1)
        df = df.drop(col, axis=1)

    return df

def flatten_list_of_dict_cols(df, columns):
    for col in columns:
        # Flatten the list into individual columns (e.g. options: [...] to columns options_0 and options_1)
        df = pd.concat([df, df[col].apply(pd.Series).add_prefix(f'{col}_')], axis=1)
        # Drop the initial column
        df = df.drop(col, axis=1)

        # Flatten every resulting "dict" column
        cols_to_flatten = df.filter(regex=f'^{col}_').columns
        for i in cols_to_flatten:
            df = pd.concat([df, df[i].apply(pd.Series).add_prefix(f'{i}_')], axis=1)

            # apply(pd.Series) turns NaN cells into a lone column named 0;
            # drop that artifact
            if f'{i}_0' in df.columns:
                df = df.drop(f'{i}_0', axis=1)

        # Drop the already-flattened columns holding individual dicts
        # (e.g. "options_0", "options_1", etc.)
        for i in range(0, len(cols_to_flatten)):
            df = df.drop(f'{col}_{i}', axis=1)

    return df

df = flatten_list_cols(df, ['id'])
df = flatten_list_of_dict_cols(df, ['options'])

Result:

df
    id_0    id_1    options_0_name  options_0_price options_1_name  options_1_price
0   0.0     NaN     dhl             10              NaN             NaN
1   0.0     1.0     dhl             50              fedex           100.0