I have a list of more-or-less homogeneous JSON dictionaries that I load into a pandas DataFrame. Any given cell can contain an arbitrary number of nesting levels made up only of further dicts or arrays, for example:
[
{"id": [0], "options": [{"name": "dhl", "price": 10}]},
{"id": [0, 1], "options": [{"name": "dhl", "price": 50}, {"name": "fedex", "price": "100"}]},
]
Now I want to be able to efficiently check specific fields: match them against a regex, compare whole columns between two DataFrames, and so on. The fields in this example are id, options.name and options.price.
One approach I found is to flatten the DataFrame once, which then lets us use vectorized operations such as str.contains.
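For instance, once a nested field has been flattened into its own column, a regex check becomes a single vectorized call. A minimal sketch, using a hand-built frame that mimics the flattened layout shown in the output further down:

```python
import pandas as pd

# A hand-built frame mimicking the flattened layout (illustrative data only)
df = pd.DataFrame({
    "options_0_name": ["dhl", "dhl"],
    "options_1_name": [None, "fedex"],
})

# Vectorized regex check over a whole flattened column;
# na=False makes missing cells count as non-matches
mask = df["options_1_name"].str.contains(r"^fed", na=False)
print(mask.tolist())  # [False, True]
```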
Here is my recursive solution:
import uuid

import pandas as pd


def flatten_df(df, i=0, columns_map=None):
    if not columns_map:
        columns_map = {}
    for c in df.columns[i:]:
        flattened_columns = expand_column(df, c)
        if flattened_columns.empty:
            i += 1
            continue

        def name_column(x):
            new_name = f"{c}_{x}"
            if new_name in df.columns:
                new_name = f"{c}_{uuid.uuid1().hex[:5]}"
            if c in columns_map:
                columns_map[new_name] = columns_map[c]
            else:
                columns_map[new_name] = c
            return new_name

        flattened_columns = flattened_columns.rename(columns=name_column)
        df = pd.concat([df[:], flattened_columns[:]], axis=1).drop(c, axis=1)
        columns_map.pop(c, None)
        return flatten_df(df, i, columns_map)
    return df, columns_map


def expand_column(df, column):
    mask = df[column].map(lambda x: isinstance(x, (list, dict)))
    collection_column = df[mask][column]
    return collection_column.apply(pd.Series)
Here is the output:
id_0 id_1 options_0_name options_0_price options_1_name options_1_price
0 0.0 NaN dhl 10 NaN NaN
1 0.0 1.0 dhl 50 fedex 100
Now I can run vectorized operations, and map the expanded columns back to the original ones when needed.
However, since the list can be large (up to millions of dictionaries), the performance of this solution degrades significantly as the number of nested fields, and hence the number of recursive calls, grows.
I have used pandas.io.json.json_normalize before, but it only expands dicts.
Are there other efficient approaches? The data can vary, but the number of operations performed on it is bounded.
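For context on the json_normalize limitation: it flattens nested dicts into dotted columns, but list values are left untouched unless you pass record_path, which explodes list elements into extra rows rather than extra columns. A minimal sketch (in pandas >= 1.0 it is exposed as pd.json_normalize):

```python
import pandas as pd

records = [
    {"id": [0], "options": [{"name": "dhl", "price": 10}]},
    {"id": [0, 1], "options": [{"name": "dhl", "price": 50}, {"name": "fedex", "price": 100}]},
]

# Nested dicts would be flattened into dotted columns,
# but list values such as "options" are left as-is
flat = pd.json_normalize(records)
print(sorted(flat.columns))  # ['id', 'options']

# record_path explodes each list element into its own row, not its own column
per_option = pd.json_normalize(records, record_path="options")
print(sorted(per_option.columns))  # ['name', 'price']
print(len(per_option))  # 3
```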
Update with performance stats:
These are the %prun numbers for an array of 200k items with a relatively small number of nested fields:
101001482 function calls (100789761 primitive calls) in 79.717 seconds
Ordered by: internal time
List reduced from 478 to 20 due to restriction <20>
ncalls tottime percall cumtime percall filename:lineno(function)
22800000 10.062 0.000 16.327 0.000 <ipython-input-8-786bcc78e0b9>:56(<lambda>)
53689789 9.168 0.000 10.769 0.000 {built-in method builtins.isinstance}
139 6.827 0.049 44.534 0.320 {pandas._libs.lib.map_infer}
25 4.134 0.165 6.469 0.259 internals.py:5074(_merge_blocks)
26/1 3.525 0.136 79.574 79.574 <ipython-input-8-786bcc78e0b9>:1(flatten_df)
28 2.958 0.106 2.958 0.106 {pandas._libs.algos.take_2d_axis0_object_object}
217 2.416 0.011 2.416 0.011 {method 'copy' of 'numpy.ndarray' objects}
100 2.355 0.024 2.355 0.024 {built-in method numpy.core.multiarray.concatenate}
102236 2.223 0.000 2.784 0.000 generic.py:4378(__setattr__)
66259 2.022 0.000 2.022 0.000 {pandas._libs.lib.maybe_convert_objects}
66261 1.606 0.000 2.670 0.000 {method 'get_indexer' of 'pandas._libs.index.IndexEngine' objects}
66510 1.413 0.000 3.235 0.000 cast.py:971(maybe_cast_to_datetime)
133454 1.257 0.000 1.257 0.000 {built-in method numpy.core.multiarray.empty}
69329/34771 1.232 0.000 5.796 0.000 base.py:255(__new__)
101377/66756 1.178 0.000 21.435 0.000 series.py:166(__init__)
468050 1.102 0.000 4.105 0.000 common.py:1688(is_extension_array_dtype)
1850890 1.089 0.000 1.089 0.000 {built-in method builtins.hasattr}
872564 1.044 0.000 2.070 0.000 <frozen importlib._bootstrap>:1009(_handle_fromlist)
66400 1.005 0.000 8.859 0.000 algorithms.py:1548(take_nd)
464282/464168 0.940 0.000 0.942 0.000 {built-in method numpy.core.multiarray.array}
I can see that a significant amount of time is spent on type checks.
Answer 0 (score: 0):
I arrived at a vectorized solution using apply(pd.Series), although I had to write some additional code to make it work as expected:
flatten_list_cols is for columns that contain lists of primitive elements;
flatten_list_of_dict_cols is for columns that contain lists of dictionary elements.
Here is the solution:
import pandas as pd

df = pd.DataFrame([
    {"id": [0], "options": [{"name": "dhl", "price": 10}]},
    {"id": [0, 1], "options": [{"name": "dhl", "price": 50}, {"name": "fedex", "price": 100}]},
])


def flatten_list_cols(df, columns):
    for col in columns:
        # Flatten a list of elements into individual columns
        # (e.g. id: [0, 1] into columns id_0 and id_1)
        df = pd.concat([df, df[col].apply(pd.Series).add_prefix(f'{col}_')], axis=1)
        df = df.drop(col, axis=1)
    return df


def flatten_list_of_dict_cols(df, columns):
    for col in columns:
        # Flatten a list of dicts into individual columns
        # (e.g. options: [...] into columns options_0 and options_1)
        df = pd.concat([df, df[col].apply(pd.Series).add_prefix(f'{col}_')], axis=1)
        # Drop the initial column
        df = df.drop(col, axis=1)
        # Flatten all resulting "dict" columns
        cols_to_flatten = df.filter(regex=f'{col}').columns
        for i in cols_to_flatten:
            df = pd.concat([df, df[i].apply(pd.Series).add_prefix(f'{i}_')], axis=1)
            # Drop redundant columns produced from NaN cells
            if f'{i}_0' in df.columns:
                df = df.drop(f'{i}_0', axis=1)
        # Drop the already-flattened columns holding individual dicts
        # (e.g. "options_0", "options_1", etc.)
        for i in range(0, len(cols_to_flatten)):
            df = df.drop(f'{col}_{i}', axis=1)
    return df


df = flatten_list_cols(df, ['id'])
df = flatten_list_of_dict_cols(df, ['options'])
Result:
df
id_0 id_1 options_0_name options_0_price options_1_name options_1_price
0 0.0 NaN dhl 10 NaN NaN
1 0.0 1.0 dhl 50 fedex 100.0
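With the frame flattened this way, the other operation mentioned in the question, comparing whole columns between two DataFrames, also becomes a single vectorized call. A sketch with hypothetical data in the layout of the result above:

```python
import pandas as pd

# Two flattened frames with hypothetical price data
a = pd.DataFrame({"options_0_price": [10, 50]})
b = pd.DataFrame({"options_0_price": [10, 100]})

# Whole-column, vectorized comparison between the two frames
same = a["options_0_price"].eq(b["options_0_price"])
print(same.tolist())  # [True, False]
```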