I am using Python 3 and pandas 0.25. I have a column of JSON data type in a PostgreSQL table, and I am using pandas.io.sql to fetch the data from that table.
import pandas.io.sql as psql
df = psql.read_sql(sql,con,params=params)
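So the database call above gives me a dataframe.
When I inspect the output of df (using an IDE), I see a dataframe with the following contents:
I want to aggregate the data; for simplicity, only 3 columns are selected. I need to group by col1_data. I would like the following:
Basically, it aggregates over multiple columns. The main problem, though, is merging the JSON column. Which aggregation function can help me here?
Based on earlier help, I tried the following to merge the JSON column with a lambda, but it does not work. I am trying the JSON column first; the other columns can probably be simple sums.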
df = df.groupby(['col1_data']).apply(lambda row: [{**x} for x in row['col2_data']])
I get the error:
'list' object is not a mapping
Can someone help me here? Thanks.
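The error message is consistent with each cell of col2_data being a list of dicts rather than a single dict, so `{**x}` ends up unpacking a list. The following is a minimal sketch of that reading, assuming the cells look like the sample data given later in this post; `first_cell`, `error_message`, and `merged` are names introduced here for illustration only.

```python
import pandas as pd

# Sample data shaped like the question's: every col2_data cell is a list of dicts.
df = pd.DataFrame({
    "col1_data": ["A1", "A1"],
    "col2_data": [
        [{"scenario": 1, "scenario_name": "Test", "value": "100"}],
        [{"scenario": 1, "scenario_name": "Test1", "value": "10"},
         {"scenario": 2, "scenario_name": "Test2", "value": "500"}],
    ],
})

first_cell = df["col2_data"].iloc[0]

# Unpacking a list with ** raises the TypeError quoted in the question.
try:
    {**first_cell}
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Iterating one level deeper reaches the dicts, so copying them works.
merged = [{**d} for cell in df["col2_data"] for d in cell]
print(error_message)
print(len(merged))
```

The nested comprehension is the key difference: the outer loop walks the cells, the inner loop walks the dicts inside each cell.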
Update:
The following code can be used to create a sample dataframe:
import collections
import datetime
import pandas as pd
import numpy as np
data = {
'col1_data': ['A1', 'A1'],
'col2_data': [[{"scenario": 1, "scenario_name": "Test", "value": "100"}], [{"scenario": 1, "scenario_name": "Test1", "value": "10"}, {"scenario": 2, "scenario_name": "Test2", "value": "500"}]]
}
df = pd.DataFrame(data)
with pd.option_context('display.max_colwidth', 1000):  # more options can also be specified
    print(df)
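So I need to group by col1_data, and col2_data should be merged into a single JSON list as described above.
Update 2:
That solution works for the dataset above. However, it does not work when col1_data has 2 unique values.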
data = {
'col1_data': ['A1', 'A1', 'A2', 'A2'],
'col2_data': [[{"scenario": 1, "scenario_name": "Test", "value": "100"}], [{"scenario": 1, "scenario_name": "Test1", "value": "10"}, {"scenario": 2, "scenario_name": "Test2", "value": "500"}],[{"scenario": 1, "scenario_name": "Test", "value": "10"}], [{"scenario": 1, "scenario_name": "Test1", "value": "110"}, {"scenario": 2, "scenario_name": "Test2", "value": "1500"}]]
}
df = pd.DataFrame(data)
Output of df:
col1_data \
0 A1
1 A1
2 A2
3 A2
col2_data
0 [{'scenario': 1, 'scenario_name': 'Test', 'value': '100'}]
1 [{'scenario': 1, 'scenario_name': 'Test1', 'value': '10'}, {'scenario': 2, 'scenario_name': 'Test2', 'value': '500'}]
2 [{'scenario': 1, 'scenario_name': 'Test', 'value': '10'}]
3 [{'scenario': 1, 'scenario_name': 'Test1', 'value': '110'}, {'scenario': 2, 'scenario_name': 'Test2', 'value': '1500'}]
Now, when I run the same function, I get the following error:
df = (df
.groupby('col1_data')['col2_data']
.apply(lambda x: np.concatenate(x).tolist())
.reset_index())
Error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/usr/local/lib64/python3.6/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
724 try:
--> 725 result = self._python_apply_general(f)
726 except Exception:
/usr/local/lib64/python3.6/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
741 def _python_apply_general(self, f):
--> 742 keys, values, mutated = self.grouper.apply(f, self._selected_obj, self.axis)
743
/usr/local/lib64/python3.6/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
236 group_axes = _get_axes(group)
--> 237 res = f(group)
238 if not _is_indexed_like(res, group_axes):
<ipython-input-109-61a2e6a29020> in <lambda>(x)
6 .groupby('col1_data')['col2_data']
----> 7 .apply(lambda x: np.concatenate(x).tolist())
8 .reset_index())
<__array_function__ internals> in concatenate(*args, **kwargs)
/usr/local/lib64/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
1067 try:
-> 1068 result = self.index.get_value(self, key)
1069
/usr/local/lib64/python3.6/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
4729 try:
-> 4730 return self._engine.get_value(s, k, tz=getattr(series.dtype, "tz", None))
4731 except KeyError as e1:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_item()
KeyError: 0
Any ideas?
Answer 0 (score: 1)
Here is one way you can try:
import numpy as np
f = (df
.groupby('col1_data')['col2_data']
.apply(lambda x: np.concatenate(x).tolist())
.reset_index())
col1_data col2_data
0 A1 [{'scenario': 1, 'scenario_name': 'Test', 'val...
Solution 2:
f = (df
.groupby('col1_data')['col2_data']
.apply(lambda x: np.concatenate(x.values))
.reset_index())
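On the KeyError: 0 reported in Update 2: the traceback suggests that np.concatenate treats the group's Series as a plain sequence and asks for x[0], which pandas resolves as a label lookup. The A2 group has index labels 2 and 3, so label 0 does not exist there. Passing x.values avoids that lookup. Another option, sketched below on the assumption that a plain Python list per group is acceptable, is to flatten with itertools.chain, which iterates over the Series values and never touches the index labels. The name `merged` is introduced here for illustration only.

```python
from itertools import chain

import pandas as pd

data = {
    "col1_data": ["A1", "A1", "A2", "A2"],
    "col2_data": [
        [{"scenario": 1, "scenario_name": "Test", "value": "100"}],
        [{"scenario": 1, "scenario_name": "Test1", "value": "10"},
         {"scenario": 2, "scenario_name": "Test2", "value": "500"}],
        [{"scenario": 1, "scenario_name": "Test", "value": "10"}],
        [{"scenario": 1, "scenario_name": "Test1", "value": "110"},
         {"scenario": 2, "scenario_name": "Test2", "value": "1500"}],
    ],
}
df = pd.DataFrame(data)

# chain.from_iterable walks the cells of each group in order and yields
# their dicts one by one, regardless of the group's index labels.
merged = (
    df.groupby("col1_data")["col2_data"]
      .apply(lambda cells: list(chain.from_iterable(cells)))
      .reset_index()
)
print(merged)
```

This keeps the result as ordinary lists of dicts rather than NumPy object arrays, which may be easier to serialize back to JSON.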
Answer 1 (score: 0)
You can try the following:
new_dat = {col:[] for col in df.columns}
for key, val in df.groupby('col1_data'):
    # One output row per group: the group key and the flattened list of dicts
    new_dat['col1_data'] += [key]
    new_dat['col2_data'] += [[dic for lst in val['col2_data'] for dic in lst]]
new_df_1 = pd.DataFrame(new_dat)
col1_data col2_data
0 A1 [{'scenario': 1, 'scenario_name': 'Test', 'val...
1 A2 [{'scenario': 1, 'scenario_name': 'Test', 'val...
Or, in the same style as @YOLO's answer:
new_df_2 = (df
.groupby('col1_data')['col2_data']
.apply(lambda x: [dic for lst in x for dic in lst])
.reset_index())
col1_data col2_data
0 A1 [{'scenario': 1, 'scenario_name': 'Test', 'val...
1 A2 [{'scenario': 1, 'scenario_name': 'Test', 'val...
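The question also mentions summing the other columns in the same groupby. A rough sketch of how that could look with a per-column agg mapping follows. The numeric column `col3_data`, the shortened dicts, and the helper `merge_lists` are all hypothetical, since the question does not show the remaining columns.

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({
    "col1_data": ["A1", "A1", "A2"],
    "col2_data": [[{"scenario": 1}], [{"scenario": 2}], [{"scenario": 3}]],
    "col3_data": [10, 20, 5],  # hypothetical numeric column, not from the question
})


def merge_lists(cells):
    """Flatten the per-row lists of dicts in one group into a single list."""
    return list(chain.from_iterable(cells))


# Each column gets its own aggregation: list merging for the JSON column,
# a plain sum for the numeric one.
result = (
    df.groupby("col1_data")
      .agg({"col2_data": merge_lists, "col3_data": "sum"})
      .reset_index()
)
print(result)
```

More columns can be added to the mapping with whatever aggregation suits each of them.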