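I have a dataframe with the structure shown in the table below.
"SECOND" is a list and "COUNT" is a list of event counts (for example, as the first row shows, I have 2 "A" events and 1 "B" event).
What I want to do is aggregate this dataframe by grouping on "FIRST" and concatenating "SECOND" and "COUNT", summing the "COUNT" values wherever the "SECOND" values are the same.
The data: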
<table border=1>
<tr><th>FIRST</th><th>SECOND</th><th>COUNT</th></tr>
<tr><td>1</td><td>['A','B']</td><td>['2','1']</td></tr>
<tr><td>2</td><td>['C','D']</td><td>['1','1']</td></tr>
<tr><td>1</td><td>['A','E']</td><td>['1','1']</td></tr>
<tr><td>2</td><td>['C','F']</td><td>['2','1']</td></tr>
</table>
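I managed to do the grouping by first concatenating "SECOND" and "COUNT". What should I do to also group by "SECOND" and sum the "COUNT" values where the "SECOND" values are equal?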
Answer 0 (score: 1)
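Let's solve the problem step by step so that it is as clear as possible what is happening. All explanations are given in the code comments below. In short: the list strings are converted to lists, and the lists are combined into dictionaries. When grouping, the dictionaries are merged using a purpose-written function. The results are placed into fields, and everything superfluous is cut away.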
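Before the full pandas version, the core idea (zip each row's events and counts into a dict, then sum the dicts key by key) can be sketched in plain Python. This is only an illustration, not part of the original answer: it swaps the hand-written merge function for `collections.Counter`, and the `rows` list is a hypothetical stand-in for the two table rows with FIRST = 1.

```python
from collections import Counter

# Hypothetical stand-in for the two rows that share FIRST = 1
# (values copied from the question's table, counts already as ints)
rows = [
    (["A", "B"], [2, 1]),
    (["A", "E"], [1, 1]),
]

totals = Counter()
for events, counts in rows:
    # pair each event with its count, then add the counts into the running totals
    totals.update(dict(zip(events, counts)))

# sort by event name so the output order is stable
merged = dict(sorted(totals.items()))
print(merged)  # {'A': 3, 'B': 1, 'E': 1}
```

The keys of `merged` correspond to the combined "SECOND" list and the values to the summed "COUNT" list for that group.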
import pandas as pd
s1 = """\
<table border=1>
<tr><th>FIRST</th><th>SECOND</th><th>COUNT</th></tr>
<tr><td>1</td><td>['A','B']</td><td>['2','1']</td></tr>
<tr><td>2</td><td>['C','D']</td><td>['1','1']</td></tr>
<tr><td>1</td><td>['A','E']</td><td>['1','1']</td></tr>
<tr><td>2</td><td>['C','F']</td><td>['2','1']</td></tr>
</table>
"""
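from io import StringIO
# wrap the HTML string in StringIO: passing literal HTML to read_html is deprecated in recent pandas versions
tables = pd.read_html(StringIO(s1))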
df = tables[0]
# now we have the Pandas Dataframe with columns: 'FIRST', 'SECOND', 'COUNT'
# 'FIRST' contains integers
# 'SECOND' contains strings, not lists of strings
# 'COUNT' contains strings, not lists of integers
import ast
# convert 'SECOND' from string to list of strings
df['SECOND_LIST'] = df['SECOND'].apply(lambda x: ast.literal_eval(x))
# convert 'COUNT' from string to list of integers
df['COUNT_LIST'] = df['COUNT'].apply(lambda x: list(map(int, ast.literal_eval(x))))
from typing import List, Dict, Set  # import support for some type hints

def merge_dicts_sum(dict_list: List[Dict[str, int]]) -> Dict[str, int]:
    """
    merge a list of dicts to one dict, summing values for the same keys
    """
    keys: Set[str] = set()  # set of all unique keys of all dicts in list of dicts
    for dict_item in dict_list:
        keys.update(dict_item.keys())
    result_dict: Dict[str, int] = {}  # we will collect sums of values here
    for key in keys:
        result_dict[key] = 0  # start each key's sum at zero
        for dict_item in dict_list:
            if key in dict_item:
                result_dict[key] += dict_item[key]
    return dict(sorted(result_dict.items()))  # sort result by key then return it

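# combine each row's 'SECOND_LIST' and 'COUNT_LIST' into a dict {event: count} stored in field 'SECOND_COUNT_DICT'
df['SECOND_COUNT_DICT'] = [dict(zip(s, c)) for s, c in zip(df['SECOND_LIST'], df['COUNT_LIST'])]
# create a new dataframe by grouping by field 'FIRST' and aggregating dicts from 'SECOND_COUNT_DICT' into one list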
df_gr = df.groupby('FIRST').agg(SECOND_COUNT_DICT_LIST=('SECOND_COUNT_DICT',list))
# merge dicts from 'SECOND_COUNT_DICT_LIST' into one dict using function 'merge_dicts_sum'
df_gr['MERGED_DICT'] = df_gr['SECOND_COUNT_DICT_LIST'].apply(merge_dicts_sum)
df_gr['SECOND'] = df_gr['MERGED_DICT'].apply(lambda x: list(x.keys())) # place keys from 'MERGED_DICT' to field 'SECOND'
df_gr['COUNT'] = df_gr['MERGED_DICT'].apply(lambda x: list(x.values())) # place values from 'MERGED_DICT' to field 'COUNT'
df_result = df_gr[['SECOND', 'COUNT']]  # create a new dataframe keeping only the 'SECOND' and 'COUNT' fields
# store df_result as html (note: index=False leaves the 'FIRST' group keys out of the output)
result_str = df_result.to_html(index=False)
print(result_str)
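As a more compact alternative (a sketch, not the answer's method), the same aggregation can be done by exploding the list columns into one row per event, summing, and collecting the results back into lists. It assumes pandas 1.3 or newer, where `DataFrame.explode` accepts several columns at once, and it builds the sample frame directly with real lists and integer counts instead of parsing the HTML.

```python
import pandas as pd

# Sample data from the question, built directly with lists (an assumption:
# the list strings have already been parsed and the counts converted to int)
df = pd.DataFrame({
    "FIRST": [1, 2, 1, 2],
    "SECOND": [["A", "B"], ["C", "D"], ["A", "E"], ["C", "F"]],
    "COUNT": [[2, 1], [1, 1], [1, 1], [2, 1]],
})

# one row per (FIRST, event) pair; both list columns are exploded in lockstep
long_df = df.explode(["SECOND", "COUNT"])
long_df["COUNT"] = long_df["COUNT"].astype(int)  # explode leaves object dtype

# sum the counts of identical events within each FIRST group
summed = long_df.groupby(["FIRST", "SECOND"], as_index=False)["COUNT"].sum()

# collapse back to one row per FIRST with list-valued columns
result = summed.groupby("FIRST", as_index=False).agg({"SECOND": list, "COUNT": list})
print(result)
```

Unlike the step-by-step version, this keeps "FIRST" as a regular column in `result`, which may be closer to the layout asked for in the question.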