DataFrame aggregation problem in pandas

Time: 2020-05-15 15:54:06

Tags: python pandas

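I have a DataFrame with the columns "FIRST", "SECOND" and "COUNT"; the sample data is given in the HTML table further down.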

"SECOND" is a list of events and "COUNT" is a list of the corresponding event counts (for example, the first row records 2 "A" events and 1 "B" event).

What I want to do is group this DataFrame by "FIRST" and concatenate "SECOND" and "COUNT", summing the "COUNT" values wherever the "SECOND" entry is the same.

With this data:

<table border=1>
<tr><th>FIRST</th><th>SECOND</th><th>COUNT</th></tr>
<tr><td>1</td><td>['A','B']</td><td>['2','1']</td></tr>
<tr><td>2</td><td>['C','D']</td><td>['1','1']</td></tr>
<tr><td>1</td><td>['A','E']</td><td>['1','1']</td></tr>
<tr><td>2</td><td>['C','F']</td><td>['2','1']</td></tr>
</table>

I managed to group by "FIRST" while concatenating "SECOND" and "COUNT". What do I need to do to also group by "SECOND" and sum the "COUNT" values where the "SECOND" entries are equal?

1 answer:

Answer 0 (score: 1)

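Let's solve the problem step by step so that it is as clear as possible what is happening. All explanations are given in the code comments below. In short: the strings that look like lists are converted to real lists. Each pair of lists is combined into a dictionary. During grouping, the dictionaries are merged by a purpose-written function that sums the values of equal keys. The result is written back to the columns, and everything superfluous is dropped.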
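To make the dictionary step concrete before the full listing, here is a minimal sketch for the two FIRST = 1 rows only. It uses `collections.Counter` from the standard library as a stand-in for the hand-written merge function in the listing below, and the variable names are made up for this illustration.

```python
from collections import Counter

# the two rows that share FIRST == 1, already converted to real lists
second_1, count_1 = ["A", "B"], [2, 1]
second_2, count_2 = ["A", "E"], [1, 1]

# pair every event with its count: one dict per row
dict_1 = dict(zip(second_1, count_1))
dict_2 = dict(zip(second_2, count_2))

# Counter.update adds counts for keys that are already present
merged = Counter()
for row_dict in (dict_1, dict_2):
    merged.update(row_dict)

merged_sorted = dict(sorted(merged.items()))
print(merged_sorted)  # {'A': 3, 'B': 1, 'E': 1}
```

The pandas listing that follows applies the same idea to every group.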

import pandas as pd

s1 = """\
<table border=1>
<tr><th>FIRST</th><th>SECOND</th><th>COUNT</th></tr>
<tr><td>1</td><td>['A','B']</td><td>['2','1']</td></tr>
<tr><td>2</td><td>['C','D']</td><td>['1','1']</td></tr>
<tr><td>1</td><td>['A','E']</td><td>['1','1']</td></tr>
<tr><td>2</td><td>['C','F']</td><td>['2','1']</td></tr>
</table>
"""

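from io import StringIO

# read_html returns a list of DataFrames, one per <table> found;
# wrapping the literal HTML in StringIO avoids the deprecation warning of pandas >= 2.1
tables = pd.read_html(StringIO(s1))
df = tables[0]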
# now we have the Pandas Dataframe with columns: 'FIRST', 'SECOND', 'COUNT'
# 'FIRST' contains integers
# 'SECOND' contains strings, not lists of strings
# 'COUNT' contains strings, not lists of integers

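import ast
# convert 'SECOND' from string to list of strings
df['SECOND_LIST'] = df['SECOND'].apply(ast.literal_eval)
# convert 'COUNT' from string to list of integers
df['COUNT_LIST'] = df['COUNT'].apply(lambda x: list(map(int, ast.literal_eval(x))))
# pair every 'SECOND' item with its count: one dict per row, e.g. {'A': 2, 'B': 1}
df['SECOND_COUNT_DICT'] = [dict(zip(keys, counts)) for keys, counts in zip(df['SECOND_LIST'], df['COUNT_LIST'])]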

from typing import List, Dict, Set  # import support for some type hints


def merge_dicts_sum(dict_list: List[Dict[str, int]]) -> Dict[str, int]:
    """
    merge a list of dicts to one dict, summing values for the same keys
    """
    keys: Set[str] = set()  # set of all unique keys of all dicts in list of dicts
    for dict_item in dict_list:
        keys.update(dict_item.keys())
    result_dict: Dict[str, int] = {}  # we will collect sums of values here
    for key in keys:
        result_dict[key] = 0  # start the running sum for this key at zero
        for dict_item in dict_list:
            if key in dict_item:
                result_dict[key] += dict_item[key]
    return dict(sorted(result_dict.items()))  # sort result by key then return it

# create a new dataframe by grouping by field 'FIRST' and aggregating dicts from SECOND_COUNT_DICT to one list
df_gr = df.groupby('FIRST').agg(SECOND_COUNT_DICT_LIST=('SECOND_COUNT_DICT',list))

# merge dicts from 'SECOND_COUNT_DICT_LIST' to one dict using function 'merge_dicts_sum'
df_gr['MERGED_DICT'] = df_gr['SECOND_COUNT_DICT_LIST'].apply(merge_dicts_sum)

df_gr['SECOND'] = df_gr['MERGED_DICT'].apply(lambda x: list(x.keys()))  # place keys from 'MERGED_DICT' to field 'SECOND'
df_gr['COUNT'] = df_gr['MERGED_DICT'].apply(lambda x: list(x.values()))  # place values from 'MERGED_DICT' to field 'COUNT'

# keep only 'SECOND' and 'COUNT'; reset_index() turns the 'FIRST' group key back
# into a regular column so that it is not lost by to_html(index=False)
df_result = df_gr[['SECOND', 'COUNT']].reset_index()

# store df_result as html
result_str = df_result.to_html(index=False)

print(result_str)
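
As a possible alternative to merging dictionaries, the same aggregation can be done with pandas alone by exploding the lists into one row per event. This is only a sketch under assumptions: it builds the DataFrame directly from real lists with integer counts instead of parsing HTML, and passing a list of columns to `DataFrame.explode` requires pandas 1.3 or newer. The names `long_df`, `summed` and `result` are illustrative and not part of the answer above.

```python
import pandas as pd

# same sample data as in the question, with real lists and integer counts (assumption)
df = pd.DataFrame({
    "FIRST": [1, 2, 1, 2],
    "SECOND": [["A", "B"], ["C", "D"], ["A", "E"], ["C", "F"]],
    "COUNT": [[2, 1], [1, 1], [1, 1], [2, 1]],
})

# one row per (FIRST, SECOND, COUNT) event; both list columns are exploded in parallel
long_df = df.explode(["SECOND", "COUNT"], ignore_index=True).astype({"COUNT": "int64"})

# sum the counts of equal SECOND values within each FIRST group
summed = long_df.groupby(["FIRST", "SECOND"], as_index=False)["COUNT"].sum()

# collapse back to one row per FIRST, collecting keys and sums into lists
result = (
    summed.groupby("FIRST")
    .agg(SECOND=("SECOND", list), COUNT=("COUNT", list))
    .reset_index()
)

print(result)
```

Because `groupby` sorts its keys by default, the lists come out ordered by the "SECOND" value, which matches the sorted dictionary returned by `merge_dicts_sum`.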