Question

I have a DataFrame that looks like this.

s_id  h_id   h_val  h_others
1      600     5    {700,500}
1      700     12   {600,500,400}
1      500     6    {600,700}
2     ...     ...    ...

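What I want to do is, grouping by s_id, go through h_others and check whether each ID in the set is found in h_id for that particular s_id. If it is found, I want to map it to the corresponding value in h_val, add those values up, and create a new column holding the sum of the mapped h_others values. If an ID is not found, it can simply be mapped to 0 so that it does not affect the sum.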

Expected output:

s_id  h_id   h_val  h_others       sum_h_others
1      600     5    {700,500}       18     
1      700     12   {600,500,400}   11
1      500     6    {600,700}       17     
2     ...     ...    ...
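For anyone who wants to reproduce the example, the first group can be rebuilt as below. This is only a sketch: it assumes h_others holds Python sets of integers, and it leaves out the elided rows for s_id 2.

```python
import pandas as pd

# Sample data copied from the table above; h_others is assumed to hold sets of ints
df = pd.DataFrame({
    "s_id": [1, 1, 1],
    "h_id": [600, 700, 500],
    "h_val": [5, 12, 6],
    "h_others": [{700, 500}, {600, 500, 400}, {600, 700}],
})
```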

Answer 1

Let's borrow the unnesting function from @WeNYoBen, modified slightly so that it works with your sets. The calculation can then be done with a merge.

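import pandas as pd
from itertools import chain

def unnesting(df, explode):
    # Repeat each index label once per element of the nested column
    idx = df.index.repeat(df[explode[0]].str.len())
    # Flatten every nested column into one long column
    df1 = pd.concat([
        pd.DataFrame({x: [*chain.from_iterable(df[x].to_numpy())]}) for x in explode], axis=1)
    df1.index = idx

    # Re-attach the remaining columns by index (drop takes columns= in current pandas)
    return df1.join(df.drop(columns=explode), how='left')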

df1 = unnesting(df, explode=['h_others'])

s = (df1.reset_index().merge(df.reset_index(), 
                             left_on=['h_others', 's_id'], 
                             right_on=['h_id', 's_id'])
         .query('index_x != index_y')
         .groupby('index_x').h_val_y.sum())

df['sum_h_others'] = s

Output:

   s_id  h_id  h_val         h_others  sum_h_others
0     1   600      5       {700, 500}            18
1     1   700     12  {600, 500, 400}            11
2     1   500      6       {600, 700}            17

A more direct option is to map after unnesting, but the apply will make it slower:

d = {(k1, k2): v for k1, k2, v in zip(*df[['s_id', 'h_id', 'h_val']].to_numpy().T)}
#{(1, 500): 6, (1, 600): 5, (1, 700): 12}

df['sum_h_others'] = df1[['s_id', 'h_others']].apply(tuple, 1).map(d).groupby(level=0).sum()
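As a hedged alternative to the custom unnesting helper, recent pandas versions ship `DataFrame.explode`, which can do the flattening step. The sketch below assumes h_others holds non-empty sets of integers and rebuilds the sample frame so it runs on its own. Unlike the merge above, it does not filter out a row that lists its own h_id, and the name `exploded` is just illustrative.

```python
import pandas as pd

# Sample frame from the question; h_others is assumed to hold non-empty sets of ints
df = pd.DataFrame({
    "s_id": [1, 1, 1],
    "h_id": [600, 700, 500],
    "h_val": [5, 12, 6],
    "h_others": [{700, 500}, {600, 500, 400}, {600, 700}],
})

# One row per (original row, referenced id); sets become sorted lists first
exploded = df[["s_id"]].assign(h_others=df["h_others"].map(sorted)).explode("h_others")
exploded["h_others"] = exploded["h_others"].astype("int64")

# A left merge keeps ids with no matching h_id; their h_val is NaN
merged = exploded.reset_index().merge(
    df[["s_id", "h_id", "h_val"]],
    left_on=["s_id", "h_others"],
    right_on=["s_id", "h_id"],
    how="left",
)

# sum() skips NaN, so unmatched ids contribute nothing to the total
df["sum_h_others"] = merged.groupby("index")["h_val"].sum().astype(int)
```

The left merge is what implements the "map missing IDs to 0" requirement here, since the unmatched rows survive the join and are simply ignored by the sum.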

Answer 2

Here is one possible way to do it:

import pandas as pd
import ast
from io import StringIO

df = pd.read_table(StringIO("""s_id  h_id   h_val  h_others
1      600     5    {700,500}
1      700     12   {600,500,400}
1      500     6    {600,700}"""), sep=r'\s+')

summs = []
for s_id, s in zip(df.s_id, df.h_others):
    summ = 0
    # h_others is read as text such as "{700,500}"; parse it into a set
    for d in ast.literal_eval(s):
        # Sum h_val over rows of the same s_id whose h_id equals d (0 if none match)
        summ += df.loc[(df['s_id'] == s_id) & (df['h_id'] == d), 'h_val'].sum()
    summs.append(summ)
df['sum_h_others'] = summs

Output:

   s_id  h_id  h_val       h_others  sum_h_others
0     1   600      5      {700,500}            18
1     1   700     12  {600,500,400}            11
2     1   500      6      {600,700}            17
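The loop above filters the whole DataFrame once per ID. A possible variation, sketched here under the same assumption that h_others arrives as text like "{700,500}", builds a plain dictionary keyed by (s_id, h_id) once and uses `dict.get` with a default of 0 for IDs that are not found. The names `lookup` and `parsed` are hypothetical helpers, not part of the original answer.

```python
import ast
import pandas as pd

# Same sample data as above, with h_others stored as strings
df = pd.DataFrame({
    "s_id": [1, 1, 1],
    "h_id": [600, 700, 500],
    "h_val": [5, 12, 6],
    "h_others": ["{700,500}", "{600,500,400}", "{600,700}"],
})

# (s_id, h_id) -> h_val, built once up front
lookup = {(s, h): v for s, h, v in zip(df["s_id"], df["h_id"], df["h_val"])}

# Parse each string into a real set of ints
parsed = df["h_others"].map(ast.literal_eval)

# Missing ids fall back to 0 and therefore do not change the sum
df["sum_h_others"] = [
    sum(lookup.get((s, h), 0) for h in others)
    for s, others in zip(df["s_id"], parsed)
]
```

Because the dictionary is keyed by both s_id and h_id, an ID that exists only in a different group still counts as missing for the current row.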

Mapping values from one column to another (if they exist) when grouping by ID
