Itertools combinations on a Python DataFrame

Time: 2018-09-10 16:20:34

Tags: python pandas

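Somehow I can't find a solution that fits my problem. For every pair of Unit entries that share the same Scen within the same fact_date, I want to compute the sum of their Value. The output should look like this:

Output: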

Combination Unit_Com    Scen    Value_Sum   Town    Country
11-Apr      a,b         1       28          Town A  USA
11-Apr      a,b         2       31          Town A  USA
11-Apr      a,c         1       30          Town A  USA
11-Apr      a,c         2       30          Town A  USA
11-Apr      a,d         1       31          Town A  USA
11-Apr      a,d         2       29          Town A  USA
11-Apr      b,c         1       32          Town A  USA
11-Apr      b,c         2       39          Town A  USA
11-Apr      b,d         1       33          Town A  USA
11-Apr      b,d         2       38          Town A  USA
11-Apr      c,d         1       35          Town A  USA
11-Apr      c,d         2       37          Town A  USA
10-Apr      a,b         1       28          Town A  USA
10-Apr      a,b         2       25          Town A  USA
10-Apr      a,c         1       32          Town A  USA
10-Apr      a,c         2       26          Town A  USA
10-Apr      a,d         1       38          Town A  USA
10-Apr      a,d         2       22          Town A  USA
10-Apr      b,c         1       24          Town A  USA
10-Apr      b,c         2       27          Town A  USA
10-Apr      b,d         1       30          Town A  USA
10-Apr      b,d         2       23          Town A  USA
10-Apr      c,d         1       34          Town A  USA
10-Apr      c,d         2       24          Town A  USA

It is calculated as follows:

fact_date: 11-Apr
Town: Town A
Country: USA

Unit: a
Scen(Unit a): 1
Value: 13

Unit: b
Scen(Unit b): 1
Value: 15

**Output (as shown above):**
fact_date: 11-Apr
Unit_Com: a,b
Scen: 1
Value_Sum: 28
Town: Town A
Country: USA

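This should then be done for every fact_date.

Finally, I also need the combinations between Town A and Town B, e.g. a,e and so on.

Unfortunately I am not getting all of the combinations, and I am stuck here:

Update:

I updated the code, but I am still getting the wrong output: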

calculating date: 11-Apr
11-Apr 1,1 a,b Town A,Town A USA,USA 28
11-Apr 1,2 a,b Town A,Town A USA,USA 33
11-Apr 1,1 a,c Town A,Town A USA,USA 30
11-Apr 1,2 a,c Town A,Town A USA,USA 32
11-Apr 1,1 a,d Town A,Town A USA,USA 31
11-Apr 1,2 a,d Town A,Town A USA,USA 31
11-Apr 1,1 a,b Town A,Town A USA,USA 23
11-Apr 1,2 a,b Town A,Town A USA,USA 26
11-Apr 1,1 a,c Town A,Town A USA,USA 27
11-Apr 1,2 a,c Town A,Town A USA,USA 27
11-Apr 1,1 a,d Town A,Town A USA,USA 33
11-Apr 1,2 a,d Town A,Town A USA,USA 23
calculating date: 10-Apr
10-Apr 2,1 a,b Town A,Town A USA,USA 26
10-Apr 2,2 a,b Town A,Town A USA,USA 31
10-Apr 2,1 a,c Town A,Town A USA,USA 28
10-Apr 2,2 a,c Town A,Town A USA,USA 30
10-Apr 2,1 a,d Town A,Town A USA,USA 29
10-Apr 2,2 a,d Town A,Town A USA,USA 29
10-Apr 2,1 a,b Town A,Town A USA,USA 21
10-Apr 2,2 a,b Town A,Town A USA,USA 24
10-Apr 2,1 a,c Town A,Town A USA,USA 25
10-Apr 2,2 a,c Town A,Town A USA,USA 25
10-Apr 2,1 a,d Town A,Town A USA,USA 31
10-Apr 2,2 a,d Town A,Town A USA,USA 21

The code is as follows:

import pandas as pd

df = pd.DataFrame({'fact_date': ['11-Apr','11-Apr','11-Apr','11-Apr','11-Apr','11-Apr','11-Apr','11-Apr','10-Apr','10-Apr','10-Apr','10-Apr','10-Apr','10-Apr','10-Apr','10-Apr','11-Apr','11-Apr','11-Apr','11-Apr','11-Apr','11-Apr','11-Apr','11-Apr','10-Apr','10-Apr','10-Apr','10-Apr','10-Apr','10-Apr','10-Apr','10-Apr'],
                   'Unit': ['a','a','b','b','c','c','d','d','a','a','b','b','c','c','d','d','e','e','f','f','g','g','h','h','i','i','j','j','k','k','l','l'],
                   'Town': ['Town A','Town A','Town A','Town A','Town A','Town A','Town A','Town A','Town A','Town A','Town A','Town A','Town A','Town A','Town A','Town A','Town B','Town B','Town B','Town B','Town B','Town B','Town B','Town B','Town B','Town B','Town B','Town B','Town B','Town B','Town B','Town B'],
                   'Scen': [1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2],
                   'Value': [13,11,15,20,17,19,18,18,18,12,10,13,14,14,20,10,18,17,15,19,11,14,14,17,19,10,16,10,16,19,12,11],
                   'Country': ['USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA','USA']})


test_df = pd.DataFrame([])

cluster_names = df['fact_date'].unique()
disjoint_clusters = []
for idx,item in enumerate(cluster_names):
    df[df['fact_date'] == item]

    print('calculating date: ' +str(item))

    for j in range(idx+1, len(df)):
        if df.iloc[idx]['Unit'] != df.iloc[j]['Unit'] and df.iloc[idx]['Town'] == 'Town A' and df.iloc[j]['Town'] == 'Town A':

            print(item,
                  str(df.iloc[idx]['Scen'])+str(',')+str(df.iloc[j]['Scen']), 
                  df.iloc[idx]['Unit']+str(',')+df.iloc[j]['Unit'],
                  df.iloc[idx]['Town']+str(',')+df.iloc[j]['Town'],
                  df.iloc[idx]['Country']+str(',')+df.iloc[j]['Country'],
                  df.iloc[idx]['Value']+df.iloc[j]['Value'])

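1 Answer:

Answer 0: (score: 0)

This approach gives the expected output from your question. The idea is to use groupby on the columns 'fact_date','Country','Town','Scen', and then use combinations from itertools on each grouped dataframe to pair up the values in the 'Value' and 'Unit' columns. You can build the result dataframe directly with a list comprehension and pd.DataFrame: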

from itertools import combinations
df_res = pd.DataFrame([list(name_g) + [val1+val2,'{},{}'.format(unit1,unit2)] 
                       for name_g, df_g in df.groupby(['fact_date','Country','Town','Scen']) 
                       for ((val1, unit1), (val2, unit2)) in combinations(df_g[['Value','Unit']].values,2)],
                      columns=['Combination','Country','Town','Scen','Value_Sum','Unit_Com'])
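
To make the inner step of that comprehension easier to follow, here is a minimal sketch of what happens inside a single group. It assumes the group for 11-Apr, USA, Town A, Scen 1, with values copied from the question; the names `group`, `rows` and `pairs` are only for illustration.

```python
from itertools import combinations

import pandas as pd

# One (fact_date, Country, Town, Scen) group taken from the question's data:
# 11-Apr, USA, Town A, Scen 1.
group = pd.DataFrame({'Unit': ['a', 'b', 'c', 'd'],
                      'Value': [13, 15, 17, 18]})

rows = []
# combinations(..., 2) yields every unordered pair of rows exactly once,
# in the order in which the rows appear in the group.
for (val1, unit1), (val2, unit2) in combinations(
        group[['Value', 'Unit']].itertuples(index=False, name=None), 2):
    rows.append({'Unit_Com': '{},{}'.format(unit1, unit2),
                 'Value_Sum': val1 + val2})

pairs = pd.DataFrame(rows)
print(pairs)
```

Four units give six pairs per group, which is why each date and Scen contributes six rows for Town A in the result.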

You may need to reorder the columns. To restrict the result to the same scope as your expected output (Town A only), you can do:

print (df_res[df_res['Town'] == 'Town A'])
   Combination Country    Town  Scen  Value_Sum Unit_Com
0       10-Apr     USA  Town A     1         28      a,b
1       10-Apr     USA  Town A     1         32      a,c
2       10-Apr     USA  Town A     1         38      a,d
3       10-Apr     USA  Town A     1         24      b,c
4       10-Apr     USA  Town A     1         30      b,d
5       10-Apr     USA  Town A     1         34      c,d
6       10-Apr     USA  Town A     2         25      a,b
7       10-Apr     USA  Town A     2         26      a,c
8       10-Apr     USA  Town A     2         22      a,d
9       10-Apr     USA  Town A     2         27      b,c
10      10-Apr     USA  Town A     2         23      b,d
11      10-Apr     USA  Town A     2         24      c,d
24      11-Apr     USA  Town A     1         28      a,b
25      11-Apr     USA  Town A     1         30      a,c
26      11-Apr     USA  Town A     1         31      a,d
27      11-Apr     USA  Town A     1         32      b,c
28      11-Apr     USA  Town A     1         33      b,d
29      11-Apr     USA  Town A     1         35      c,d
30      11-Apr     USA  Town A     2         31      a,b
31      11-Apr     USA  Town A     2         30      a,c
32      11-Apr     USA  Town A     2         29      a,d
33      11-Apr     USA  Town A     2         39      b,c
34      11-Apr     USA  Town A     2         38      b,d
35      11-Apr     USA  Town A     2         37      c,d
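
If you want the layout from the question, a possible sketch for the reordering is shown here. It assumes a `df_res` with the column names used above, rebuilt from three of the printed rows so the snippet runs on its own; `wanted` and `df_out` are illustrative names. The dates are plain strings, so the sort is lexicographic, which happens to be enough for this data.

```python
import pandas as pd

# A few rows copied from the printed result, in the answer's column order.
df_res = pd.DataFrame(
    [['10-Apr', 'USA', 'Town A', 1, 28, 'a,b'],
     ['11-Apr', 'USA', 'Town A', 1, 28, 'a,b'],
     ['11-Apr', 'USA', 'Town A', 2, 31, 'a,b']],
    columns=['Combination', 'Country', 'Town', 'Scen', 'Value_Sum', 'Unit_Com'])

# Column order used in the question's desired output.
wanted = ['Combination', 'Unit_Com', 'Scen', 'Value_Sum', 'Town', 'Country']

# Latest date first, then unit pair and scenario ascending, as in the question.
df_out = (df_res[wanted]
          .sort_values(['Combination', 'Unit_Com', 'Scen'],
                       ascending=[False, True, True])
          .reset_index(drop=True))
print(df_out)
```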

EDIT: to do the same thing across towns, you can do:

df_res = pd.DataFrame([list(name_g) + [val1+val2,'{},{}'.format(unit1,unit2), '{},{}'.format(town1,town2)] 
                       for name_g, df_g in df.groupby(['fact_date','Country','Scen']) 
                       for ((val1, unit1, town1), (val2, unit2, town2)) in combinations(df_g[['Value','Unit','Town']].values,2)],
                      columns=['Combination','Country','Scen','Value_Sum','Unit_Com','Town'])

The difference is that the Town column is no longer in the groupby but among the columns passed to combinations, with a few small changes to make that work.
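
If you only want the pairs that mix Town A and Town B (such as a,e), one option is to add a condition to the comprehension. This is a sketch on a small subset of the question's data (11-Apr, Scen 1, two units per town), and the `if town1 != town2` filter is my assumption about what "combinations between Town A and Town B" should mean.

```python
from itertools import combinations

import pandas as pd

# Subset of the question's data: 11-Apr, Scen 1, two units per town.
df = pd.DataFrame({'fact_date': ['11-Apr'] * 4,
                   'Unit': ['a', 'b', 'e', 'f'],
                   'Town': ['Town A', 'Town A', 'Town B', 'Town B'],
                   'Scen': [1, 1, 1, 1],
                   'Value': [13, 15, 18, 15],
                   'Country': ['USA'] * 4})

df_res = pd.DataFrame(
    [list(name_g) + [val1 + val2,
                     '{},{}'.format(unit1, unit2),
                     '{},{}'.format(town1, town2)]
     for name_g, df_g in df.groupby(['fact_date', 'Country', 'Scen'])
     for (val1, unit1, town1), (val2, unit2, town2)
     in combinations(df_g[['Value', 'Unit', 'Town']].values, 2)
     if town1 != town2],  # keep only pairs that span two different towns
    columns=['Combination', 'Country', 'Scen', 'Value_Sum', 'Unit_Com', 'Town'])
print(df_res)
```

Dropping the `if` clause gives back all pairs, including the ones inside a single town.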

To pick some of these combinations at random, I suggest looking at the sample method. For example, if you want 10 of them, you can do:

print (df_res.sample(n=10))
    Combination Country  Scen  Value_Sum Unit_Com           Town
7        10-Apr     USA     1         24      b,c  Town A,Town A
66       11-Apr     USA     1         30      b,f  Town A,Town B
31       10-Apr     USA     2         22      a,i  Town A,Town B
18       10-Apr     USA     1         39      d,i  Town A,Town B
72       11-Apr     USA     1         28      c,g  Town A,Town B
109      11-Apr     USA     2         33      f,g  Town B,Town B
41       10-Apr     USA     2         24      c,d  Town A,Town A
99       11-Apr     USA     2         38      c,f  Town A,Town B
84       11-Apr     USA     2         31      a,b  Town A,Town A
88       11-Apr     USA     2         30      a,f  Town A,Town B
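
Note that sample returns different rows on every call. If you need the same selection each time, passing random_state should do it. A small sketch on a stand-in frame (the six Town A pairs for 11-Apr, Scen 1), where the seed value 0 is arbitrary:

```python
import pandas as pd

# Stand-in for df_res: the six Town A pairs for 11-Apr, Scen 1.
df_res = pd.DataFrame({'Unit_Com': ['a,b', 'a,c', 'a,d', 'b,c', 'b,d', 'c,d'],
                       'Value_Sum': [28, 30, 31, 32, 33, 35]})

# The same random_state gives the same rows on every call;
# by default rows are drawn without replacement.
first = df_res.sample(n=3, random_state=0)
second = df_res.sample(n=3, random_state=0)
print(first)
```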