Python pandas DataFrame: how to get all combinations of column values?

Date: 2018-08-01 19:16:42

Tags: python pandas dataframe

I have a pandas DataFrame that looks like this:

        colour   points
0         red         1
1      yellow        10
2       black        -3

Then I am trying to run the following algorithm:

combos = []
points = []

for i1 in range(len(df)):
  for i2 in range(len(df)):

    colour_main      = df['colour'].values[i1]
    colour_secondary = df['colour'].values[i2]
    combo = colour_main + "_" + colour_secondary

    point1 = df['points'].values[i1]
    point2 = df['points'].values[i2]
    new_points = point1 + point2

    combos.append(combo)
    points.append(new_points)

df_new = pd.DataFrame({'colours': combos,
                       'points': points})

print(df_new)

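I want to get all combinations and their sums:

  • If a colour is the main colour, I want to add its value
  • If a colour is used as the secondary colour, I want to add the opposite (negated) value

Example: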

red_yellow =  1 + (-10) = -9
red_black  =  1 + ( +3) =  4
black_red  = -3 + ( -1) = -4

The output I currently get:

         colours  points
0        red_red       2
1     red_yellow      11
2      red_black      -2
3     yellow_red      11
4  yellow_yellow      20
5   yellow_black       7
6      black_red      -2
7   black_yellow       7
8    black_black      -6

The output I am looking for:

red_yellow       -9
red_black         4
yellow_red        9
yellow_black     13
black_red        -4
black_yellow    -13

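I don't know how to apply my logic to this code. I bet there is a simpler way to get all combinations without two loops, but for now this is the only thing I could come up with.

I would like to:

  • Get the output shown above
  • Improve performance for the case of 20 input colours
  • Remove the self-pairs such as red_red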

4 answers:

Answer 0 (score: 4)

Here is a timeit comparison of several alternatives.

| method             | ms per loop |
|--------------------+-------------|
| alt2               |        2.36 |
| using_concat       |        3.26 |
| using_double_merge |        22.4 |
| orig               |        22.6 |
| alt                |        45.8 |

The timeit results were generated with IPython:

In [138]: df = make_df(20)

In [143]: %timeit alt2(df)
100 loops, best of 3: 2.36 ms per loop
In [140]: %timeit orig(df)
10 loops, best of 3: 22.6 ms per loop
In [142]: %timeit alt(df)
10 loops, best of 3: 45.8 ms per loop
In [169]: %timeit using_double_merge(df)
10 loops, best of 3: 22.4 ms per loop
In [170]: %timeit using_concat(df)
100 loops, best of 3: 3.26 ms per loop

import itertools

import numpy as np
import pandas as pd


def alt(df):
    # Cross join via a constant key; assign() leaves the caller's df untouched
    df = df.assign(const=1)
    result = pd.merge(df, df, on='const', how='outer')
    result = result.loc[result['colour_x'] != result['colour_y']].copy()

    result['color'] = result['colour_x'] + '_' + result['colour_y']
    result['points'] = result['points_x'] - result['points_y']
    result = result[['color', 'points']]
    return result

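def alt2(df):
    # Use plain arrays: ufunc.outer on a Series is not supported in recent pandas
    p = df['points'].to_numpy()
    # Entry [i, j] is p[i] - p[j] for every (main, secondary) pair
    points = np.add.outer(p, -p)
    color = pd.MultiIndex.from_product([df['colour'], df['colour']])
    # MultiIndex.labels was renamed to MultiIndex.codes in pandas 0.24
    mask = color.codes[0] != color.codes[1]
    color = color.map('_'.join)
    result = pd.DataFrame({'points':points.ravel(), 'color':color})
    result = result.loc[mask]
    return result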

def orig(df):
    combos = []
    points = []

    for i1 in range(len(df)):
        for i2 in range(len(df)):
            colour_main = df['colour'].iloc[i1]
            colour_secondary = df['colour'].iloc[i2]
            if colour_main != colour_secondary:
                combo = colour_main + "_" + colour_secondary

                point1 = df['points'].values[i1]
                point2 = df['points'].values[i2]
                new_points = point1 - point2

                combos.append(combo)
                points.append(new_points)

    return pd.DataFrame({'color':combos, 'points':points})

def using_concat(df):
    """https://stackoverflow.com/a/51641085/190597 (RafaelC)"""
    d = df.set_index('colour').to_dict()['points']
    s = pd.Series(list(itertools.combinations(df.colour, 2)))
    s = pd.concat([s, s.transform(lambda k: k[::-1])])
    v = s.map(lambda k: d[k[0]] - d[k[1]])
    df2 = pd.DataFrame({'comb': s.str.get(0)+'_' + s.str.get(1), 'values': v})
    return df2

def using_double_merge(df):
    """https://stackoverflow.com/a/51641007/190597 (sacul)"""
    new = (df.reindex(pd.MultiIndex.from_product([df.colour, df.colour]))
           .reset_index()
           .drop(['colour', 'points'], axis=1)
           .merge(df.set_index('colour'), left_on='level_0', right_index=True)
           .merge(df.set_index('colour'), left_on='level_1', right_index=True))

    new['points_y'] *= -1
    # Add only the two points columns; level_0 and level_1 hold strings
    new['sum'] = new['points_x'] + new['points_y']
    new = new[new.level_0 != new.level_1].drop(['points_x', 'points_y'], axis=1)
    new['colours'] = new[['level_0', 'level_1']].apply(lambda x: '_'.join(x), axis=1)
    return new[['colours', 'sum']]

def make_df(N):
    df = pd.DataFrame({'colour': np.arange(N), 
                       'points': np.random.randint(10, size=N)})
    df['colour'] = df['colour'].astype(str)
    return df

The main idea in alt2 is to use np.add.outer to build a table that adds every entry of df['points'] to the negation of every entry of df['points']:

In [149]: points = np.add.outer(df['points'], -df['points'])
In [151]: points
Out[151]: 
array([[  0,  -9,   4],
       [  9,   0,  13],
       [ -4, -13,   0]])
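
The same table can also be written with plain NumPy broadcasting, which may make it clearer what np.add.outer computes. This is a minimal sketch, assuming the three-row example from the question; the names p, table_outer and table_broadcast are only for illustration.

```python
import numpy as np

# Points from the question's example: red=1, yellow=10, black=-3
p = np.array([1, 10, -3])

# Entry [i, j] is p[i] + (-p[j]), i.e. main colour minus secondary colour
table_outer = np.add.outer(p, -p)

# Broadcasting a column vector against a row vector builds the same table
table_broadcast = p[:, None] - p[None, :]

print(table_outer)
```

Row i holds the results for colour i as the main colour, so the zero diagonal corresponds to the self-pairs that are filtered out later.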

ravel is used to flatten the array to one dimension:

In [152]: points.ravel()
Out[152]: array([  0,  -9,   4,   9,   0,  13,  -4, -13,   0])

The colour combinations are generated with pd.MultiIndex.from_product:

In [153]: color = pd.MultiIndex.from_product([df['colour'], df['colour']])
In [155]: color = color.map('_'.join)
In [156]: color
Out[156]: 
Index(['red_red', 'red_yellow', 'red_black', 'yellow_red', 'yellow_yellow',
       'yellow_black', 'black_red', 'black_yellow', 'black_black'],
      dtype='object')

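A mask is generated to remove the pairs in which a colour is combined with itself. It has to be computed while color is still a MultiIndex, that is, before the map call above (MultiIndex.labels was renamed to MultiIndex.codes in pandas 0.24):

mask = color.codes[0] != color.codes[1]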
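
When every colour appears only once, the same mask can be built from positions alone, since the self-pairs are exactly the diagonal of the table. This is a sketch under that assumption; the variable n and the hard-coded points are taken from the question's example and are not part of the answer's code.

```python
import numpy as np

n = 3  # number of colours in the question's example

# True everywhere except on the diagonal, where main == secondary
mask = ~np.eye(n, dtype=bool).ravel()

# Flattened table of main minus secondary for red=1, yellow=10, black=-3
points = np.add.outer([1, 10, -3], [-1, -10, 3]).ravel()

print(points[mask])
```

If a colour name occurs in more than one row, the two masks differ: the mask built from the MultiIndex drops every pair with equal names, while the positional one only drops a row paired with itself.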

Then result is built from these pieces:

result = pd.DataFrame({'points':points.ravel(), 'color':color})
result = result.loc[mask]

The idea behind alt is explained in the original answer, here.
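
In short, alt pairs every row with every row by merging df with itself on a constant column. Assuming pandas 1.2 or later, the same cross join can be requested directly with how='cross', so no helper column is needed. The sketch below uses the question's three-row example; the names pairs and result are only for illustration, and it has not been timed against the alternatives above.

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "yellow", "black"],
                   "points": [1, 10, -3]})

# Pair every row with every row; overlapping columns get _x / _y suffixes
pairs = df.merge(df, how="cross")

# Drop pairs where a colour is combined with itself
pairs = pairs[pairs["colour_x"] != pairs["colour_y"]]

# Main colour counts positively, secondary colour negatively
result = pd.DataFrame({
    "color": pairs["colour_x"] + "_" + pairs["colour_y"],
    "points": pairs["points_x"] - pairs["points_y"],
}).reset_index(drop=True)

print(result)
```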

Answer 1 (score: 2)

This is a bit clunky, but it gives you the output you want:

new = (df.reindex(pd.MultiIndex.from_product([df.colour, df.colour]))
       .reset_index()
       .drop(['colour', 'points'], axis=1)
       .merge(df.set_index('colour'), left_on='level_0', right_index=True)
       .merge(df.set_index('colour'), left_on='level_1', right_index=True))

# Negate the secondary colour's points so that sum = main - secondary
new['points_y'] *= -1

# Add only the two points columns; level_0 and level_1 hold strings
new['sum'] = new['points_x'] + new['points_y']

new = new[new.level_0 != new.level_1].drop(['points_x', 'points_y'], axis=1)

new['colours'] = new[['level_0', 'level_1']].apply(lambda x: '_'.join(x), axis=1)


>>> new
  level_0 level_1  sum       colours
3  yellow     red    9    yellow_red
6   black     red   -4     black_red
1     red  yellow   -9    red_yellow
7   black  yellow  -13  black_yellow
2     red   black    4     red_black
5  yellow   black   13  yellow_black

Answer 2 (score: 2)

import itertools

# Lookup table: colour -> points
d = df.set_index('colour').to_dict()['points']
# Unordered pairs first, then the same pairs reversed
s = pd.Series(list(itertools.combinations(df.colour, 2)))
s = pd.concat([s, s.transform(lambda k: k[::-1])])
v = s.map(lambda k: d[k[0]] - d[k[1]])
df2 = pd.DataFrame({'comb': s.str.get(0) + '_' + s.str.get(1), 'values': v})

           comb  values
0    red_yellow      -9
1     red_black       4
2  yellow_black      13
0    yellow_red       9
1     black_red      -4
2  black_yellow     -13

Answer 3 (score: 1)

You have to change this line in your code

new_points = point1 + point2

to this

new_points = point1 - point2
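
That one-line change fixes the sign, but the loop still produces pairs such as red_red. Here is a minimal sketch of a loop that skips them as well, using itertools.permutations from the standard library and assuming every colour appears only once. The dictionary points_by_colour is a hypothetical stand-in for the question's DataFrame.

```python
from itertools import permutations

# Stand-in for the question's DataFrame: colour -> points
points_by_colour = {"red": 1, "yellow": 10, "black": -3}

# permutations(..., 2) yields every ordered pair of distinct items,
# so self-pairs such as red_red never appear
result = {
    f"{main}_{secondary}": points_by_colour[main] - points_by_colour[secondary]
    for main, secondary in permutations(points_by_colour, 2)
}

for combo, total in result.items():
    print(combo, total)
```

With the real data, something like dict(zip(df['colour'], df['points'])) could build the lookup dictionary.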