Improving performance when mapping certain elements of a DataFrame based on other values

Date: 2018-03-18 16:43:16

Tags: python python-2.7 performance pandas dataframe

I have a large CSV table containing rows like these:

     clients   id    products type1 type2  value_x  value_y
0    bob       111   pen      A     X      100      5
1    zoe       112   glue     A     Y      33       3
2    alex      113   glue     B     Y      50       1
3    alex      114   pen      A     X      100      5
4    bob       115   pen      B     Y      70       1

My goal is to map the comparable IDs from each client into a new DataFrame, like this:

     bob  zoe alex
id                
111  111  NaN  114
112  NaN  112  NaN
113  NaN  NaN  113
114  111  NaN  114
115  115  NaN  NaN

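I map these IDs by comparing the fields products, type1, type2, value_x and value_y. The only problem is that I currently do this by iterating over every row, which takes about 30 minutes given the size of the real DataFrame. Also, while the values of products, type1 and type2 are fixed, value_x and value_y can have a tolerance, which I would set with df.value_x.between(lower, upper) (left out of the example below for simplicity). Is there a way to speed this up, or some other way to map these comparable values?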
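To make the tolerance idea concrete, here is a minimal sketch of how a `between` mask could replace an exact equality test. The helper name `within_tolerance`, the tolerance of 5 and the toy data are assumptions for illustration only; they are not part of the question's real data.

```python
import pandas as pd

# Toy data, not the real table.
df = pd.DataFrame({'id': ['111', '112', '113'],
                   'value_x': [100, 33, 98]})

def within_tolerance(series, target, tol):
    # Hypothetical helper: True where target - tol <= value <= target + tol.
    # Series.between is inclusive on both ends by default.
    return series.between(target - tol, target + tol)

mask = within_tolerance(df.value_x, 100, 5)
matching_ids = df.loc[mask, 'id'].tolist()
```

In the loop below, such a mask would take the place of the `df.value_x == row_value_x` condition.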

Code used:

import numpy as np
import pandas as pd

data = {'type1': ['A', 'A', 'B', 'A', 'B'],
       'type2': ['X', 'Y', 'Y', 'X', 'Y'],
       'value_x': [100, 33, 50, 100, 70],
       'value_y': [5, 3, 1, 5, 1],
       'clients': ['bob', 'zoe', 'alex', 'alex', 'bob'],
       'id': ['111', '112', '113', '114', '115'],
       'products': ['pen', 'glue', 'glue', 'pen', 'pen']}

df = pd.DataFrame(data)

df_mapped = pd.DataFrame(index=df.id, columns=df.clients.unique())

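for row in df.itertuples():
    # Read fields by name rather than by position, so the loop does not
    # depend on the column order of the DataFrame.
    row_id = row.id
    row_product = row.products
    row_type1 = row.type1
    row_type2 = row.type2
    row_value_x = row.value_x
    row_value_y = row.value_y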

    for client in df_mapped.columns:
        try:
            comparable_id = df[(df.clients == client) &
                               (df.type1 == row_type1) &
                               (df.type2 == row_type2) &
                               (df.value_x == row_value_x) &
                               (df.value_y == row_value_y) &
                               (df.products == row_product)]['id'].iloc[0]
        except IndexError:
            comparable_id = np.nan

        df_mapped.loc[row_id, client] = comparable_id

print(df_mapped)

1 answer:

Answer 0 (score: 1)

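I don't see anything wrong with your logic, but two changes give a speedup of roughly 12x:

  • Drop down to NumPy for the array-versus-scalar comparisons.
  • Use the .iat accessor to set entries in the result DataFrame.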
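In isolation, the two changes look roughly like the sketch below. It uses a toy three-row frame and a made-up `match` column purely for illustration, and it calls `to_numpy()` (available from pandas 0.24) where the answer's code uses `.values`.

```python
import numpy as np
import pandas as pd

# Toy data, not the real table.
df = pd.DataFrame({'clients': ['bob', 'zoe', 'alex'],
                   'value_x': [100, 33, 100]})

# 1. Compare plain NumPy arrays with scalars; this avoids the index
#    handling that pandas performs on every Series comparison.
clients = df['clients'].to_numpy()
value_x = df['value_x'].to_numpy()
mask = (clients == 'alex') & (value_x == 100)

# 2. Write a single cell by integer position with .iat instead of .loc.
result = pd.DataFrame(index=df.index, columns=['match'])
result.iat[2, 0] = 'alex'
```

Both savings are per operation, so they matter mainly because the nested loop repeats them once per row and client.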

Benchmark results are below.

import numpy as np
import pandas as pd

data = {'type1': ['A', 'A', 'B', 'A', 'B'],
       'type2': ['X', 'Y', 'Y', 'X', 'Y'],
       'value_x': [100, 33, 50, 100, 70],
       'value_y': [5, 3, 1, 5, 1],
       'clients': ['bob', 'zoe', 'alex', 'alex', 'bob'],
       'id': ['111', '112', '113', '114', '115'],
       'products': ['pen', 'glue', 'glue', 'pen', 'pen']}

df = pd.DataFrame(data)

def jp(df):

    df_mapped = pd.DataFrame(index=df.id, columns=df.clients.unique())

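    # Fix the column order explicitly so the positional indices used below
    # (0 = clients, 2 = products, 3 = type1, ...) hold on any pandas version.
    df_values = df[['clients', 'id', 'products', 'type1', 'type2',
                    'value_x', 'value_y']].values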

    for i in range(df_values.shape[0]):
        row_client, row_id, row_product, row_type1, row_type2, row_value_x, row_value_y = df_values[i]

        for idx, client in enumerate(df_mapped):
            s = df.loc[(df_values[:, 0] == client) &
                       (df_values[:, 3] == row_type1) &
                       (df_values[:, 4] == row_type2) &
                       (df_values[:, 5] == row_value_x) &
                       (df_values[:, 6] == row_value_y) &
                       (df_values[:, 2] == row_product), 'id']

            try:
                comparable_id = s.iat[0]
            except IndexError:
                comparable_id = np.nan

            df_mapped.iat[i, idx] = comparable_id

    return df_mapped

def original(df):
    df_mapped = pd.DataFrame(index=df.id, columns=df.clients.unique())

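    for row in df.itertuples():
        # Read fields by name rather than by position, so the loop does not
        # depend on the column order of the DataFrame.
        row_id = row.id
        row_product = row.products
        row_type1 = row.type1
        row_type2 = row.type2
        row_value_x = row.value_x
        row_value_y = row.value_y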

        for client in df_mapped:
            try:
                comparable_id = df[(df.clients == client) &
                                   (df.type1 == row_type1) &
                                   (df.type2 == row_type2) &
                                   (df.value_x == row_value_x) &
                                   (df.value_y == row_value_y) &
                                   (df.products == row_product)]['id'].iloc[0]
            except IndexError:
                comparable_id = np.nan

            df_mapped.loc[row_id, client] = comparable_id

    return df_mapped

assert original(df).equals(jp(df))

%timeit jp(df)        # 7.5ms
%timeit original(df)  # 99ms
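
For the exact-match case only, the Python-level loops can probably be avoided altogether. The sketch below is not part of the answer above and has not been benchmarked here. It swaps in a different technique: take the first id per key and client with groupby, then merge that back onto the rows. It assumes the key columns contain no NaN, and it does not handle the value_x / value_y tolerance mentioned in the question.

```python
import pandas as pd

data = {'type1': ['A', 'A', 'B', 'A', 'B'],
        'type2': ['X', 'Y', 'Y', 'X', 'Y'],
        'value_x': [100, 33, 50, 100, 70],
        'value_y': [5, 3, 1, 5, 1],
        'clients': ['bob', 'zoe', 'alex', 'alex', 'bob'],
        'id': ['111', '112', '113', '114', '115'],
        'products': ['pen', 'glue', 'glue', 'pen', 'pen']}
df = pd.DataFrame(data)

keys = ['products', 'type1', 'type2', 'value_x', 'value_y']

# First id of each client for every distinct key combination,
# reshaped so that each client becomes a column.
first_ids = (df.groupby(keys + ['clients'], sort=False)['id']
               .first()
               .unstack('clients')
               .reset_index())

# Attach those per-key ids to every row, then keep one column per client
# in order of first appearance, as in the loop versions.
df_mapped = (df[['id'] + keys]
             .merge(first_ids, on=keys, how='left')
             .set_index('id')[list(df.clients.unique())])
```

On the sample data this gives the same mapping as the loop versions, for example id 114 maps to 111 for bob and to 114 for alex.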