I have a large CSV table containing rows like the following:
  clients   id products type1 type2  value_x  value_y
0     bob  111      pen     A     X      100        5
1     zoe  112     glue     A     Y       33        3
2    alex  113     glue     B     Y       50        1
3    alex  114      pen     A     X      100        5
4     bob  115      pen     B     Y       70        1
My goal is to map, in a new dataframe, the comparable ids from each client, like this:
     bob  zoe alex
id
111  111  NaN  114
112  NaN  112  NaN
113  NaN  NaN  113
114  111  NaN  114
115  115  NaN  NaN
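I map these ids by comparing the fields products, type1, type2, value_x and value_y. The only problem is that I currently do this by iterating over every row, which takes about 30 minutes given the size of the real dataframe. Also, while the values of products, type1 and type2 are fixed, value_x and value_y can have a tolerance, which I would set with df.value_x.between(lower, upper) (left out of the example below for simplicity).
Is there a way to speed this up, or some other way to map these comparable values?
Code used: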
import numpy as np
import pandas as pd

data = {'type1': ['A', 'A', 'B', 'A', 'B'],
        'type2': ['X', 'Y', 'Y', 'X', 'Y'],
        'value_x': [100, 33, 50, 100, 70],
        'value_y': [5, 3, 1, 5, 1],
        'clients': ['bob', 'zoe', 'alex', 'alex', 'bob'],
        'id': ['111', '112', '113', '114', '115'],
        'products': ['pen', 'glue', 'glue', 'pen', 'pen']}
# Fix the column order explicitly: the positional row[...] lookups below depend on it.
df = pd.DataFrame(data, columns=['clients', 'id', 'products', 'type1',
                                 'type2', 'value_x', 'value_y'])
df_mapped = pd.DataFrame(index=df.id, columns=df.clients.unique())

for row in df.itertuples():
    # row[0] is the index; the fields follow in column order.
    row_client = row[1]
    row_id = row[2]
    row_product = row[3]
    row_type1 = row[4]
    row_type2 = row[5]
    row_value_x = row[6]
    row_value_y = row[7]
    for client in df_mapped.columns:
        try:
            # First id of this client whose fields all match the current row.
            comparable_id = df[(df.clients == client) &
                               (df.type1 == row_type1) &
                               (df.type2 == row_type2) &
                               (df.value_x == row_value_x) &
                               (df.value_y == row_value_y) &
                               (df.products == row_product)]['id'].iloc[0]
        except IndexError:
            # This client has no comparable row.
            comparable_id = np.nan
        df_mapped.loc[row_id, client] = comparable_id

print(df_mapped)
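For reference, the tolerance check mentioned above (and left out of the loop) might look roughly like the sketch below. The helper name comparable_mask and the relative tolerances tol_x and tol_y are hypothetical, chosen only for illustration; the real bounds could just as well be absolute.

```python
import pandas as pd

df = pd.DataFrame({
    'clients': ['bob', 'zoe', 'alex', 'alex', 'bob'],
    'id': ['111', '112', '113', '114', '115'],
    'products': ['pen', 'glue', 'glue', 'pen', 'pen'],
    'type1': ['A', 'A', 'B', 'A', 'B'],
    'type2': ['X', 'Y', 'Y', 'X', 'Y'],
    'value_x': [100, 33, 50, 100, 70],
    'value_y': [5, 3, 1, 5, 1],
})


def comparable_mask(df, row, tol_x=0.05, tol_y=0.0):
    """Boolean mask of the rows comparable to `row`.

    tol_x and tol_y are hypothetical relative tolerances (0.05 means +/- 5%).
    """
    lower_x, upper_x = row.value_x * (1 - tol_x), row.value_x * (1 + tol_x)
    lower_y, upper_y = row.value_y * (1 - tol_y), row.value_y * (1 + tol_y)
    # Fixed fields must match exactly; value fields may fall within the bounds.
    # Series.between is inclusive on both ends by default.
    return ((df.products == row.products) &
            (df.type1 == row.type1) &
            (df.type2 == row.type2) &
            df.value_x.between(lower_x, upper_x) &
            df.value_y.between(lower_y, upper_y))


first_row = next(df.itertuples(index=False))  # bob, id 111
matches = df.loc[comparable_mask(df, first_row), 'id'].tolist()
```

Inside the loop, the mask would be combined with (df.clients == client) in place of the two exact value comparisons.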
Answer 0 (score: 1)
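I don't see a problem with your logic, but two changes give roughly a 12x speedup:
1. Use numpy for the array-versus-scalar comparisons.
2. Use the .iat accessor to set entries in the result dataframe.
Benchmark results are below.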
import numpy as np
import pandas as pd

data = {'type1': ['A', 'A', 'B', 'A', 'B'],
        'type2': ['X', 'Y', 'Y', 'X', 'Y'],
        'value_x': [100, 33, 50, 100, 70],
        'value_y': [5, 3, 1, 5, 1],
        'clients': ['bob', 'zoe', 'alex', 'alex', 'bob'],
        'id': ['111', '112', '113', '114', '115'],
        'products': ['pen', 'glue', 'glue', 'pen', 'pen']}
# Fix the column order explicitly: both functions below index columns by position.
df = pd.DataFrame(data, columns=['clients', 'id', 'products', 'type1',
                                 'type2', 'value_x', 'value_y'])

def jp(df):
    df_mapped = pd.DataFrame(index=df.id, columns=df.clients.unique())
    # Work on the underlying numpy array so each comparison is array vs. scalar.
    df_values = df.values
    for i in range(df_values.shape[0]):
        # Unpacking relies on the column order:
        # clients, id, products, type1, type2, value_x, value_y.
        row_client, row_id, row_product, row_type1, row_type2, row_value_x, row_value_y = df_values[i]
        for idx, client in enumerate(df_mapped):
            s = df.loc[(df_values[:, 0] == client) &
                       (df_values[:, 3] == row_type1) &
                       (df_values[:, 4] == row_type2) &
                       (df_values[:, 5] == row_value_x) &
                       (df_values[:, 6] == row_value_y) &
                       (df_values[:, 2] == row_product), 'id']
            try:
                comparable_id = s.iat[0]
            except IndexError:
                # This client has no comparable row.
                comparable_id = np.nan
            # Positional scalar assignment; cheaper than label-based .loc.
            df_mapped.iat[i, idx] = comparable_id
    return df_mapped

def original(df):
    df_mapped = pd.DataFrame(index=df.id, columns=df.clients.unique())
    for row in df.itertuples():
        row_client = row[1]
        row_id = row[2]
        row_product = row[3]
        row_type1 = row[4]
        row_type2 = row[5]
        row_value_x = row[6]
        row_value_y = row[7]
        for client in df_mapped:
            try:
                comparable_id = df[(df.clients == client) &
                                   (df.type1 == row_type1) &
                                   (df.type2 == row_type2) &
                                   (df.value_x == row_value_x) &
                                   (df.value_y == row_value_y) &
                                   (df.products == row_product)]['id'].iloc[0]
            except IndexError:
                comparable_id = np.nan
            df_mapped.loc[row_id, client] = comparable_id
    return df_mapped

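# Both versions must produce the same mapping.
assert original(df).equals(jp(df))

# %timeit is an IPython/Jupyter magic, not plain Python.
%timeit jp(df)        # 7.5 ms
%timeit original(df)  # 99 ms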
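
Going beyond the two changes above, the nested loops could probably be dropped altogether when the match is exact. The following is only a sketch, not something benchmarked here: it takes the first id per key combination and client with groupby, pivots the clients into columns, and merges the result back onto every row. The names keys, first_ids and df_mapped are illustrative, and rows with NaN in a key column would be dropped by groupby's default settings.

```python
import pandas as pd

df = pd.DataFrame({
    'clients': ['bob', 'zoe', 'alex', 'alex', 'bob'],
    'id': ['111', '112', '113', '114', '115'],
    'products': ['pen', 'glue', 'glue', 'pen', 'pen'],
    'type1': ['A', 'A', 'B', 'A', 'B'],
    'type2': ['X', 'Y', 'Y', 'X', 'Y'],
    'value_x': [100, 33, 50, 100, 70],
    'value_y': [5, 3, 1, 5, 1],
})

# Fields that must be equal for two rows to count as comparable.
keys = ['products', 'type1', 'type2', 'value_x', 'value_y']

# First id per (key combination, client), with one column per client.
first_ids = (df.groupby(keys + ['clients'], sort=False)['id']
               .first()
               .unstack('clients')
               .reset_index())

# Attach the per-client ids to every original row via its key combination,
# then keep the clients in their order of first appearance.
df_mapped = (df[['id'] + keys]
             .merge(first_ids, on=keys, how='left')
             .set_index('id')[list(df.clients.unique())])

print(df_mapped)
```

This covers exact matches only. A tolerance on value_x and value_y would need a different step, for example merging on the three fixed fields and then filtering the candidate pairs by the allowed range.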