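I am trying to encode the personId values. First I build a dictionary that stores the personId values, then I add the encoded values to a new column. Processing 70K rows takes about 25 minutes.
Dataset: https://www.kaggle.com/gspmoreira/articles-sharing-reading-from-cit-deskdrop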
import pandas as pd

interactions_df = pd.read_csv('./users_interactions.csv')
# Map an arbitrary integer code to each distinct personId
personId_map = {}
personId_len = range(0,len(set(interactions_df['personId'])))
for i in zip(personId_len, set(interactions_df['personId'])):
    personId_map[i[0]] = i[1]
Run
%%time
def transform_person_id(row):
    if row['personId'] in personId_map.values():
        return int([k for k,v in personId_map.items() if v == row['personId']][0])
interactions_df['new_personId'] = interactions_df.apply(lambda x: transform_person_id(x), axis=1)
interactions_df.head(3)
Time taken
CPU times: user 25min 46s, sys: 1.58 s, total: 25min 48s
Wall time: 25min 50s
How can I optimize the code above?
Answer 0 (score: 1)
If there is no special ordering requirement, use factorize:
interactions_df['new_personId'] = pd.factorize(interactions_df.personId)[0]
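To illustrate what factorize returns, here is a minimal sketch on a small made-up Series (the values are invented for illustration and are not from the dataset). By default the codes follow the order of first appearance; passing sort=True numbers them by sorted unique value instead, which may help if an ordering is needed.

```python
import pandas as pd

# Hypothetical toy IDs, not taken from the real dataset
s = pd.Series([30, 10, 30, 20, 10])

# Default: codes are assigned in order of first appearance
codes, uniques = pd.factorize(s)
print(list(codes), list(uniques))

# sort=True: codes follow the sorted order of the unique values
codes_sorted, uniques_sorted = pd.factorize(s, sort=True)
print(list(codes_sorted), list(uniques_sorted))
```

In both cases uniques[code] gives back the original value for a given code.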
If you also need the dictionary:
i, v = pd.factorize(interactions_df.personId)
personId_map = dict(zip(i, v[i]))
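Since the position of each value in the uniques array is its code, the same dictionary can probably be built more cheaply with enumerate, and inverting it gives an ID-to-code lookup that works with Series.map. The sketch below uses made-up toy data; the names code_to_id and id_to_code are hypothetical and not part of the original answer.

```python
import pandas as pd

# Hypothetical toy data standing in for the personId column
df = pd.DataFrame({'personId': [30, 10, 30, 20, 10]})

codes, uniques = pd.factorize(df['personId'])

# code -> original ID: the position in uniques is the code
code_to_id = dict(enumerate(uniques))

# original ID -> code: invert once, then reuse for vectorized lookups
id_to_code = {pid: code for code, pid in code_to_id.items()}

# Series.map performs one hash lookup per row instead of scanning the dict
df['new_personId'] = df['personId'].map(id_to_code)
print(df)
```

The inverted dictionary is also what the original loop was missing: looking up a key is a constant-time operation, whereas scanning personId_map.values() and personId_map.items() for every row touches every entry each time.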
Data - first 20 rows for testing:
interactions_df = pd.read_csv('./users_interactions.csv', nrows=20, usecols=['personId'])
#print (interactions_df)
personId_map = {}
personId_len = range(0,len(set(interactions_df['personId'])))
for i in zip(personId_len, set(interactions_df['personId'])):
    personId_map[i[0]] = i[1]
#print (personId_map)
def transform_person_id(row):
    if row['personId'] in personId_map.values():
        return int([k for k,v in personId_map.items() if v == row['personId']][0])
interactions_df['new_personId'] = interactions_df.apply(lambda x: transform_person_id(x), axis=1)
interactions_df['new_personId1'] = pd.factorize(interactions_df.personId)[0]
print (interactions_df)
                personId  new_personId  new_personId1
0   -8845298781299428018             3              0
1   -1032019229384696495             5              1
2   -1130272294246983140             9              2
3     344280948527967603             6              3
4    -445337111692715325             0              4
5   -8763398617720485024            10              5
6    3609194402293569455             4              6
7    4254153380739593270             8              7
8     344280948527967603             6              3
9    3609194402293569455             4              6
10   3609194402293569455             4              6
11   1908339160857512799            11              8
12   1908339160857512799            11              8
13   1908339160857512799            11              8
14   7781822014935525018             1              9
15   8239286975497580612             2             10
16   8239286975497580612             2             10
17   -445337111692715325             0              4
18   2766187446275090740             7             11
19   1908339160857512799            11              8
i, v = pd.factorize(interactions_df.personId)
d = dict(zip(i, v[i]))
print (d)
{0: -8845298781299428018, 1: -1032019229384696495, 2: -1130272294246983140,
3: 344280948527967603, 4: -445337111692715325, 5: -8763398617720485024,
6: 3609194402293569455, 7: 4254153380739593270, 8: 1908339160857512799,
9: 7781822014935525018, 10: 8239286975497580612, 11: 2766187446275090740}
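The two encoded columns in the printed frame number the IDs differently: the loop enumerates a set, whose iteration order is arbitrary, while factorize numbers values by first appearance. One way to check that both encodings still group the rows identically is sketched below, using the codes from the first nine rows of the output above. same_grouping is a hypothetical helper written for this illustration, not a pandas function.

```python
import pandas as pd

def same_grouping(a, b):
    """Return True if label sequences a and b partition the rows identically."""
    pairs = pd.DataFrame({'a': a, 'b': b}).drop_duplicates()
    # One-to-one relabeling: every label in a pairs with exactly one label
    # in b, and every label in b pairs with exactly one label in a
    return bool(pairs['a'].is_unique and pairs['b'].is_unique)

# Codes from rows 0-8 of the printed frame
set_based = [3, 5, 9, 6, 0, 10, 4, 8, 6]   # new_personId
factorized = [0, 1, 2, 3, 4, 5, 6, 7, 3]   # new_personId1

print(same_grouping(set_based, factorized))
```

If the check passes, the choice between the two numberings only matters when something downstream depends on the specific integer assigned to an ID.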
Performance:
interactions_df = pd.read_csv('./users_interactions.csv', usecols=['personId'])
#print (interactions_df)
In [243]: %timeit interactions_df['new_personId'] = pd.factorize(interactions_df.personId)[0]
2.03 ms ± 15.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)