I'm having trouble merging two pandas DataFrames.
I have two DataFrames similar to these:
teams:
date team_member_1 team_member_2
0 2017-11-21 1 6
1 2017-11-21 2 7
2 2017-11-21 3 8
3 2017-11-21 4 9
4 2017-11-21 5 10
5 2018-01-01 1 10
6 2018-01-01 2 9
7 2018-01-01 3 8
8 2018-01-01 4 7
9 2018-01-01 5 6
designations:
date designation ids
0 2017-11-21 a [1, 10]
1 2017-11-21 b [2, 9]
2 2017-11-21 c [3, 8]
3 2017-11-21 d [4, 7]
4 2017-11-21 e [5, 6]
5 2018-01-01 f [1, 2]
6 2018-01-01 g [3, 4]
7 2018-01-01 h [5, 6]
8 2018-01-01 i [7, 8]
9 2018-01-01 j [9, 10]
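Now I need to add a team_member_1_designation column to the teams table. My approach is to first explode the designations table into the form shown here, and then merge it with teams on date and the member id: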
date designation id
0 2017-11-21 a 1
1 2017-11-21 a 10
2 2017-11-21 b 2
3 2017-11-21 b 9
4 2017-11-21 c 3
5 2017-11-21 c 8
6 2017-11-21 d 4
7 2017-11-21 d 7
8 2017-11-21 e 5
9 2017-11-21 e 6
10 2018-01-01 f 1
11 2018-01-01 f 2
12 2018-01-01 g 3
13 2018-01-01 g 4
14 2018-01-01 h 5
15 2018-01-01 h 6
16 2018-01-01 i 7
17 2018-01-01 i 8
18 2018-01-01 j 9
19 2018-01-01 j 10
The code I wrote to explode the designations table is:
designations.set_index(designations.columns.drop('ids').tolist()).ids.apply(pd.Series).stack().reset_index().rename(columns={0: 'id'})
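But this explode operation takes a very long time when the table is large (imagine designations for 50,000 teams/team members per day, with teams spanning 20 years).
Is there a cheaper way to add the team_member_1_designation column to the teams table without exploding the designations table?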
Answer 0 (score: 0)
You can use map:
#create dictionary with keys created by tuples
z = zip(designations['date'], designations['designation'], designations['ids'])
d = {(i, x):j for i, j, k in z for x in k}
d = {('2017-11-21', 1): 'a', ('2017-11-21', 10): 'a', ('2017-11-21', 2): 'b',
('2017-11-21', 9): 'b', ('2017-11-21', 3): 'c', ('2017-11-21', 8): 'c',
('2017-11-21', 4): 'd', ('2017-11-21', 7): 'd', ('2017-11-21', 5): 'e',
('2017-11-21', 6): 'e', ('2018-01-01', 1): 'f', ('2018-01-01', 2): 'f',
('2018-01-01', 3): 'g', ('2018-01-01', 4): 'g', ('2018-01-01', 5): 'h',
('2018-01-01', 6): 'h', ('2018-01-01', 7): 'i', ('2018-01-01', 8): 'i',
('2018-01-01', 9): 'j', ('2018-01-01', 10): 'j'}
#convert 2 columns to tuples
s = pd.Series(list(map(tuple, teams[['date','team_member_1']].values.tolist())))
print (s)
0 (2017-11-21, 1)
1 (2017-11-21, 2)
2 (2017-11-21, 3)
3 (2017-11-21, 4)
4 (2017-11-21, 5)
5 (2018-01-01, 1)
6 (2018-01-01, 2)
7 (2018-01-01, 3)
8 (2018-01-01, 4)
9 (2018-01-01, 5)
dtype: object
teams['id'] = s.map(d)
print (teams)
date team_member_1 team_member_2 id
0 2017-11-21 1 6 a
1 2017-11-21 2 7 b
2 2017-11-21 3 8 c
3 2017-11-21 4 9 d
4 2017-11-21 5 10 e
5 2018-01-01 1 10 f
6 2018-01-01 2 9 f
7 2018-01-01 3 8 g
8 2018-01-01 4 7 g
9 2018-01-01 5 6 h
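As a self-contained illustration of the same lookup idea, here is a minimal sketch on a three-row sample taken from the question's data. It is not the answer's exact code: it uses a plain `dict.get` inside a list comprehension instead of `Series.map`, and it names the new columns `team_member_1_designation` and `team_member_2_designation`. The variable name `lookup` is an assumption made for this example.

```python
import pandas as pd

# Small sample taken from the question's tables (dates kept as plain strings)
teams = pd.DataFrame({
    'date': ['2017-11-21', '2017-11-21', '2018-01-01'],
    'team_member_1': [1, 5, 1],
    'team_member_2': [6, 10, 10],
})
designations = pd.DataFrame({
    'date': ['2017-11-21', '2017-11-21', '2018-01-01'],
    'designation': ['a', 'e', 'f'],
    'ids': [[1, 10], [5, 6], [1, 2]],
})

# (date, member id) -> designation, built in one pass over designations
lookup = {
    (date, member): name
    for date, name, ids in zip(designations['date'],
                               designations['designation'],
                               designations['ids'])
    for member in ids
}

# Look up each (date, member) pair; pairs without a designation become missing
for col in ['team_member_1', 'team_member_2']:
    teams[col + '_designation'] = [
        lookup.get(key) for key in zip(teams['date'], teams[col])
    ]

print(teams)
```

The dictionary holds one entry per (date, member) pair, so on very large inputs it is worth checking that it fits in memory before preferring it over a merge.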
If you need a performant solution, I think .apply(pd.Series) is not recommended.
It is better to use the DataFrame constructor:
cols = designations.columns.difference(['ids']).tolist()
df1 = designations.set_index(cols)['ids']
df2 = (pd.DataFrame(df1.values.tolist(), index=df1.index)
         .stack()
         .reset_index(level=-1, drop=True)  # drop the position-in-list level
         .reset_index(name='id'))
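For reference, newer pandas versions (0.25 and later) ship a built-in `DataFrame.explode` that performs this flattening directly. It is not part of the original answer, so treat the following as a hedged sketch on a small sample. The name `exploded` is made up for this example.

```python
import pandas as pd

designations = pd.DataFrame({
    'date': ['2017-11-21', '2017-11-21', '2018-01-01'],
    'designation': ['a', 'e', 'f'],
    'ids': [[1, 10], [5, 6], [1, 2]],
})

# One output row per list element; the source row index is repeated,
# so reset it to get a clean 0..n-1 index
exploded = (designations.explode('ids')
                        .rename(columns={'ids': 'id'})
                        .reset_index(drop=True))

# explode leaves an object column, so cast back to integers before merging
exploded['id'] = exploded['id'].astype('int64')

print(exploded)
```

Whether this is faster than the constructor-based version depends on the data and the pandas version, so it is worth timing both on a realistic sample.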
Or a numpy solution:
from itertools import chain
idx = designations.index.repeat(designations['ids'].str.len())
df2 = (designations.reindex(idx)
                   .assign(id=list(chain.from_iterable(designations['ids'].tolist())))
                   .drop('ids', axis=1))
teams = teams.merge(df2.rename(columns={'id':'team_member_1'}),
                    on=['date','team_member_1'],
                    how='left')
print (teams)
date team_member_1 team_member_2 designation
0 2017-11-21 1 6 a
1 2017-11-21 2 7 b
2 2017-11-21 3 8 c
3 2017-11-21 4 9 d
4 2017-11-21 5 10 e
5 2018-01-01 1 10 f
6 2018-01-01 2 9 f
7 2018-01-01 3 8 g
8 2018-01-01 4 7 g
9 2018-01-01 5 6 h
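The flatten-then-merge steps can also be put together in one self-contained sketch that names the new column `team_member_1_designation`, as the question asks. The sample is a three-row subset of the question's data, and `flat` and `result` are names made up for this example. The `validate='many_to_one'` argument is an optional safeguard that is not in the original answer. It makes the merge raise an error if a (date, id) pair appears more than once in the flattened table, which would otherwise silently duplicate rows in teams.

```python
from itertools import chain

import numpy as np
import pandas as pd

teams = pd.DataFrame({
    'date': ['2017-11-21', '2017-11-21', '2018-01-01'],
    'team_member_1': [1, 5, 1],
    'team_member_2': [6, 10, 10],
})
designations = pd.DataFrame({
    'date': ['2017-11-21', '2017-11-21', '2018-01-01'],
    'designation': ['a', 'e', 'f'],
    'ids': [[1, 10], [5, 6], [1, 2]],
})

# Repeat each scalar column once per element of its ids list,
# and flatten the lists themselves with chain
lengths = designations['ids'].str.len().to_numpy()
flat = pd.DataFrame({
    'date': np.repeat(designations['date'].to_numpy(), lengths),
    'designation': np.repeat(designations['designation'].to_numpy(), lengths),
    'id': list(chain.from_iterable(designations['ids'])),
})

# Left merge keeps every teams row; rename so the key and output columns line up
result = teams.merge(
    flat.rename(columns={'id': 'team_member_1',
                         'designation': 'team_member_1_designation'}),
    on=['date', 'team_member_1'],
    how='left',
    validate='many_to_one',
)

print(result)
```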