Here is a sample DataFrame:
id Section A B
0 abc foo 0.1 0.6
1 abc foo 0.2 0.3
2 abc bar 0.5 0.1
3 def foo 0.1 0.1
4 def bar 0.1 0.3
5 def bar 0.6 0.1
6 ghj foo 0.3 0.1
7 ghj foo 0.1 0.7
8 ghj bar 0.1 0.2
I want to create two new columns, df['AA'] and df['BB'], from the following lists:
A_foo = [0.1,2]
A_bar = [1,0.3]
B_foo = [0.4,0.2]
B_bar = [1.2,0.5]
This is what I have tried so far:
g = df.groupby('id')[['A', 'B']]
for i, i_d in g:
    print(i_d)
Note: the length of A_foo, A_bar, B_foo and B_bar is always greater than or equal to the length of df[df.Section == 'foo'] and df[df.Section == 'bar'] for any unique id.
To create df['AA'], for each 'foo' and 'bar' in df['Section'] within each id, I want to take the corresponding values from A_foo and A_bar.
For example, in the first i_d (id = 'abc'), df.Section has two 'foo' and one 'bar', so the first three rows of df['AA'] will be
[0.1, 2, 1, ...]: 0.1 and 2 from A_foo and 1 from A_bar.
Then in the second i_d (id = 'def'), df.Section has one 'foo' and two 'bar', so I need to add 0.1 from A_foo and 1, 0.3 from A_bar.
Now df['AA'] will look like [0.1, 2, 1, 0.1, 1, 0.3, ...].
From the last i_d (id = 'ghj'), I will collect 0.1, 2 from A_foo and 1 from A_bar.
The complete column is now
df['AA'] = [0.1, 2, 1, 0.1, 1, 0.3, 0.1, 2, 1]
Similarly, df['BB'] is created from B_foo and B_bar:
df['BB'] = [0.4, 0.2, 1.2, 0.4, 1.2, 0.5, 0.4, 0.2, 1.2]
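The rule described above can be sketched as a plain loop. This is only an illustration of the intended mapping, not code from the original post: `ids` and `sections` are plain lists standing in for df['id'] and df['Section'], and `fill_by_occurrence` is a hypothetical helper name.

```python
# Plain lists standing in for df['id'] and df['Section'] (assumption for illustration)
ids = ["abc", "abc", "abc", "def", "def", "def", "ghj", "ghj", "ghj"]
sections = ["foo", "foo", "bar", "foo", "bar", "bar", "foo", "foo", "bar"]

A_foo, A_bar = [0.1, 2], [1, 0.3]
B_foo, B_bar = [0.4, 0.2], [1.2, 0.5]


def fill_by_occurrence(ids, sections, foo_values, bar_values):
    """The k-th 'foo' (or 'bar') row of each id gets foo_values[k] (or bar_values[k])."""
    lookup = {"foo": foo_values, "bar": bar_values}
    seen = {}  # (id, section) -> number of matching rows seen so far
    out = []
    for row_id, section in zip(ids, sections):
        k = seen.get((row_id, section), 0)
        out.append(lookup[section][k])  # raises IndexError if the list is too short
        seen[(row_id, section)] = k + 1
    return out


AA = fill_by_occurrence(ids, sections, A_foo, A_bar)
BB = fill_by_occurrence(ids, sections, B_foo, B_bar)
```

The counter restarts for every (id, Section) pair, which is why each id begins again at the first element of A_foo and A_bar.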
Here is the final df:
id Section A B AA BB
0 abc foo 0.1 0.6 0.1 0.4
1 abc foo 0.2 0.3 2.0 0.2
2 abc bar 0.5 0.1 1.0 1.2
3 def foo 0.1 0.1 0.1 0.4
4 def bar 0.1 0.3 1.0 1.2
5 def bar 0.6 0.1 0.3 0.5
6 ghj foo 0.3 0.1 0.1 0.4
7 ghj foo 0.1 0.7 2.0 0.2
8 ghj bar 0.1 0.2 1.0 1.2
Answer 0 (score: 2)
Use groupby + cumcount to create a per-group index, then use np.select to assign values from the corresponding list.
import numpy as np

# Position of each row within its (id, Section) group: 0, 1, 2, ...
df['idx'] = df.groupby(['id', 'Section']).cumcount()

conds = [df.Section.eq('foo'), df.Section.eq('bar')]
# For every row, look up the idx-th value of each candidate list
AA_choice = [np.array(A_foo)[df.idx], np.array(A_bar)[df.idx]]
BB_choice = [np.array(B_foo)[df.idx], np.array(B_bar)[df.idx]]

# Rows matching neither condition become NaN
df['AA'] = np.select(conds, AA_choice, default=np.nan)
df['BB'] = np.select(conds, BB_choice, default=np.nan)
id Section A B idx AA BB
0 abc foo 0.1 0.6 0 0.1 0.4
1 abc foo 0.2 0.3 1 2.0 0.2
2 abc bar 0.5 0.1 0 1.0 1.2
3 def foo 0.1 0.1 0 0.1 0.4
4 def bar 0.1 0.3 0 1.0 1.2
5 def bar 0.6 0.1 1 0.3 0.5
6 ghj foo 0.3 0.1 0 0.1 0.4
7 ghj foo 0.1 0.7 1 2.0 0.2
8 ghj bar 0.1 0.2 0 1.0 1.2
If the lists are not long enough, you will get an IndexError. In that case, consider wrapping the index with the modulo operator: np.array(A_foo)[df.idx % len(A_foo)]
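As a minimal sketch of that wrap-around, assume a hypothetical frame in which id 'abc' has three 'foo' rows while A_foo holds only two values. The data below is invented purely to trigger the short-list case; `.to_numpy()` is used so NumPy receives a plain integer array.

```python
import numpy as np
import pandas as pd

# Hypothetical data: three 'foo' rows for one id, but only two values in A_foo
df = pd.DataFrame({
    "id": ["abc", "abc", "abc", "abc"],
    "Section": ["foo", "foo", "foo", "bar"],
})
A_foo = [0.1, 2]
A_bar = [1, 0.3]

# Position of each row within its (id, Section) group
df["idx"] = df.groupby(["id", "Section"]).cumcount()
idx = df["idx"].to_numpy()

conds = [df.Section.eq("foo"), df.Section.eq("bar")]
# The modulo wraps any out-of-range position back to the start of the list
choices = [
    np.array(A_foo)[idx % len(A_foo)],
    np.array(A_bar)[idx % len(A_bar)],
]
df["AA"] = np.select(conds, choices, default=np.nan)
```

Here the third 'foo' row has idx 2, which wraps to position 0 and reuses the first value of A_foo. Wrapping reuses values silently, so it only fits cases where cycling through the list is the intended behaviour.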