Pandas: broadcast a value within groups given a logical condition

Asked: 2016-09-27 12:22:56

Tags: python pandas transform

I have a DataFrame like the following example:

key1   key2     value1 
1      201501     NaN      
1      201502     NaN     
1      201503     201503      
1      201504     NaN      
2      201507     NaN
2      201508     NaN 
2      201509     NaN
3      201509     NaN 
3      201510     201509
3      201511     NaN
3      201512     NaN 
3      201601     NaN

I would like the following output:

key1   key2     value1     value2
1      201501     NaN      0
1      201502     NaN      0
1      201503     201503   1   
1      201504     NaN      1
2      201507     NaN      0        
2      201508     NaN      0
2      201509     NaN      0
3      201509     NaN      0
3      201510     201509   1
3      201511     NaN      1
3      201512     NaN      1
3      201601     NaN      1

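The output, value2, is just a binary flag: once value1 contains a yyyymm stamp, value2 should become 1 and stay 1 for the remaining rows of that key1 group. On the rows before the stamp it should be 0. If a key1 group contains only np.nan, it should be 0 throughout, as is the case for key1 = 2.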

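I have tried apply with a lambda, but it is really slow. I am hoping someone can give me a hint on how to broadcast this in a more vectorized way to save some execution time.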

Code for the df is below!

Many thanks for your time and input!

Best regards,

/ swepab

import numpy as np
import pandas as pd

df = pd.DataFrame({'key1' : [1,1,1,1,2,2,2,3,3,3,3,3]
              ,'key2' : [201501, 201502,201503,201504,201507,201508,201509,201509,201510,201511,201512,201601]
              ,'value1' : [np.nan,np.nan,'201503',np.nan,np.nan,np.nan,np.nan,np.nan,'201509',np.nan,np.nan,np.nan]
              ,'value2' : [0,0,1,1,0,0,0,0,1,1,1,1]})

2 answers:

Answer 0 (score: 0)

IIUC, you need ffill:

df['value2'] = df.groupby('key1')['value1'].ffill()
df.value2 = np.where(df.value2.notnull(),1,0)
print (df)
    key1    key2  value1  value2
0      1  201501     NaN       0
1      1  201502     NaN       0
2      1  201503  201503       1
3      1  201504     NaN       1
4      2  201507     NaN       0
5      2  201508     NaN       0
6      2  201509     NaN       0
7      3  201509     NaN       0
8      3  201510  201509       1
9      3  201511     NaN       1
10     3  201512     NaN       1
11     3  201601     NaN       1
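
The two steps above can also be wrapped up and checked against the expected flags from the question. This is only a sketch: the helper name flag_by_ffill is made up for illustration, and it assumes the grouping column is called key1 and the stamp column value1, as in the question.

```python
import numpy as np
import pandas as pd

# Sample data from the question (key2 is left out; it is not needed for the flag).
df = pd.DataFrame({
    'key1': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
    'value1': [np.nan, np.nan, '201503', np.nan, np.nan, np.nan, np.nan,
               np.nan, '201509', np.nan, np.nan, np.nan],
})

def flag_by_ffill(frame):
    """Hypothetical helper: 1 from the first stamp onward within each key1 group."""
    # Forward-fill inside each group so every row after a stamp is non-null.
    filled = frame.groupby('key1')['value1'].ffill()
    # Non-null after the fill means a stamp was seen on this row or earlier.
    return filled.notnull().astype(int)

ffill_flags = flag_by_ffill(df)
df['value2'] = ffill_flags
```

Because ffill never crosses a group boundary here, a group without any stamp (key1 = 2) stays all NaN and therefore all 0.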

Answer 1 (score: 0)

You can do it like this:

df['value2'] = df.groupby('key1')['value1'].transform(lambda x: (~pd.isnull(x)).cumsum())

In [50]: df
Out[50]:
    key1    key2  value1  value2
0      1  201501     NaN       0
1      1  201502     NaN       0
2      1  201503  201503       1
3      1  201504     NaN       1
4      2  201507     NaN       0
5      2  201508     NaN       0
6      2  201509     NaN       0
7      3  201509     NaN       0
8      3  201510  201509       1
9      3  201511     NaN       1
10     3  201512     NaN       1
11     3  201601     NaN       1
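
Since the question asks for something without a lambda, here is one possible variant. It is a sketch, not part of either answer: the helper name flag_by_cummax is invented for illustration. It takes a per-group running maximum of a 0/1 "has stamp" column, so the flag stays at 1 even if a group contains several stamps, whereas a running sum would keep counting.

```python
import numpy as np
import pandas as pd

def flag_by_cummax(values, keys):
    """Hypothetical helper: 1 from the first non-null value onward within each group."""
    # 1 where a stamp is present, 0 where the value is missing.
    has_stamp = values.notnull().astype(int)
    # The running maximum per group turns 1 at the first stamp and never drops back.
    return has_stamp.groupby(keys).cummax()

# Sample data from the question.
df = pd.DataFrame({
    'key1': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3],
    'value1': [np.nan, np.nan, '201503', np.nan, np.nan, np.nan, np.nan,
               np.nan, '201509', np.nan, np.nan, np.nan],
})
cummax_flags = flag_by_cummax(df['value1'], df['key1'])

# A made-up group with two stamps: the flag stays binary.
multi = pd.DataFrame({'key1': [1, 1, 1, 1],
                      'value1': [np.nan, '201601', np.nan, '201603']})
multi_flags = flag_by_cummax(multi['value1'], multi['key1'])
```

Both steps use built-in grouped operations, so no Python function is called per group.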