Python: conditional arithmetic on pandas columns

Asked: 2017-11-30 21:40:00

Tags: python-3.x pandas

My data looks like this:

df = 
contact_id  date    answer
         1  1-May   9.00-10.00
         1  2-May   9.00-10.00
         1  3-May   11.00-12.00
         1  4-May   14.00-15.00
         1  5-May   10.00-11.00
         2  2-May   9.00-10.00
         2  3-May   9.00-10.00
         3  1-May   14.00-15.00
         3  2-May   10.00-11.00

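I want the data to look like this, where each time-slot column counts how often that contact gave that answer on earlier rows: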

df = 
    contact_id   date   9.00-10.00  10.00-11.00 11.00-12.00 14.00-15.00  answer
             1  1-May            0            0            0          0  9.00-10.00
             1  2-May            1            0            0          0  9.00-10.00
             1  3-May            2            0            0          0  11.00-12.00
             1  4-May            2            0            1          0  14.00-15.00
             1  5-May            2            0            1          1  10.00-11.00
             2  2-May            0            0            0          0  9.00-10.00
             2  3-May            1            0            0          0  9.00-10.00
             3  1-May            0            0            0          0  14.00-15.00
             3  2-May            0            0            0          1  10.00-11.00

[UPDATE] I have come up with some logic that seems to work but is not entirely correct. I would appreciate any suggestions. Here is my code:

import pandas as pd

df1 = pd.concat([df, pd.get_dummies(df["answer"])], axis=1)
df1["target"] = df1["answer"]
df1.drop("answer", inplace=True, axis=1)

tar_list =  df1["target"].unique().tolist()

def arith(x):
    NinetoTenCounter = 0
    EleventoTwelveCounter = 0
    TwotoThreeCounter = 0
    TentoElevenCounter = 0

    for i in range(len(x["target"])):
        print(i)
        if x.iloc[i, -1] == "9.00-10.00":
            NinetoTenCounter += 1
            x.loc[i, '9.00-10.00'] = NinetoTenCounter - 1
            x.loc[i, '10.00-11.00'] = TentoElevenCounter
            x.loc[i, '11.00-12.00'] = EleventoTwelveCounter
            x.loc[i, '14.00-15.00'] = TwotoThreeCounter

        elif x.iloc[i, -1] == "11.00-12.00":
            EleventoTwelveCounter += 1
            x.loc[i, '9.00-10.00'] = NinetoTenCounter
            x.loc[i, '10.00-11.00'] = TentoElevenCounter
            x.loc[i, '11.00-12.00'] = EleventoTwelveCounter - 1
            x.loc[i, '14.00-15.00'] = TwotoThreeCounter

        elif x.iloc[i, -1] == "14.00-15.00":
            TwotoThreeCounter += 1
            x.loc[i, '9.00-10.00'] = NinetoTenCounter
            x.loc[i, '10.00-11.00'] = TentoElevenCounter
            x.loc[i, '11.00-12.00'] = EleventoTwelveCounter
            x.loc[i, '14.00-15.00'] = TwotoThreeCounter - 1

        else:
            TentoElevenCounter += 1
            x.loc[i, '9.00-10.00'] = NinetoTenCounter
            x.loc[i, '10.00-11.00'] = TentoElevenCounter - 1
            x.loc[i, '11.00-12.00'] = EleventoTwelveCounter
            x.loc[i, '14.00-15.00'] = TwotoThreeCounter

    return x

df1_arith = df1.groupby("contact_id").apply(arith)

The output I get is as follows:

              contact_id   date  10.00-11.00  11.00-12.00  14.00-15.00  \
contact_id                                                               
1          0         1.0  1-may          0.0          0.0          0.0   
           1         1.0  2-may          0.0          0.0          0.0   
           2         1.0  3-may          0.0          0.0          0.0   
           3         1.0  4-may          0.0          1.0          0.0   
           4         1.0  5-may          0.0          1.0          1.0   
2          5         2.0  2-may          0.0          0.0          0.0   
           6         2.0  3-may          0.0          0.0          0.0   
           0         NaN    NaN          0.0          0.0          0.0   
           1         NaN    NaN          0.0          0.0          0.0   
3          7         3.0  1-may          0.0          0.0          1.0   
           8         3.0  2-may          1.0          0.0          0.0   
           0         NaN    NaN          0.0          0.0          0.0   
           1         NaN    NaN          0.0          0.0          1.0   

              9.00-10.00       target  
contact_id                             
1          0         0.0   9.00-10.00  
           1         1.0   9.00-10.00  
           2         2.0  11.00-12.00  
           3         2.0  14.00-15.00  
           4         2.0  10.00-11.00  
2          5         1.0   9.00-10.00  
           6         1.0   9.00-10.00  
           0         0.0          NaN  
           1         1.0          NaN  
3          7         0.0  14.00-15.00  
           8         0.0  10.00-11.00  
           0         0.0          NaN  
           1         0.0          NaN 

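The function seems to work for the first group (contact_id 1), but not entirely for the other contact_ids. So my questions are:

1) Why are NaN rows being inserted, and how can I prevent that?
2) Is this the right way to apply a function to each group?
3) What is the right way to apply a function to each group?

If I have missed anything, please let me know. Thanks a lot, as always.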
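On question 1, a likely cause can be reproduced in isolation. Inside `arith`, each group keeps its original row labels (5 and 6 for contact_id 2), while `range(len(...))` yields positions 0 and 1. Writing with `x.loc[i, ...]` treats `i` as a label, and assigning to a label that does not exist appends a new row. Group 1 only works because its labels happen to equal its positions. The sketch below uses a hypothetical two-row frame and a hypothetical `count` column to stand in for one group; it is an illustration of the mechanism, not a fix of the full function.

```python
import pandas as pd

# Hypothetical stand-in for the sub-frame groupby passes for contact_id 2:
# it keeps the original row labels 5 and 6 rather than starting at 0.
group = pd.DataFrame(
    {"target": ["9.00-10.00", "9.00-10.00"], "count": [0, 0]},
    index=[5, 6],
)

# Label-based write with a positional counter: labels 0 and 1 do not exist,
# so pandas appends them as new rows and fills the other columns with NaN.
enlarged = group.copy()
for i in range(len(enlarged)):
    enlarged.loc[i, "count"] = i

# Position-based write: look up the column position once and use iloc,
# which never creates rows.
fixed = group.copy()
count_pos = fixed.columns.get_loc("count")
for i in range(len(fixed)):
    fixed.iloc[i, count_pos] = i

print(enlarged)
print(fixed)
```

The first frame ends up with four rows, two of them with NaN in `target`, which matches the shape of the extra rows in the output above. The second keeps its two rows.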

3 Answers:

Answer 0 (score: 2)

IIUC:

In [25]: (df.assign(col=df['answer'])
            .pivot_table(index=['contact_id','date','answer'], 
                         columns='col', aggfunc='size')
            .fillna(0)
            .reset_index())
Out[25]:
col  contact_id   date       answer  10.00-11.00  11.00-12.00  14.00-15.00  9.00-10.00
0             1  1-May   9.00-10.00          0.0          0.0          0.0         1.0
1             1  2-May   9.00-10.00          0.0          0.0          0.0         1.0
2             1  3-May  11.00-12.00          0.0          1.0          0.0         0.0
3             1  4-May  14.00-15.00          0.0          0.0          1.0         0.0
4             1  5-May  10.00-11.00          1.0          0.0          0.0         0.0
5             2  2-May   9.00-10.00          0.0          0.0          0.0         1.0
6             2  3-May   9.00-10.00          0.0          0.0          0.0         1.0
7             3  1-May  14.00-15.00          0.0          0.0          1.0         0.0
8             3  2-May  10.00-11.00          1.0          0.0          0.0         0.0

Answer 1 (score: 2)

pd.concat([df, pd.get_dummies(df.answer)], axis=1)
Out[1272]: 
   contact_id   date       answer  10.00-11.00  11.00-12.00  14.00-15.00  9.00-10.00
0           1  1-May   9.00-10.00            0            0            0           1
1           1  2-May   9.00-10.00            0            0            0           1
2           1  3-May  11.00-12.00            0            1            0           0
3           1  4-May  14.00-15.00            0            0            1           0
4           1  5-May  10.00-11.00            1            0            0           0
5           2  2-May   9.00-10.00            0            0            0           1
6           2  3-May   9.00-10.00            0            0            0           1
7           3  1-May  14.00-15.00            0            0            1           0
8           3  2-May  10.00-11.00            1            0            0           0
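The indicator columns above mark only the current row's answer, whereas the desired table in the question counts how many times each answer occurred on earlier rows of the same contact. One possible way to bridge that gap, sketched here under the assumption that rows are already in date order within each contact_id, is to take a per-contact cumulative sum of the indicators and subtract the current row. The variable names `dummies`, `prior` and `result` are illustrative and do not come from the answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "contact_id": [1, 1, 1, 1, 1, 2, 2, 3, 3],
    "date": ["1-May", "2-May", "3-May", "4-May", "5-May",
             "2-May", "3-May", "1-May", "2-May"],
    "answer": ["9.00-10.00", "9.00-10.00", "11.00-12.00", "14.00-15.00",
               "10.00-11.00", "9.00-10.00", "9.00-10.00", "14.00-15.00",
               "10.00-11.00"],
})

# One 0/1 indicator column per distinct answer (cast to int because
# recent pandas versions return booleans here).
dummies = pd.get_dummies(df["answer"]).astype(int)

# Running total per contact, minus the current row, gives the number of
# earlier rows with each answer. Assumes rows are in chronological order.
prior = dummies.groupby(df["contact_id"]).cumsum() - dummies

# Reassemble in the layout requested in the question.
result = pd.concat([df[["contact_id", "date"]], prior, df["answer"]], axis=1)
print(result)
```

Because everything is vectorised, no per-group Python loop or `apply` is needed, and the original index is preserved, so no extra rows can appear.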

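Answer 2 (score: 0)

I managed to solve my problem in a rather elegant way. If anyone needs it for reference, leave a comment here and I will share my solution. Thanks.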