Top-bottom pairing based on column values in a pandas DataFrame

Time: 2018-02-21 22:24:39

Tags: python pandas

I want to generate sector/group-wise pairs from a DataFrame based on the values in its Score column.

+---------+-------------------+---------+
|  Ticker |      Sector       |   Score |   
+---------+-------------------+---------+
|   ABC   |    Energy         |    3.5  |     
|   XYZ   |    Energy         |    4.5  |     
|   PQR   |    Tech           |    5.5  |     
|   MNP   |    Tech           |    1.5  |     
|   JKL   |    Energy         |   10.5  |     
|   BCA   |    Energy         |    8.5  |     
|   RDB   |    Tech           |    6.5  |
|   JMP   |    Tech           |    2.5  |
+---------+-------------------+---------+
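To make the example reproducible, the table above could be built as follows. This is only a sketch: the variable name `df` is an assumption, chosen to match the name used in the answers.

```python
import pandas as pd

# Sample data copied from the table above
df = pd.DataFrame({
    "Ticker": ["ABC", "XYZ", "PQR", "MNP", "JKL", "BCA", "RDB", "JMP"],
    "Sector": ["Energy", "Energy", "Tech", "Tech",
               "Energy", "Energy", "Tech", "Tech"],
    "Score": [3.5, 4.5, 5.5, 1.5, 10.5, 8.5, 6.5, 2.5],
})
```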

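In the example above, JKL/ABC would be one such pair in Energy, because JKL has the highest score and ABC the lowest score in that sector. Similarly, the next pair in Energy would be BCA/XYZ, because BCA is the second highest and XYZ the second lowest in that sector.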

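As a next step, I want to keep only those pairs within each sector whose score difference is greater than some threshold.

Thanks for your help.

The output could be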

+---------+-------------------+---------+
|  Ticker |      Sector       |  Result |   
+---------+-------------------+---------+
|   ABC   |    Energy         |    0    |     
|   XYZ   |    Energy         |    0    |     
|   PQR   |    Tech           |    1    |     
|   MNP   |    Tech           |    0    |     
|   JKL   |    Energy         |    1    |     
|   BCA   |    Energy         |    1    |     
|   RDB   |    Tech           |    1    |
|   JMP   |    Tech           |    0    |
+---------+-------------------+---------+

2 answers:

Answer 0: (score: 1)

Is this what you want?

(
    df.groupby('Sector')
    # per sector: tickers of the lowest and highest scorer, then the two scores
    .apply(lambda x: [df.Ticker.loc[x.Score.idxmin()],
                      df.Ticker.loc[x.Score.idxmax()],
                      x.Score.min(), x.Score.max()])
    .apply(pd.Series)
    .set_axis(['Low Ticker', 'High Ticker', 'Low', 'High'], axis=1)
    .assign(Diff = lambda x: x.High-x.Low)
)

Out[653]: 
          Low Ticker High Ticker  Low  High  Diff
Sector                                           
Energy           ABC         JKL  3.5  10.5   7.0
Utilities        MNP         RDB  1.5   6.5   5.0

You can then filter on the Diff column to keep, within each sector, only the pairs whose difference is greater than some threshold.
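A minimal sketch of that filtering step, assuming Diff is meant to be the score difference between the highest and lowest scorer of each sector. The names `summary`, `kept` and the value of `threshold` are illustrative assumptions, and the sample data uses the question's values with the second sector labelled Utilities, as in the output above.

```python
import pandas as pd

df = pd.DataFrame({
    "Ticker": ["ABC", "XYZ", "PQR", "MNP", "JKL", "BCA", "RDB", "JMP"],
    "Sector": ["Energy", "Energy", "Utilities", "Utilities",
               "Energy", "Energy", "Utilities", "Utilities"],
    "Score": [3.5, 4.5, 5.5, 1.5, 10.5, 8.5, 6.5, 2.5],
})

# Row labels of the lowest and highest scorer in each sector
idx_low = df.groupby("Sector")["Score"].idxmin()
idx_high = df.groupby("Sector")["Score"].idxmax()

# One row per sector: the two tickers and their scores
summary = pd.DataFrame({
    "Low Ticker": df.loc[idx_low, "Ticker"].to_numpy(),
    "High Ticker": df.loc[idx_high, "Ticker"].to_numpy(),
    "Low": df.loc[idx_low, "Score"].to_numpy(),
    "High": df.loc[idx_high, "Score"].to_numpy(),
}, index=idx_low.index)
summary["Diff"] = summary["High"] - summary["Low"]

threshold = 6.0  # hypothetical cutoff, not taken from the question
kept = summary[summary["Diff"] > threshold]
print(kept)
```

With this cutoff only the Energy pair (JKL/ABC, difference 7.0) is kept; the Utilities pair (RDB/MNP, difference 5.0) is dropped.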

Answer 1: (score: 0)

This is what I would do

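# sort by score so cumcount() numbers each sector's rows from lowest to highest
df=df.sort_values('Score')
# alternate 0/1 within each sector
df=df.assign(New=df.groupby('Sector').cumcount()%2)

# pair id: running sum of New, with the leading 0 (lowest scorer) remapped to len(x)//2
# so it joins the highest scorer; transform keeps the original row index
df=df.assign(New2=df.groupby('Sector').New.transform(lambda x: x.cumsum().replace(0, len(x)//2)))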


df.groupby(['Sector','New2']).Ticker.apply(list)
Out[1464]:
Sector     New2
Energy     1       [XYZ, BCA]
           2       [ABC, JKL]
Utilities  1       [JMP, PQR]
           2       [MNP, RDB]
Name: Ticker, dtype: object

Then

df['Result']=(df.Score==df.groupby(['Sector','New2']).Score.transform('max')).astype(int)
df.sort_index()
Out[1471]: 
  Ticker     Sector  Score  New  New2  Result
0    ABC     Energy    3.5    0     2       0
1    XYZ     Energy    4.5    1     1       0
2    PQR  Utilities    5.5    0     1       1
3    MNP  Utilities    1.5    0     2       0
4    JKL     Energy   10.5    1     2       1
5    BCA     Energy    8.5    0     1       1
6    RDB  Utilities    6.5    1     2       1
7    JMP  Utilities    2.5    1     1       0

Edit: added the diff, as requested by the OP

df['DIFF']=df.groupby(['Sector','New2']).Score.transform(lambda x : x.diff().bfill())
df.sort_index()
Out[1479]: 
  Ticker     Sector  Score  New  New2  Result  DIFF
0    ABC     Energy    3.5    0     2       0   7.0
1    XYZ     Energy    4.5    1     1       0   4.0
2    PQR  Utilities    5.5    0     1       1   3.0
3    MNP  Utilities    1.5    0     2       0   5.0
4    JKL     Energy   10.5    1     2       1   7.0
5    BCA     Energy    8.5    0     1       1   4.0
6    RDB  Utilities    6.5    1     2       1   5.0
7    JMP  Utilities    2.5    1     1       0   3.0
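The cumsum-based pair labels match the requested pairing for four-row sectors like the ones in this example. For larger sectors they appear to drift: with six rows, the second- and third-lowest scorers would end up sharing a label. One possible generalization, sketched below, ranks the rows within each sector and gives the k-th lowest and k-th highest the same pair id. The column name `Pair` and the value of `threshold` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Ticker": ["ABC", "XYZ", "PQR", "MNP", "JKL", "BCA", "RDB", "JMP"],
    "Sector": ["Energy", "Energy", "Utilities", "Utilities",
               "Energy", "Energy", "Utilities", "Utilities"],
    "Score": [3.5, 4.5, 5.5, 1.5, 10.5, 8.5, 6.5, 2.5],
})

threshold = 4.5  # hypothetical cutoff, not taken from the question

scores = df.groupby("Sector")["Score"]
# 0-based position of each row within its sector, lowest score first
rank = scores.rank(method="first").astype(int) - 1
size = scores.transform("size")
# The k-th lowest and the k-th highest row get the same pair id k
df["Pair"] = np.minimum(rank, size - 1 - rank)

pair_scores = df.groupby(["Sector", "Pair"])["Score"]
df["DIFF"] = pair_scores.transform("max") - pair_scores.transform("min")
# 1 marks the higher scorer of each pair, 0 the lower one
df["Result"] = (df["Score"] == pair_scores.transform("max")).astype(int)

# Keep only the rows whose pair difference exceeds the cutoff
kept = df[df["DIFF"] > threshold]
print(kept)
```

In a sector with an odd number of rows, the middle row is paired with itself under this scheme, so its DIFF is 0 and any positive threshold filters it out.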