In pandas, how to perform operations between columns with maximum performance

Asked: 2017-01-16 14:56:43

Tags: python pandas

I have the following df:

     usersidid  clienthostid    LoginDaysSumLastMonth   LoginDaysSumLast7Days LoginDaysSum
0       9            1                50                          7              1728
1       3            1                43                          3              1331
2       6            1                98                          9               216
3       4            1                10                          6                64
4       9            2                64                          32              343
5       12           3                45                          43             1000
6       8            3                87                          76              512
7       9            3                16                          3              1200
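For anyone who wants to reproduce the problem, the table above can be rebuilt roughly as follows. This is a minimal sketch: the values are copied from the table, and the default integer index is assumed to match the row labels 0 to 7.

```python
import pandas as pd

# Sample data copied from the table above; `df` is the name used in the question.
df = pd.DataFrame({
    "usersidid": [9, 3, 6, 4, 9, 12, 8, 9],
    "clienthostid": [1, 1, 1, 1, 2, 3, 3, 3],
    "LoginDaysSumLastMonth": [50, 43, 98, 10, 64, 45, 87, 16],
    "LoginDaysSumLast7Days": [7, 3, 9, 6, 32, 43, 76, 3],
    "LoginDaysSum": [1728, 1331, 216, 64, 343, 1000, 512, 1200],
})
print(df)
```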

What I want to do is:

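For each 'clienthostid', find the 'usersidid' with the highest 'LoginDaysSum'. Then check whether the same usersidid has the highest LoginDaysSum in two or more different clienthostid values (for example, usersidid = 9 has the highest LoginDaysSum in clienthostid 1, 2 and 3, in rows 0, 4 and 7 respectively).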

In that case, I want to pick the row with the highest LoginDaysSum (in the example, the row with 1728). Let's call it maxRT.

I want to compute the ratio of LoginDaysSumLast7Days between maxRT and each of the other rows (in the example, the rows with index 7 and 4).

If the ratio is below 0.8, I want to drop the row:

index 4 - LoginDaysSumLast7Days_ratio = 7/32 < 0.8 // row will be dropped!

index 7 - LoginDaysSumLast7Days_ratio = 7/3 > 0.8 // row will stay!

The same condition applies to LoginDaysSumLastMonth.
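To make the rule concrete, here is the arithmetic for the sample rows in plain Python. It is only an illustration of the condition, not a solution for the full frame. The helper name `keeps_row` is hypothetical, and the numbers are copied from the table (maxRT is row 0).

```python
# Values copied from the sample table: maxRT is row 0 (usersidid 9, clienthostid 1).
max_rt = {"LoginDaysSumLast7Days": 7, "LoginDaysSumLastMonth": 50}

# The other rows where usersidid 9 is the top user of a clienthostid.
others = {
    4: {"LoginDaysSumLast7Days": 32, "LoginDaysSumLastMonth": 64},
    7: {"LoginDaysSumLast7Days": 3, "LoginDaysSumLastMonth": 16},
}
THRESHOLD = 0.8


def keeps_row(max_row, row, threshold=THRESHOLD):
    """Return True when every maxRT/row ratio is at least the threshold."""
    return all(max_row[col] / row[col] >= threshold for col in max_row)


for idx, row in others.items():
    print(idx, "stays" if keeps_row(max_rt, row) else "dropped")
```

Row 4 fails on both columns (7/32 and 50/64 are below 0.8), while row 7 passes on both (7/3 and 50/16).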

So, for the example, the result would be:

     usersidid  clienthostid    LoginDaysSumLastMonth   LoginDaysSumLast7Days LoginDaysSum
0       9            1                50                          7              1728
1       3            1                43                          3              1331
2       6            1                98                          9               216
3       4            1                10                          6                64
5       12           3                45                          43             1000
6       8            3                87                          76              512
7       9            3                16                          3              1200

Now, performance is critical. I tried to implement this with .apply, but not only could I not get it to work correctly, it also ran far too slowly :(

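My code so far (please forgive how wrong it is; I only started working with SQL, Pandas and Python last week, and everything I have learned comes from examples I found here ^_^):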

df_client_Logindayssum_pairs = df.merge(df.groupby(['clienthostid'], as_index=False, sort=False)['LoginDaysSum'].max(),df, how='inner', on=['clienthostid', 'LoginDaysSum'])
UsersWithMoreThan1client = df_client_Logindayssum_pairs.groupby(['usersidid'], as_index=False, sort=False)['LoginDaysSum'].count().rename(columns={'LoginDaysSum': 'NumOfClientsPerUesr'})
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.NumOfClientsPerUesr >= 2]
UsersWithMoreThan1client = df_client_Logindayssum_pairs[df_client_Logindayssum_pairs.usersidid.isin(UsersWithMoreThan1Device.loc[:, 'usersidid'])].reset_index(drop=True)
UsersWithMoreThan1client = UsersWithMoreThan1client.sort_values(['clienthostid', 'LoginDaysSum'], ascending=[True, False], inplace=True)
UsersWithMoreThan1client = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSumLast7Days'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio')
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.ratio > 0.8]
UsersWithMoreThan1client = ttm.groupby(['clienthostid'], sort=False)['LoginDaysSumLastMonth'].apply(lambda x: x.iloc[0] / x.iloc[1]).reset_index(name='ratio2')
UsersWithMoreThan1client = UsersWithMoreThan1client[UsersWithMoreThan1client.ratio2 > 0.8]

Any suggestions on how to do this would be greatly appreciated.

Thanks

1 answer:

Answer 0 (score: 2)

I believe this is what you need:

# Put the index as a regular column
data = data.reset_index()
# Find the row with the greatest LoginDaysSum for each clienthostid
agg1 = data.sort_values(by='LoginDaysSum', ascending=False).groupby(['clienthostid']).first()
# Among those rows, keep the one with the greatest LoginDaysSum for each usersidid
agg2 = agg1.sort_values(by='LoginDaysSum', ascending=False).groupby('usersidid').first()
# Join both previous aggregations
joined = agg1.set_index('usersidid').join(agg2, rsuffix='_max')
# Compute ratios
joined['LoginDaysSumLast7Days_ratio'] = joined['LoginDaysSumLast7Days_max'] / joined['LoginDaysSumLast7Days']
joined['LoginDaysSumLastMonth_ratio'] = joined['LoginDaysSumLastMonth_max'] / joined['LoginDaysSumLastMonth']
# Select index values that do not meet the required criteria
rem_idx = joined[(joined['LoginDaysSumLast7Days_ratio'] < 0.8) | (joined['LoginDaysSumLastMonth_ratio'] < 0.8)]['index']
# Restore index and remove the selected rows
data = data.set_index('index').drop(rem_idx)

The result in data is:

       usersidid  clienthostid  LoginDaysSumLastMonth  LoginDaysSumLast7Days    LoginDaysSum  
index                                                                                         
0              9             1                     50                      7            1728  
1              3             1                     43                      3            1331  
2              6             1                     98                      9             216  
3              4             1                     10                      6              64  
5             12             3                     45                     43            1000  
6              8             3                     87                     76             512  
7              9             3                     16                      3            1200
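As a possible variation on the same idea (not the code above, and not benchmarked), the two sort-and-first steps could be replaced with `groupby(...).idxmax()` and the join with `Series.map`, which keeps the original index in place throughout. The sketch below assumes the sample data from the question, a unique index, and no zero or missing values in the two ratio columns. The names `top`, `max_rt`, `drop_mask` and `result` are illustrative only.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "usersidid": [9, 3, 6, 4, 9, 12, 8, 9],
    "clienthostid": [1, 1, 1, 1, 2, 3, 3, 3],
    "LoginDaysSumLastMonth": [50, 43, 98, 10, 64, 45, 87, 16],
    "LoginDaysSumLast7Days": [7, 3, 9, 6, 32, 43, 76, 3],
    "LoginDaysSum": [1728, 1331, 216, 64, 343, 1000, 512, 1200],
})

# Row with the highest LoginDaysSum in each clienthostid (rows 0, 4 and 7 here).
top = df.loc[df.groupby("clienthostid")["LoginDaysSum"].idxmax()]

# Per user, the top row with the highest LoginDaysSum (maxRT), keyed by usersidid.
max_rt = top.loc[top.groupby("usersidid")["LoginDaysSum"].idxmax()].set_index("usersidid")

# Flag top rows where maxRT's value divided by the row's own value is below 0.8.
threshold = 0.8
drop_mask = pd.Series(False, index=top.index)
for col in ["LoginDaysSumLast7Days", "LoginDaysSumLastMonth"]:
    ratio = top["usersidid"].map(max_rt[col]) / top[col]
    drop_mask |= ratio < threshold

# Drop the flagged rows from the original frame.
result = df.drop(top.index[drop_mask.to_numpy()])
print(result)
```

The maxRT row is compared with itself as well, but its ratio is always 1, so it is never dropped. Whether this is faster than the sort-based version depends on the data and should be measured on the real frame.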