Find the maximum pair from two columns while keeping the data frame intact

Time: 2020-09-22 00:36:54

Tags: r dataframe plyr data-wrangling

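I have a data frame, and I want to find the largest pair based on two columns. However, when I group the data frame, even small differences in the other columns affect my result.

Let me show you: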

library(plyr)

usercsv_data <- data.frame(id_str = c("89797", "12387231231", "1234823432", "3483487344", "89797", "1234823432"),
                           screen_name = c("A", "B", "C", "D", "A", "C"),
                           location = c("FL", "CO", "NYC", "MI", "FL", "NYC"),
                           verified = c("Y", "N", "N", "Y", "N", "Y"),
                           created = c("Sun", "Mon", "Tue", "Sun", "Tue", "Fri"),
                           friends_count = c(1,2,5,787,7, 5),
                           followers_count= c(2,4,6,897,4,3))

#         id_str screen_name location verified created friends_count followers_count
# 1       89797           A       FL        Y     Sun             1               2
# 2 12387231231           B       CO        N     Mon             2               4
# 3  1234823432           C      NYC        N     Tue             5               6
# 4  3483487344           D       MI        Y     Sun           787             897
# 5       89797           A       FL        N     Tue             7               4
# 6  1234823432           C      NYC        Y     Fri             5               3


# This gets me the max pairs when the grouping variables are unique
plyr::ddply(usercsv_data,.(id_str,screen_name),numcolwise(max))

#         id_str screen_name friends_count followers_count
# 1  1234823432           C             5               6
# 2 12387231231           B             2               4
# 3  3483487344           D           787             897
# 4       89797           A             7               4


# BUT when I apply the same technique to the whole data frame, I get the same data frame back
plyr::ddply(usercsv_data,.(id_str,screen_name, location,verified,created),numcolwise(max))

#         id_str screen_name location verified created friends_count followers_count
# 1  1234823432           C      NYC        N     Tue             5               6
# 2  1234823432           C      NYC        Y     Fri             5               3
# 3 12387231231           B       CO        N     Mon             2               4
# 4  3483487344           D       MI        Y     Sun           787             897
# 5       89797           A       FL        N     Tue             7               4
# 6       89797           A       FL        Y     Sun             1               2

But I want something like this:

#         id_str screen_name location verified created friends_count followers_count
# 1  1234823432           C      NYC        N     Tue             5               6
# 3 12387231231           B       CO        N     Mon             2               4
# 4  3483487344           D       MI        Y     Sun           787             897
# 5       89797           A       FL        N     Tue             7               4

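How can I group so that all columns are kept, but only the rows containing the maximum pair remain? Currently, when there are more grouping variables, it keeps every unique combination (as it should), but I don't know enough to search for this problem effectively.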

1 Answer:

Answer 0 (score: 3)

plyr has been retired, so we can use dplyr here: create a column that is the sum of friends_count and followers_count, then slice the row with the maximum value for each id_str and screen_name.

library(dplyr)

usercsv_data %>%
  mutate(max = rowSums(select(., friends_count, followers_count))) %>%
  group_by(id_str, screen_name) %>%
  slice(which.max(max))

Or without creating the max column.
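A minimal sketch of what the variant without a helper column might look like, assuming the sum can be computed directly inside slice(). The data frame is the one from the question; the name result and the trailing ungroup() are illustrative additions, not part of the original answer.

```r
library(dplyr)

# Same sample data as in the question
usercsv_data <- data.frame(
  id_str = c("89797", "12387231231", "1234823432", "3483487344", "89797", "1234823432"),
  screen_name = c("A", "B", "C", "D", "A", "C"),
  location = c("FL", "CO", "NYC", "MI", "FL", "NYC"),
  verified = c("Y", "N", "N", "Y", "N", "Y"),
  created = c("Sun", "Mon", "Tue", "Sun", "Tue", "Fri"),
  friends_count = c(1, 2, 5, 787, 7, 5),
  followers_count = c(2, 4, 6, 897, 4, 3)
)

# Within each (id_str, screen_name) group, keep the single row whose
# friends_count + followers_count is largest. which.max() returns the
# first such row when there are ties.
result <- usercsv_data %>%
  group_by(id_str, screen_name) %>%
  slice(which.max(friends_count + followers_count)) %>%
  ungroup()

result
```

Because only id_str and screen_name are used for grouping, the other columns (location, verified, created) come along unchanged with whichever row is kept. Note that this ranks pairs by their sum; if "largest pair" should instead mean the highest friends_count with followers_count as a tie-breaker, sorting with arrange(desc(friends_count), desc(followers_count)) before taking slice(1) per group would be one alternative.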