Question

Data:

df=data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),Year=c(2016,2015,2014,2016,2006,2006),Balance=c(100,150,65,75,150,10))

   Name Year Balance
1  John 2016     100
2  John 2015     150
3 Stacy 2014      65
4 Stacy 2016      75
5   Kat 2006     150
6   Kat 2006      10

Code:

aggregate(cbind(Year,Balance)~Name,data=df,FUN=max )

Output:

   Name Year Balance
1  John 2016     150
2   Kat 2006     150
3 Stacy 2016      75

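I want to aggregate the data frame above using the two columns Year and Balance, and I'm using the base function aggregate to do it. For each name I need the maximum balance within the most recent year. In the first row of the output, John has the latest year (2016) but the balance from 2015, which is not what I need: it should output 100 rather than 150. Where am I going wrong?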

Answer 1

Somewhat ironically, aggregate is a poor tool for aggregating. You could make it work, but I'd do this instead:

library(data.table)

setDT(df)[order(-Year, -Balance), .SD[1], by = Name]
#    Name Year Balance
#1:  John 2016     100
#2: Stacy 2016      75
#3:   Kat 2006     150
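
On why the original call misleads: aggregate applies FUN to each column on the left-hand side independently, so max(Year) and max(Balance) can come from different rows. A minimal base-R sketch of one way to "make it work" is to split the job into steps. The names latest, recent and result are just illustrative, and this is one possible approach rather than the only one:

```r
df <- data.frame(
  Name    = c("John", "John", "Stacy", "Stacy", "Kat", "Kat"),
  Year    = c(2016, 2015, 2014, 2016, 2006, 2006),
  Balance = c(100, 150, 65, 75, 150, 10)
)

# Step 1: most recent year for each name.
latest <- aggregate(Year ~ Name, data = df, FUN = max)

# Step 2: keep only the rows that fall in that most recent year.
recent <- merge(df, latest, by = c("Name", "Year"))

# Step 3: largest balance among the remaining rows.
result <- aggregate(Balance ~ Name + Year, data = recent, FUN = max)
result
```

With the sample data this gives John 100, Stacy 75 and Kat 150, one row per name.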

Answer 2

I'd suggest using the dplyr library:

library(dplyr)

data.frame(Name=c("John","John","Stacy","Stacy","Kat","Kat"),
           Year=c(2016,2015,2014,2016,2006,2006),
           Balance=c(100,150,65,75,150,10)) %>% #create the dataframe
    as_tibble() %>% #convert it to a tibble (tbl_df() is deprecated)
    group_by(Name, Year) %>% #group it by Name and Year
    summarise(maxBalance=max(Balance)) %>% # calculate the maximum for each group
    group_by(Name) %>% # group the resulting dataframe by Name
    top_n(1,Year) # keep only the most recent year of each group

Answer 3

Here is another solution that doesn't need the data.table package.

First sort the data frame,

df <- df[order(-df$Year, -df$Balance),]

then pick the first row for each name:

df[!duplicated(df$Name),]
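
Put together, the two steps above can be wrapped so that the original df is left untouched. This is only a sketch: latest_max_balance is a hypothetical helper name, not something from base R.

```r
df <- data.frame(
  Name    = c("John", "John", "Stacy", "Stacy", "Kat", "Kat"),
  Year    = c(2016, 2015, 2014, 2016, 2006, 2006),
  Balance = c(100, 150, 65, 75, 150, 10)
)

# Hypothetical helper: sort newest year first (largest balance first
# within a year), then keep the first row encountered for each name.
latest_max_balance <- function(d) {
  sorted <- d[order(-d$Year, -d$Balance), ]
  out <- sorted[!duplicated(sorted$Name), ]
  rownames(out) <- NULL  # drop the row labels carried over from d
  out
}

result <- latest_max_balance(df)
result
```

duplicated() flags every occurrence of a name after the first, so after sorting, the first occurrence is the row with the latest year and, within that year, the highest balance.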

Aggregate function in R using two columns simultaneously
