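I found the post "Find rows in dataframe with maximum values grouped by values in another column", which already discusses one solution. I am applying that solution repeatedly to find the row indices with the maximum quantity. However, my solution is very ugly: it is procedural rather than vectorized.
Here is my dummy data: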
dput(Data)
structure(list(Order_Year = c(1999, 1999, 1999, 1999, 1999, 1999,
1999, 2000, 2000, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2002,
2002, 2002, 2002), Ship_Year = c(1997, 1998, 1999, 2000, 2001,
2002, NA, 1997, NA, 1997, 1998, 1999, 2000, 2001, 2002, NA, 1997,
1998, 1999, 2000), Yen = c(202598.2, 0, 0, 0, 0, 0, 2365901.62,
627206.75998, 531087.43, 122167.02, 143855.55, 0, 0, 0, 0, 53650.389998,
17708416.3198, 98196.4, 31389, 0), Units = c(37, 1, 8, 5, 8,
8, 730, 99, 91, 195, 259, 4, 1, 3, 3, 53, 3844, 142, 63, 27)), .Names = c("Order_Year",
"Ship_Year", "Yen", "Units"), row.names = c(NA, 20L), class = "data.frame")
For each Order_Year, I want to find the Ship_Year with the highest Yen and the Ship_Year with the highest Units.
Here is what I tried:
a<-do.call("rbind", by(Data, Data$Order_Year, function(x) x[which.max(x$Yen), ]))
rownames(a)<-NULL
a$Yen<-NULL
a$Units<-NULL
#a has Ship_Year for which Yen is max for a given Order_Year
names(a)[2]<-"by.Yen"
#Now I'd find max year by units
b<-do.call("rbind", by(Data, Data$Order_Year, function(x) x[which.max(x$Units), ]))
rownames(b)<-NULL
b$Yen<-NULL
b$Units<-NULL
#b has Ship_Year for which Units is max for a given Order_Year
names(b)[2]<-"by.Qty"
library(dplyr) # provides %>% and left_join()
c<-a %>% left_join(b, by = "Order_Year")
Although this gives me the expected output, the approach above is very clunky. Is there a better way to solve this?
Answer 0 (score: 4)
which.max works within dplyr groups:
library(dplyr)
Data %>% group_by(Order_Year) %>%
summarise(by.Yen = Ship_Year[which.max(Yen)],
by.Units = Ship_Year[which.max(Units)])
## # A tibble: 4 × 3
## Order_Year by.Yen by.Units
## <dbl> <dbl> <dbl>
## 1 1999 NA NA
## 2 2000 1997 1997
## 3 2001 1998 1998
## 4 2002 1997 1997
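Answer 1 (score: 2)
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(Data)). Grouped by 'Order_Year', we get the index of the maximum value of 'Yen' and of 'Units' with match, then subset the corresponding value of 'Ship_Year' by that index to return the summarised output.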
library(data.table)
setDT(Data)[,.(by.Yen = Ship_Year[match(max(Yen), Yen)],
by.Units = Ship_Year[match(max(Units), Units)]) , Order_Year]
# Order_Year by.Yen by.Units
#1: 1999 NA NA
#2: 2000 1997 1997
#3: 2001 1998 1998
#4: 2002 1997 1997
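The match(max(x), x) idiom used above and which.max(x) from the other answers both return the position of the first maximum. The following is a minimal base R sketch with made-up vectors (x and y are illustrative, not from the question's data) that shows where the two agree and where they differ. Yen and Units contain no missing values in the posted data, so either form should give the same result there.

```r
x <- c(3, 7, 7, 1)            # the maximum, 7, occurs twice
idx_match <- match(max(x), x) # position of the first element equal to the maximum
idx_which <- which.max(x)     # position of the first maximum
c(idx_match, idx_which)

# With a missing value the two differ:
# max(y) is NA, so match() returns the position of the NA,
# while which.max() skips NA and returns the position of the 3.
y <- c(1, NA, 3)
na_match <- match(max(y), y)
na_which <- which.max(y)
c(na_match, na_which)
```

If the value columns could contain NA, which.max may be the safer choice of the two.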
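If there are many columns, instead of doing this separately for each one, we can specify the columns of interest in .SDcols, group by 'Order_Year', loop through the Subset of Data.table (.SD) to get the index of the maximum value of each column, unlist the list output, subset 'Ship_Year' by that index, convert the result to a list (as.list), and set the names of the new columns to 'by.Yen' and 'by.Units'.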
setnames(setDT(Data)[, as.list(Ship_Year[unlist(lapply(.SD,
which.max))]), Order_Year, .SDcols = c("Yen", "Units")],
2:3, c("by.Yen", "by.Units"))[]
# Order_Year by.Yen by.Units
#1: 1999 NA NA
#2: 2000 1997 1997
#3: 2001 1998 1998
#4: 2002 1997 1997
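To make the per-group step easier to follow, here is a rough base R sketch of what happens inside a single group. It uses the Order_Year 2001 rows copied from the question's data; the names g, idx and out are only illustrative, and the plain data frame g stands in for what data.table would pass as .SD plus the Ship_Year column.

```r
# Rows of the question's data where Order_Year == 2001
g <- data.frame(
  Ship_Year = c(1997, 1998, 1999, 2000, 2001, 2002, NA),
  Yen       = c(122167.02, 143855.55, 0, 0, 0, 0, 53650.389998),
  Units     = c(195, 259, 4, 1, 3, 3, 53)
)

# One index per column of interest: the row holding that column's maximum
idx <- unlist(lapply(g[c("Yen", "Units")], which.max))

# Look up Ship_Year at those rows and return one named list element per column
out <- setNames(as.list(g$Ship_Year[idx]), c("by.Yen", "by.Units"))
out
```

Returning a list from each group is what lets data.table spread the values into separate columns, which setnames then renames.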
Answer 2 (score: 2)
Using base R:
a1 <- with(df1,
by(data = df1,
INDICES = Order_Year,
FUN = function(x) list(Yen = x$Ship_Year[which.max(x$Yen)],
Units = x$Ship_Year[which.max(x$Units)])))
do.call("rbind", lapply(a1, function(x) data.frame(x)))
# Yen Units
# 1999 NA NA
# 2000 1997 1997
# 2001 1998 1998
# 2002 1997 1997
Data:
df1 <- structure(list(Order_Year = c(1999, 1999, 1999, 1999, 1999, 1999, 1999,
2000, 2000, 2001, 2001, 2001, 2001, 2001,
2001, 2001, 2002, 2002, 2002, 2002),
Ship_Year = c(1997, 1998, 1999, 2000, 2001, 2002, NA,
1997, NA, 1997, 1998, 1999, 2000, 2001,
2002, NA, 1997, 1998, 1999, 2000),
Yen = c(202598.2, 0, 0, 0, 0, 0, 2365901.62, 627206.75998,
531087.43, 122167.02, 143855.55, 0, 0, 0, 0,
53650.389998, 17708416.3198, 98196.4, 31389, 0),
Units = c(37, 1, 8, 5, 8, 8, 730, 99, 91, 195, 259, 4,
1, 3, 3, 53, 3844, 142, 63, 27)),
.Names = c("Order_Year", "Ship_Year", "Yen", "Units"),
row.names = c(NA, 20L),
class = "data.frame")