The head of my data frame looks like this:
structure(list(wbcode = c("ARG", "ARG", "ARG", "ARG", "ARG",
"ARG", "ARG", "ARG", "ARG", "ARG", "ARG", "ARG", "ARG", "ARG",
"ARG", "ARG", "ARG", "ARG", "ARG", "ARG", "ARG", "ARG", "ARG",
"ARG", "ARG", "ARG"), End = c(NA, NA, NA, NA, NA, NA, 1982, NA,
NA, NA, NA, NA, NA, NA, NA, 1991, NA, NA, NA, NA, NA, 1995, NA,
NA, NA, NA), LS = c(0.958041958041958, 1.20320197044335, 1.16087598763312,
0.354430888167198, 0.0475120757386165, 0.0236186492578896, 0.0916911204214743,
0.14338253921938, 0.408800511837039, 0.385495983810026, 0.244688077879152,
NA, NA, NA, NA, NA, 1.23774478543667, 1.06301680926773, 0.670834486120376,
0.60283371506345, 0.437946526596944, 0.468570146238378, 0.30623825822946,
0.0241300985598649, 0.0201213236433166, 0.0223558659752478),
year = c("1974", "1975", "1976", "1977", "1978", "1979",
"1980", "1981", "1982", "1983", "1984", "1985", "1986", "1987",
"1988", "1989", "1990", "1991", "1992", "1993", "1994", "1995",
"1996", "1997", "1998", "1999")), row.names = c(NA, -26L), class = c("tbl_df",
"tbl", "data.frame"))
What I want to achieve is to create a new column LS_max that contains the maximum value of LS between year and End (if End exists). The resulting data frame would look like this:
# A tibble: 26 x 5
# wbcode End LS year LS_max
# <chr> <dbl> <dbl> <chr> <dbl>
# 1 ARG NA 0.958 1974 NA
# 2 ARG NA 1.20 1975 NA
# 3 ARG NA 1.16 1976 NA
# 4 ARG NA 0.354 1977 NA
# 5 ARG NA 0.0475 1978 NA
# 6 ARG NA 0.0236 1979 NA
# 7 ARG 1982 0.0917 1980 0.409
# 8 ARG NA 0.143 1981 NA
# 9 ARG NA 0.409 1982 NA
#10 ARG NA 0.385 1983 NA
#11 ARG NA 0.245 1984 NA
#12 ARG NA NA 1985 NA
#13 ARG NA NA 1986 NA
#14 ARG NA NA 1987 NA
#15 ARG NA NA 1988 NA
#16 ARG 1991 NA 1989 1.24
#17 ARG NA 1.24 1990 NA
#18 ARG NA 1.06 1991 NA
#19 ARG NA 0.671 1992 NA
#20 ARG NA 0.603 1993 NA
#21 ARG NA 0.438 1994 NA
#22 ARG 1995 0.469 1995 0.469
#23 ARG NA 0.306 1996 NA
#24 ARG NA 0.0241 1997 NA
#25 ARG NA 0.0201 1998 NA
#26 ARG NA 0.0224 1999 NA
Please note that the original data frame contains more than one wbcode. Any help would be greatly appreciated.
Answer 0 (score: 1)
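One option is to create a grouping column that increments at each non-NA value in the End column, take the max of LS within each group, and remove the grouping column afterwards: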
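library(dplyr)
df1 %>%
  # start a new group at every non-NA End (grouped together with wbcode)
  group_by(wbcode, grp = cumsum(!is.na(End))) %>%
  # NA^is.na(End) is 1 where End is present and NA otherwise,
  # so the group max is kept only on rows that have an End
  mutate(LS_max = max(LS, na.rm = TRUE) * NA^is.na(End)) %>%
  ungroup() %>%
  select(-grp) %>%
  as.data.frame()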
# wbcode End LS year LS_max
#1 ARG NA 0.95804196 1974 NA
#2 ARG NA 1.20320197 1975 NA
#3 ARG NA 1.16087599 1976 NA
#4 ARG NA 0.35443089 1977 NA
#5 ARG NA 0.04751208 1978 NA
#6 ARG NA 0.02361865 1979 NA
#7 ARG 1982 0.09169112 1980 0.4088005
#8 ARG NA 0.14338254 1981 NA
#9 ARG NA 0.40880051 1982 NA
#10 ARG NA 0.38549598 1983 NA
#11 ARG NA 0.24468808 1984 NA
#12 ARG NA NA 1985 NA
#13 ARG NA NA 1986 NA
#14 ARG NA NA 1987 NA
#15 ARG NA NA 1988 NA
#16 ARG 1991 NA 1989 1.2377448
#17 ARG NA 1.23774479 1990 NA
#18 ARG NA 1.06301681 1991 NA
#19 ARG NA 0.67083449 1992 NA
#20 ARG NA 0.60283372 1993 NA
#21 ARG NA 0.43794653 1994 NA
#22 ARG 1995 0.46857015 1995 0.4685701
#23 ARG NA 0.30623826 1996 NA
#24 ARG NA 0.02413010 1997 NA
#25 ARG NA 0.02012132 1998 NA
#26 ARG NA 0.02235587 1999 NA
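To see how the two building blocks of this answer behave, here is a minimal sketch on a toy vector (the vector and the names grp and mask are made up for illustration and are not the data from the question):

```r
# Toy vector, not the data from the question
End <- c(NA, NA, 1982, NA, NA, 1991, NA)

# A new group starts at every non-NA End; rows before the first one get 0
grp <- cumsum(!is.na(End))
grp
# 0 0 1 1 1 2 2

# NA^1 is NA and NA^0 is 1, so multiplying by this mask keeps a value
# only on rows where End is present
mask <- NA^is.na(End)
mask
# NA NA 1 NA NA 1 NA
```

Note that the group maximum is taken over every row up to the next non-NA End. In the example data this coincides with the year-to-End window, but it may differ if a larger LS value falls after End and before the next non-NA End.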
Answer 1 (score: 1)
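This can be done with a self join that uses a compound join condition together with aggregation.
It left joins to each row of the a instance of DF all rows of the b instance of DF that have the same wbcode and satisfy the between condition.
Then, for each a row, we take the maximum of the LS values joined to it from b.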
library(sqldf)
sqldf("select a.*, max(b.LS) as LS_max
from DF a
left join DF b on a.wbcode = b.wbcode and b.year between a.year and a.End
group by a.rowid")
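The same join-then-aggregate logic can be sketched in base R as a cross-check. This is only a minimal sketch under stated assumptions, not part of the original answer: window_max and the toy data frame are hypothetical names introduced here, and year is assumed to be a character column that converts cleanly to a number, as in the question.

```r
# Hypothetical helper: for each row with a non-NA End, take the max of LS
# over rows of the same wbcode whose year lies between year and End
window_max <- function(df) {
  yr <- as.numeric(df$year)  # year is stored as character in the example
  vapply(seq_len(nrow(df)), function(i) {
    if (is.na(df$End[i])) return(NA_real_)
    in_window <- df$wbcode == df$wbcode[i] & yr >= yr[i] & yr <= df$End[i]
    vals <- df$LS[in_window]
    # mirror SQL max(): ignore missing values, NA if nothing is left
    if (all(is.na(vals))) NA_real_ else max(vals, na.rm = TRUE)
  }, numeric(1))
}

# Toy input, not the data from the question
toy <- data.frame(
  wbcode = c("A", "A", "A", "A", "B", "B"),
  End    = c(2001, NA, NA, NA, NA, 2001),
  LS     = c(0.1, 0.5, 0.9, 0.2, 0.3, NA),
  year   = c("2000", "2001", "2002", "2003", "2000", "2001"),
  stringsAsFactors = FALSE
)
toy$LS_max <- window_max(toy)
# toy$LS_max is 0.5 in the first row and NA in all other rows
```

The first row only sees the years 2000 and 2001 of wbcode A, so the larger value 0.9 from 2002 is excluded. The loop is quadratic in the number of rows, so it is meant for checking results on small inputs rather than for large data.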