I have a data.frame in R, with sample data as follows:
dat <- data.frame(NAME=c("NAME1","NAME1","NAME1","NAME1","NAME2","NAME2","NAME2","NAME2") , SURVEY_YEAR =c(1947,1958,1978,1987,1963,1991,2004,1993), REFERENCE_YEAR=c(1934,1947,1974,1947,1944,1987,1993,1987), VALUE=c(10,15,13,20,-2,7,12,-19))
dat
NAME SURVEY_YEAR REFERENCE_YEAR VALUE
1 NAME1 1947 1934 10
2 NAME1 1958 1947 15
3 NAME1 1978 1974 13
4 NAME1 1987 1947 20
5 NAME2 1963 1944 -2
6 NAME2 1991 1987 7
7 NAME2 2004 1993 12
8 NAME2 1993 1987 -19
How can I first sort it by REFERENCE_YEAR (from lowest to highest):
NAME SURVEY_YEAR REFERENCE_YEAR VALUE
1 NAME1 1947 1934 10
2 NAME1 1958 1947 15
3 NAME1 1987 1947 20
4 NAME1 1978 1974 13
5 NAME2 1963 1944 -2
6 NAME2 1991 1987 7
7 NAME2 1993 1987 -19
8 NAME2 2004 1993 12
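Then, if the same year appears more than once in REFERENCE_YEAR, how can I remove the row that covers the longer period (from REFERENCE_YEAR to SURVEY_YEAR), and write dat with those rows removed to a new data.frame?
With the sample data, the final data.frame should look like this: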
NAME SURVEY_YEAR REFERENCE_YEAR VALUE
1 NAME1 1947 1934 10
2 NAME1 1958 1947 15
3 NAME1 1978 1974 13
4 NAME2 1963 1944 -2
5 NAME2 1991 1987 7
6 NAME2 2004 1993 12
Answer 0: (score: 0)
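The first step is to order by REFERENCE_YEAR and SURVEY_YEAR. Within each REFERENCE_YEAR, the row with the shortest interval is then sorted first, so duplicated() marks it as not duplicated and marks the longer-interval rows as duplicates. Logical indexing with !duplicated() then keeps only the shortest-interval rows: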
> dat2 <- dat[ order(dat$REFERENCE_YEAR, dat$SURVEY_YEAR) , ]
> dat2 <- dat2[ !duplicated( dat2$REFERENCE_YEAR) , ]
> dat2
NAME SURVEY_YEAR REFERENCE_YEAR VALUE
1 NAME1 1947 1934 10
5 NAME2 1963 1944 -2
2 NAME1 1958 1947 15
3 NAME1 1978 1974 13
6 NAME2 1991 1987 7
7 NAME2 2004 1993 12
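Note that the code above deduplicates on REFERENCE_YEAR alone. That works for the sample data because each repeated REFERENCE_YEAR belongs to a single NAME, but it would also drop rows if two different names shared a reference year. A minimal sketch of a per-NAME variant, assuming duplicates should only be compared within the same NAME (the names `sorted` and `deduped` are made up for this illustration):

```r
# Sample data copied from the question
dat <- data.frame(
  NAME           = c("NAME1", "NAME1", "NAME1", "NAME1",
                     "NAME2", "NAME2", "NAME2", "NAME2"),
  SURVEY_YEAR    = c(1947, 1958, 1978, 1987, 1963, 1991, 2004, 1993),
  REFERENCE_YEAR = c(1934, 1947, 1974, 1947, 1944, 1987, 1993, 1987),
  VALUE          = c(10, 15, 13, 20, -2, 7, 12, -19)
)

# Sort by NAME, then REFERENCE_YEAR, then SURVEY_YEAR, so that within each
# (NAME, REFERENCE_YEAR) pair the shortest period comes first
sorted <- dat[order(dat$NAME, dat$REFERENCE_YEAR, dat$SURVEY_YEAR), ]

# Assumption: duplicates are judged on the (NAME, REFERENCE_YEAR) pair,
# not on REFERENCE_YEAR alone; keep only the first row of each pair
deduped <- sorted[!duplicated(sorted[c("NAME", "REFERENCE_YEAR")]), ]

# Renumber the rows 1..n
rownames(deduped) <- NULL
deduped
```

On the sample data this should give the six rows, in the same order, that the question lists as the desired result.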
Answer 1: (score: 0)
BondedDust left an elegant answer. Mine is much longer than his, but let me leave it here anyway.
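library(dplyr)

dat %>%
    # length of the period covered by each row
    mutate(gap = SURVEY_YEAR - REFERENCE_YEAR) %>%
    # shortest gap first within each REFERENCE_YEAR
    arrange(REFERENCE_YEAR, gap) %>%
    group_by(NAME, REFERENCE_YEAR) %>%
    # keep the row(s) whose gap equals the smallest gap in the group
    filter(gap == gap[1]) %>%
    arrange(NAME, REFERENCE_YEAR)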
# NAME SURVEY_YEAR REFERENCE_YEAR VALUE gap
#1 NAME1 1947 1934 10 13
#2 NAME1 1958 1947 15 11
#3 NAME1 1978 1974 13 4
#4 NAME2 1963 1944 -2 19
#5 NAME2 1991 1987 7 4
#6 NAME2 2004 1993 12 11
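For comparison, the same per-group selection can probably be written more compactly with slice_min(), which is available in dplyr 1.0.0 and later. This is a sketch rather than part of the original answer: the name `shortest` is made up here, and with_ties = FALSE is an assumption that exactly one row per group should survive even when two gaps are equal.

```r
library(dplyr)

# Sample data copied from the question
dat <- data.frame(
  NAME           = c("NAME1", "NAME1", "NAME1", "NAME1",
                     "NAME2", "NAME2", "NAME2", "NAME2"),
  SURVEY_YEAR    = c(1947, 1958, 1978, 1987, 1963, 1991, 2004, 1993),
  REFERENCE_YEAR = c(1934, 1947, 1974, 1947, 1944, 1987, 1993, 1987),
  VALUE          = c(10, 15, 13, 20, -2, 7, 12, -19)
)

shortest <- dat %>%
  group_by(NAME, REFERENCE_YEAR) %>%
  # within each group keep the single row with the smallest period length
  slice_min(SURVEY_YEAR - REFERENCE_YEAR, n = 1, with_ties = FALSE) %>%
  ungroup()

shortest
```

Unlike the pipeline above, this version does not add a gap column, and the rows come back ordered by the grouping keys (NAME, then REFERENCE_YEAR).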