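I have data from a number of surveys. Each survey can be submitted multiple times with updated values. Every survey/row in the dataset has a date on which the survey was submitted (created). I want to merge the rows for each survey, keeping the date of the first submission but the remaining data from the last submission.
A simple example: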
#> survey created var1 var2
#> 1 s1 2020-01-01 10 30
#> 2 s2 2020-01-02 10 90
#> 3 s2 2020-01-03 20 20
#> 4 s3 2020-01-01 45 5
#> 5 s3 2020-01-02 50 50
#> 6 s3 2020-01-03 30 10
Desired result:
#> survey created var1 var2
#> 1 s1 2020-01-01 10 30
#> 2 s2 2020-01-02 20 20
#> 3 s3 2020-01-01 30 10
Sample data:
df <- data.frame(survey = c("s1", "s2", "s2", "s3", "s3", "s3"),
created = as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01", "2020-01-02", "2020-01-03"), format = "%Y-%m-%d", tz = "GMT"),
var1 = c(10, 10, 20, 45, 50, 30),
var2 = c(30, 90, 20, 5, 50, 10),
stringsAsFactors=FALSE)
I have tried using group_by and summarize in various ways but could not get it to work. Any help would be greatly appreciated!
Answer 0: (score: 3)
After grouping by survey, set created to the first (or min) value of created within each group, then slice the last row (n()):
library(dplyr)
df %>%
group_by(survey) %>%
mutate(created = as.Date(first(created))) %>%
slice(n())
# A tibble: 3 x 4
# Groups: survey [3]
# survey created var1 var2
# <chr> <date> <dbl> <dbl>
#1 s1 2020-01-01 10 30
#2 s2 2020-01-02 20 20
#3 s3 2020-01-01 30 10
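Or using base R:
# first() comes from dplyr, so take the first element with plain indexing instead
transform(df, created = ave(created, survey, FUN = function(x) x[1])
)[!duplicated(df$survey, fromLast = TRUE), ]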
Answer 1: (score: 2)
After setting created to the first date in each group, we can take the last value of every column.
library(dplyr)
df %>%
group_by(survey) %>%
mutate(created = as.Date(first(created))) %>%
summarise(across(created:var2, last))
#In older version use `summarise_at`
#summarise_at(vars(created:var2), last)
# A tibble: 3 x 4
# survey created var1 var2
# <chr> <date> <dbl> <dbl>
#1 s1 2020-01-01 10 30
#2 s2 2020-01-02 20 20
#3 s3 2020-01-01 30 10
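
Both answers rely on the rows of each survey already being in submission order. As a hedged variant, the sketch below sorts by created first and then does everything in a single summarise call. It assumes dplyr >= 1.0.0 (for across and .groups); the name res is only illustrative, and created stays a POSIXct value here rather than being converted with as.Date.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(
  survey  = c("s1", "s2", "s2", "s3", "s3", "s3"),
  created = as.POSIXct(c("2020-01-01", "2020-01-02", "2020-01-03",
                         "2020-01-01", "2020-01-02", "2020-01-03"),
                       format = "%Y-%m-%d", tz = "GMT"),
  var1 = c(10, 10, 20, 45, 50, 30),
  var2 = c(30, 90, 20, 5, 50, 10),
  stringsAsFactors = FALSE
)

res <- df %>%
  arrange(survey, created) %>%     # make first/last follow submission time
  group_by(survey) %>%
  summarise(
    created = first(created),      # date of the earliest submission
    across(var1:var2, last),       # values from the latest submission
    .groups = "drop"               # return an ungrouped tibble
  )

print(res)
```

With the sample data this should give one row per survey, matching the desired result in the question. If the input is not guaranteed to be sorted, the arrange step is what keeps first and last meaningful.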