I have a dataset that looks like this:
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1970,1970,1971,1980,1980,1981,1982),
"Val" = c(2,3,-2,5,2,5,3,5))
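I have multiple observations for each id and time identifier; for example, there are 3 different values for Alpha in 1970. I would like to keep only one observation per id/Year, specifically the last observation that appears within each id/Year. The final dataset should look like this: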
final <- data.frame("id" = c("Alpha","Alpha","Beta","Beta","Beta"),
"Year" = c(1970,1971,1980,1981,1982),
"Val" = c(-2,5,5,3,5))
Does anyone know how I can solve this?
Thanks a lot for your help.
Answer 0 (score: 2)
If you are open to a data.table solution, this can be done quite concisely:
library(data.table)
setDT(df)[, .SD[.N], by = c("id", "Year")]
#> id Year Val
#> 1: Alpha 1970 -2
#> 2: Alpha 1971 5
#> 3: Beta 1980 5
#> 4: Beta 1981 3
#> 5: Beta 1982 5
by = c("id", "Year") groups the data.table by id and Year, and .SD[.N] then returns the last row within each such group (.N is the number of rows in the current group).
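As a possible alternative within data.table, the same "last row per group" result can be sketched with unique() and its fromLast argument, which avoids building .SD for every group. This is only an illustration on the question's example data; the names dt and last_rows are made up for the sketch.

```r
library(data.table)

# Same example data as in the question (dt is a name chosen for this sketch).
dt <- data.table(
  id   = c("Alpha", "Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta", "Beta"),
  Year = c(1970, 1970, 1970, 1971, 1980, 1980, 1981, 1982),
  Val  = c(2, 3, -2, 5, 2, 5, 3, 5)
)

# Treat rows as duplicates when id and Year match, and keep the last
# occurrence of each combination instead of the first.
last_rows <- unique(dt, by = c("id", "Year"), fromLast = TRUE)
print(last_rows)
```

Like .SD[.N], this relies on the current row order of the table, so sort first if "last" should mean something other than position.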
Answer 1 (score: 1)
How about this?
library(tidyverse)
df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"),
"Year" = c(1970,1970,1970,1971,1980,1980,1981,1982),
"Val" = c(2,3,-2,5,2,5,3,5))
final <-
df %>%
group_by(id, Year) %>%
slice(n()) %>%
ungroup()
final
#> # A tibble: 5 x 3
#> id Year Val
#> <fct> <dbl> <dbl>
#> 1 Alpha 1970 -2
#> 2 Alpha 1971 5
#> 3 Beta 1980 5
#> 4 Beta 1981 3
#> 5 Beta 1982 5
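Created on 2019-09-29 by the reprex package (v0.3.0)
This translates to: "within each id-Year group, take only the row whose row number equals the group size, i.e. the last row under the current ordering."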
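You could also use filter(), e.g. filter(row_number() == n()), or distinct() (so that you do not have to group), e.g. distinct(id, Year, .keep_all = TRUE). Note, however, that distinct() keeps the first row of each combination, so you would need to reverse the row order first.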
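The distinct() variant mentioned above could look roughly like the following. This is a minimal sketch on the question's example data; reversing with slice(n():1) and restoring the order with arrange() are choices made for this illustration, not part of the original answer.

```r
library(dplyr)

df <- data.frame(
  id   = c("Alpha", "Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta", "Beta"),
  Year = c(1970, 1970, 1970, 1971, 1980, 1980, 1981, 1982),
  Val  = c(2, 3, -2, 5, 2, 5, 3, 5)
)

final <- df %>%
  slice(n():1) %>%                          # reverse the row order
  distinct(id, Year, .keep_all = TRUE) %>%  # first row per id/Year is now the original last row
  arrange(id, Year)                         # restore a sorted order (assumes sorting by id, Year is wanted)

final
```

If the original row order is not sorted by id and Year, the final arrange() step changes the ordering relative to the input, so drop or adapt it as needed.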
Answer 2 (score: 1)
With base R:
aggregate(Val ~ ., df, tail, 1)
# id Year Val
#1 Alpha 1970 -2
#2 Alpha 1971 5
#3 Beta 1980 5
#4 Beta 1981 3
#5 Beta 1982 5
If we need to select the first row of each group instead:
aggregate(Val ~ ., df, head, 1)
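Another base R option, shown here only as a sketch, is duplicated() with fromLast = TRUE. It keeps whole rows in their original order, which may be convenient when the data frame has more columns than just Val. The names keep and final are chosen for this illustration.

```r
df <- data.frame(
  id   = c("Alpha", "Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta", "Beta"),
  Year = c(1970, 1970, 1970, 1971, 1980, 1980, 1981, 1982),
  Val  = c(2, 3, -2, 5, 2, 5, 3, 5)
)

# Scanning from the bottom, a row counts as a duplicate if the same
# id/Year combination appears again further down, so the rows that are
# NOT flagged are the last occurrences.
keep <- !duplicated(df[c("id", "Year")], fromLast = TRUE)

final <- df[keep, ]
rownames(final) <- NULL  # reset row names after subsetting
final
```

Dropping fromLast = TRUE keeps the first row of each id/Year combination instead.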