Remove duplicate rows based on a condition (position)

Time: 2019-09-29 15:13:34

Tags: r duplicates data-manipulation

I have a dataset that looks like this:

df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"), 
                 "Year" = c(1970,1970,1970,1971,1980,1980,1981,1982), 
                 "Val" = c(2,3,-2,5,2,5,3,5))

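I have multiple observations for each id and time identifier; for example, there are three different values for Alpha in 1970. I would like to keep only one observation per id/Year, specifically the last one that appears within each id/Year. The final dataset should look like this: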

final <- data.frame("id" = c("Alpha","Alpha","Beta","Beta","Beta"), 
                    "Year" = c(1970,1971,1980,1981,1982), 
                    "Val" = c(-2,5,5,3,5))

Does anyone know how I can approach this?

Thanks a lot for your help.

3 answers:

Answer 0 (score: 2)

If you are open to a data.table solution, this can be done quite concisely:

library(data.table)

setDT(df)[, .SD[.N], by = c("id", "Year")]
#>       id Year Val
#> 1: Alpha 1970  -2
#> 2: Alpha 1971   5
#> 3:  Beta 1980   5
#> 4:  Beta 1981   3
#> 5:  Beta 1982   5

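by = c("id", "Year") groups the data.table by id and Year, and .SD[.N] then returns the last row of each such group (.N is the number of rows in the group, so it indexes the final row of .SD).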
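
For comparison, here is a minimal sketch of two other data.table idioms that should give the same rows. They are not part of the answer above; the names dt, last_idx, last_rows_a and last_rows_b are only illustrative, and the data is copied from the question.

```r
library(data.table)

# Example data copied from the question
df <- data.frame(id   = c("Alpha", "Alpha", "Alpha", "Alpha",
                          "Beta", "Beta", "Beta", "Beta"),
                 Year = c(1970, 1970, 1970, 1971, 1980, 1980, 1981, 1982),
                 Val  = c(2, 3, -2, 5, 2, 5, 3, 5))

# Work on a copy so the original data.frame is left untouched
dt <- as.data.table(df)

# Option A: collect the row index of the last row in each group (.I[.N]),
# then subset the table once with those indices
last_idx    <- dt[, .I[.N], by = .(id, Year)]$V1
last_rows_a <- dt[last_idx]

# Option B: unique() on the key columns, keeping the last occurrence
last_rows_b <- unique(dt, by = c("id", "Year"), fromLast = TRUE)

print(last_rows_a)
print(last_rows_b)
```

On tables with a very large number of groups these variants may run faster than .SD[.N], but that is an assumption worth benchmarking on your own data.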

Answer 1 (score: 1)

How about this?

library(tidyverse)

df <- data.frame("id" = c("Alpha", "Alpha", "Alpha","Alpha","Beta","Beta","Beta","Beta"), 
                 "Year" = c(1970,1970,1970,1971,1980,1980,1981,1982), 
                 "Val" = c(2,3,-2,5,2,5,3,5))

final <- 
  df %>% 
  group_by(id, Year) %>% 
  slice(n()) %>% 
  ungroup()

final
#> # A tibble: 5 x 3
#>   id     Year   Val
#>   <fct> <dbl> <dbl>
#> 1 Alpha  1970    -2
#> 2 Alpha  1971     5
#> 3 Beta   1980     5
#> 4 Beta   1981     3
#> 5 Beta   1982     5

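Created on 2019-09-29 by the reprex package (v0.3.0)

This reads as: "within each id-Year group, keep only the row whose row number equals the group size, i.e. the last row in the current order."

You could also use filter(), e.g. filter(row_number() == n()), or distinct() (so that you do not have to group), e.g. distinct(id, Year, .keep_all = TRUE). Note, however, that distinct() keeps the first row of each combination, so you would need to reverse the row order first.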
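
The two alternatives just mentioned could be sketched as follows. This is only an illustration on the question's data; last_by_filter and last_by_distinct are made-up names, and reversing the rows with slice(n():1) is one of several possible ways to do it.

```r
library(dplyr)

# Example data copied from the question
df <- data.frame(id   = c("Alpha", "Alpha", "Alpha", "Alpha",
                          "Beta", "Beta", "Beta", "Beta"),
                 Year = c(1970, 1970, 1970, 1971, 1980, 1980, 1981, 1982),
                 Val  = c(2, 3, -2, 5, 2, 5, 3, 5))

# filter(): within each id/Year group keep the row whose position
# equals the group size, i.e. the last row of the group
last_by_filter <- df %>%
  group_by(id, Year) %>%
  filter(row_number() == n()) %>%
  ungroup()

# distinct(): keeps the FIRST row per id/Year, so reverse the rows,
# deduplicate, then reverse again to restore the original ordering
last_by_distinct <- df %>%
  slice(n():1) %>%
  distinct(id, Year, .keep_all = TRUE) %>%
  slice(n():1)

print(last_by_filter)
print(last_by_distinct)
```

Both results should contain the same five rows as the slice(n()) version; the distinct() variant avoids grouping at the cost of the two reversals.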

Answer 2 (score: 1)

An option with base R:

aggregate(Val ~ ., df, tail, 1)
#     id Year Val
#1 Alpha 1970  -2
#2 Alpha 1971   5
#3  Beta 1980   5
#4  Beta 1981   3
#5  Beta 1982   5

If we need to select the first row instead:

aggregate(Val ~ ., df, head, 1)
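
Another base R route, sketched here as an addition rather than as part of the answer above, is duplicated() with fromLast = TRUE. In the formula Val ~ . every column other than Val is treated as a grouping variable, so the aggregate() call fits data with a single value column; the version below subsets whole rows instead. The names key, last_rows and first_rows are only illustrative.

```r
# Example data copied from the question
df <- data.frame(id   = c("Alpha", "Alpha", "Alpha", "Alpha",
                          "Beta", "Beta", "Beta", "Beta"),
                 Year = c(1970, 1970, 1970, 1971, 1980, 1980, 1981, 1982),
                 Val  = c(2, 3, -2, 5, 2, 5, 3, 5))

# Columns that define a duplicate
key <- df[c("id", "Year")]

# Scanning from the end, a row counts as a duplicate if the same id/Year
# appears later, so negating the result keeps the last row of each pair
last_rows <- df[!duplicated(key, fromLast = TRUE), ]

# Default direction: keeps the first row of each id/Year pair
first_rows <- df[!duplicated(key), ]

print(last_rows)
print(first_rows)
```

The original row names (3, 4, 6, 7, 8 for last_rows) are kept, which shows which positions survived; rownames(last_rows) <- NULL resets them if that is not wanted.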