Using R to conditionally select the last N values within a group based on another column

Asked: 2017-07-31 17:30:52

Tags: r dataframe dplyr

This question is similar to the one here about selecting the top N values by group based on a column.

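However, I want to select the last N values per group, where N depends on the value of the corresponding count column. The count is the number of occurrences of a particular name. If the count is 3 or more, I only want the last three entries; if it is less than 3, I only want the last entry.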

library(dplyr)

# Sample data
df <- data.frame(Name = c("x","x","x","x","y","y","y","z","z"), Value = c(1,2,3,4,5,6,7,8,9))

# Obtain count for each name
count <- df %>%
  group_by(Name) %>%
  summarise(Count = n_distinct(Value))

# Merge dataframe with count
merge(df, count, by=c("Name"))

# Delete the first entry for x and the first entry for z

# Desired output
data.frame(Name = c("x","x","x","y","y","y","z"), Value = c(2,3,4,5,6,7,9))

4 Answers:

Answer 0 (score: 4)

Another silly way:

df %>% group_by(Name) %>% slice(tail(row_number(), 
  if (n_distinct(Value) < 3) 1 else 3
))

# A tibble: 7 x 2
# Groups:   Name [3]
    Name Value
  <fctr> <dbl>
1      x     2
2      x     3
3      x     4
4      y     5
5      y     6
6      y     7
7      z     9

The analogue in data.table is...

library(data.table)
setDT(df)
df[, tail(.SD, if (uniqueN(Value) < 3) 1 else 3), by=Name]

The closest thing in base R is...

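with(df, {
  len = tapply(Value, Name, FUN = length)                        # rows per group
  nv  = tapply(Value, Name, FUN = function(x) length(unique(x))) # distinct values per group
  # keep the last 1 or 3 rows of each group (assumes rows are ordered by Name)
  df[ sequence(len) > rep(len - ifelse(nv < 3, 1, 3), len), ]
})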

...which was harder to come up with than it should have been.
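
For comparison, here is a minimal base R sketch of the same idea built on `ave()`, which avoids relying on the rows being ordered by Name. It is an illustration rather than part of the original answer, and it assumes the group size (row count) is the count of interest, not the number of distinct values.

```r
# Sample data from the question
df <- data.frame(Name = c("x","x","x","x","y","y","y","z","z"),
                 Value = c(1,2,3,4,5,6,7,8,9))

# Position of each row within its group, and the size of that group
pos  <- ave(seq_len(nrow(df)), df$Name, FUN = seq_along)
size <- ave(seq_len(nrow(df)), df$Name, FUN = length)

# Keep the last 3 rows of groups with at least 3 rows, otherwise the last row
n_keep <- ifelse(size >= 3, 3, 1)
result <- df[pos > size - n_keep, ]
result
```

Because `pos` and `size` are computed per row, the final subset is a single vectorised comparison.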

Answer 1 (score: 3)

Another possibility:

library(tidyverse)

df %>%
  split(.$Name) %>%
  map_df(~ if (n_distinct(.x) >= 3) tail(.x, 3) else tail(.x, 1))

which gives:

#  Name Value
#1    x     2
#2    x     3
#3    x     4
#4    y     5
#5    y     6
#6    y     7
#7    z     9

Answer 2 (score: 2)

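In base R, first split df by df$Name. Then, for each subgroup, check the number of rows and conditionally extract either the last 3 rows or the last row.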

do.call(rbind, lapply(split(df, df$Name), function(a)
    a[tail(sequence(NROW(a)), c(3,1)[(NROW(a) < 3) + 1]),]))

# Equivalent, using ifelse() instead of the index lookup
do.call(rbind, lapply(split(df, df$Name), function(a)
    a[tail(sequence(NROW(a)), ifelse(NROW(a) < 3, 1, 3)),]))
#    Name Value
#x.2    x     2
#x.3    x     3
#x.4    x     4
#y.5    y     5
#y.6    y     6
#y.7    y     7
#z      z     9
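
The `c(3,1)[(NROW(a) < 3) + 1]` expression in the first version may look cryptic. A small sketch of how it works, using the group sizes from the sample data (the variable names here are only for illustration):

```r
# A logical is coerced to 0/1 in arithmetic, so adding 1 yields a valid index
n_rows <- c(4, 3, 2)        # group sizes for x, y, z in the sample data
idx    <- (n_rows < 3) + 1  # 1 when the group has at least 3 rows, 2 otherwise
n_keep <- c(3, 1)[idx]      # look up how many trailing rows to keep
n_keep
```

The lookup picks 3 for groups with three or more rows and 1 otherwise, which is the same result `ifelse(NROW(a) < 3, 1, 3)` gives in the second version.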

For three conditional values:

do.call(rbind, lapply(split(df, df$Name), function(a)
      a[tail(sequence(NROW(a)), ifelse(NROW(a) >= 6, 6, ifelse(NROW(a) >= 3, 3, 1))),]))
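
Nested `ifelse()` calls get hard to read as more thresholds are added. One possible alternative, sketched here with a hypothetical helper `keep_last()` and assumed default thresholds matching the example above, is to look up the number of rows with `findInterval()`:

```r
# Sample data from the question
df <- data.frame(Name = c("x","x","x","x","y","y","y","z","z"),
                 Value = c(1,2,3,4,5,6,7,8,9))

# keep_last() is a hypothetical helper: groups with fewer than 3 rows keep 1 row,
# groups with 3 to 5 rows keep 3, and groups with 6 or more rows keep 6
keep_last <- function(a, breaks = c(0, 3, 6), sizes = c(1, 3, 6)) {
  n_keep <- sizes[findInterval(NROW(a), breaks)]
  tail(a, n_keep)
}

result <- do.call(rbind, lapply(split(df, df$Name), keep_last))
result
```

Adding another tier then only means extending `breaks` and `sizes`, rather than nesting another `ifelse()`.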

Answer 3 (score: 2)

If you are already using dplyr, the natural approach is:

library(dplyr)

# Sample data
df <- data.frame(Name = c("x","x","x","x","y","y","y","z","z"), 
                 Value = c(1,2,3,4,5,6,7,8,9))

df %>%
  group_by(Name) %>%
  mutate(Count = n_distinct(Value),
         Rank = dense_rank(desc(Value))) %>% 
  filter((Count>= 3 & Rank <= 3) | (Rank==1)) %>%
  select(-c(Count,Rank))

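Since you are only counting and ranking within the groups defined by Name, the merge is not needed. You then apply a filter on the count and rank requirements and (optionally, for cleanup) drop the Count and Rank columns. Note that dense_rank(desc(Value)) ranks by value rather than by row position, so the result coincides with the last N rows when Value is strictly increasing within each group, as it is in the sample data.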
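
The pipeline above selects rows by the rank of Value. If the rows should instead be chosen by their position within each group, a variant along these lines may work. This is a sketch, not the answerer's code, and it assumes the group's row count `n()` is the count of interest rather than `n_distinct(Value)`.

```r
library(dplyr)

# Sample data from the question
df <- data.frame(Name = c("x","x","x","x","y","y","y","z","z"),
                 Value = c(1,2,3,4,5,6,7,8,9))

result <- df %>%
  group_by(Name) %>%
  # n() is the group size; row_number() is the row's position within the group
  filter(row_number() > n() - ifelse(n() >= 3, 3, 1)) %>%
  ungroup()
result
```

No helper columns are created here, so there is nothing to drop afterwards.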