Question

I am trying to filter some of my data in R. It is formatted like this:

          id config_id alpha         begin         end day
1          1         1     5           138         139   6
2          1         2     5           137         138   6
3          1         3     5            47          48   2
4          1         3     3            46          47   2
5          1         4     3            45          46   2
6          1         4     3            43          44   2

...

          id config_id alpha         begin         end day
1          2         1     5           138         139   6
2          2         2     5           137         138   6
3          2         2     5           136         137   6
4          2         3     3            45          46   2
5          2         3     3            44          45   2
6          2         4     3            43          44   2

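My goal is to remove any configuration that begins and ends on the same day. For example, in the top example config_id 3 is not acceptable because both instances of that config_id occur on day 2. The same goes for config_id 4. In the bottom example, config_id 2 and config_id 3 are unacceptable for the same reason.

Basically, if a config_id is repeated and any value in the day column shows up more than once for that config_id, then I want to remove that config_id from the list.

Right now I am using a rather complicated lapply algorithm, but there must be a simpler way.

Thanks!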

Answer 1

Assuming your data is stored in a data frame named my_data, you can do this in several ways.

Base R

# Flag each config_id whose day values contain a repeat
same_day <- aggregate(my_data$day, my_data["config_id"], function(x) any(table(x) > 1))
names(same_day)[2] <- "same_day"
my_data <- merge(my_data, same_day, by = "config_id")
# Keep only the rows whose config_id is not flagged
my_data <- my_data[!my_data$same_day, ]
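
The same base R filter can also be written without the merge step. This is only a minimal sketch: `toy` is a hypothetical stand-in for my_data holding the first block from the question, and `ave()` is used to compute the per-group flag row by row.

```r
# Hypothetical sample data: the first block from the question (id 1 only)
toy <- data.frame(
  config_id = c(1L, 2L, 3L, 3L, 4L, 4L),
  begin     = c(138L, 137L, 47L, 46L, 45L, 43L),
  end       = c(139L, 138L, 48L, 47L, 46L, 44L),
  day       = c(6L, 6L, 2L, 2L, 2L, 2L)
)

# For each config_id, mark every row with 1 if any day value is repeated, else 0
flag <- ave(toy$day, toy$config_id,
            FUN = function(x) anyDuplicated(x) > 0)
same_day <- flag == 1

# Keep only the configurations with no repeated day
kept <- toy[!same_day, ]
print(kept)
```

With this sample, only the rows for config_id 1 and 2 should remain.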

dplyr

library(dplyr)
# Plain assignment is used because %<>% comes from magrittr, not dplyr
my_data <- my_data %>%
  group_by(config_id) %>%
  mutate(same_day = any(table(day) > 1)) %>%
  filter(!same_day)

data.table

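library(data.table)
my_data <- data.table(my_data, key = "config_id")
# One row per config_id: TRUE if any day value repeats
same_day <- my_data[, .(same_day = any(table(day) > 1)), by = "config_id"]
# Join the flag back onto the rows and keep the unflagged ones
my_data[!my_data[same_day]$same_day, ]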

Answer 2

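We can also use n_distinct from dplyr. Here, I group by 'id' and 'config_id', then use filter to remove rows. If the number of elements in a group is greater than 1 (n() > 1) and (&) the number of distinct elements in 'day' is equal to 1 (n_distinct(day) == 1), we remove that group.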

library(dplyr)
df1 %>% 
   group_by(id, config_id) %>% 
   filter(!(n()>1 & n_distinct(day)==1))
#Source: local data frame [4 x 6]
#Groups: id, config_id [4]

#     id config_id alpha begin   end   day
#   (int)     (int) (int) (int) (int) (int)
#1     1         1     5   138   139     6
#2     1         2     5   137   138     6
#3     2         1     5   138   139     6
#4     2         4     3    43    44     2
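
To see why exactly these four rows survive, the two quantities in the condition can be computed per row in base R. This is a rough sketch rather than part of the answer: `d` is a hypothetical cut-down copy of df1 (only id, config_id and day), and `ave()` stands in for the grouped `n()` and `n_distinct()` calls.

```r
# Cut-down copy of df1: only the columns the condition needs
d <- data.frame(
  id        = rep(1:2, each = 6),
  config_id = c(1L, 2L, 3L, 3L, 4L, 4L, 1L, 2L, 2L, 3L, 3L, 4L),
  day       = c(6L, 6L, 2L, 2L, 2L, 2L, 6L, 6L, 6L, 2L, 2L, 2L)
)

# Size of each (id, config_id) group, repeated on every row of the group
n_rows <- ave(d$day, d$id, d$config_id, FUN = length)

# Number of distinct days in each group, repeated on every row of the group
n_days <- ave(d$day, d$id, d$config_id, FUN = function(x) length(unique(x)))

# Drop groups that have several rows all sharing a single day
keep <- !(n_rows > 1 & n_days == 1)
print(d[keep, ])
```

The rows kept are 1, 2, 7 and 12 of the original data, which matches the output above.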

This should also work if we have different days for the same 'config_id'.

df1$day[4] <- 3
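
The effect of that change on the affected group can be checked by hand. In this small sketch, `is_dropped()` is a hypothetical helper that restates the filter condition for a single group; rows 3 and 4 of df1 form the (id 1, config_id 3) group.

```r
# Restates the filter condition for the day values of one group
is_dropped <- function(day) length(day) > 1 && length(unique(day)) == 1

day_before <- c(2L, 2L)  # days of rows 3 and 4 as originally given
day_after  <- c(2L, 3L)  # the same rows after df1$day[4] <- 3

is_dropped(day_before)   # TRUE: two rows on the same day, group removed
is_dropped(day_after)    # FALSE: two different days, group kept
```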

A similar option with data.table is uniqueN. We convert the 'data.frame' to a 'data.table' (setDT(df1)), group by 'id' and 'config_id', and subset the dataset (.SD) using the logical condition.

library(data.table)#v1.9.6+
setDT(df1)[, if(!(.N>1 & uniqueN(day) == 1L)) .SD, by = .(id, config_id)]

Data

df1 <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
2L, 2L), config_id = c(1L, 2L, 3L, 3L, 4L, 4L, 1L, 2L, 2L, 3L, 
3L, 4L), alpha = c(5L, 5L, 5L, 3L, 3L, 3L, 5L, 5L, 5L, 3L, 3L, 
3L), begin = c(138L, 137L, 47L, 46L, 45L, 43L, 138L, 137L, 136L, 
45L, 44L, 43L), end = c(139L, 138L, 48L, 47L, 46L, 44L, 139L, 
138L, 137L, 46L, 45L, 44L), day = c(6L, 6L, 2L, 2L, 2L, 2L, 6L, 
6L, 6L, 2L, 2L, 2L)), .Names = c("id", "config_id", "alpha", 
"begin", "end", "day"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))

Remove rows based on criteria of other columns in R
