Removing rows from a dataset based on multiple conditions

Time: 2017-04-14 16:47:51

Tags: r

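My dataset contains animal ID, date, year, month, and day. I need to remove every animal ID that has fewer than 40 locations (in this case, 40 rows in R) in a given year. In other words, Animal ID = 1 has 20 locations in 2001, so this individual should be removed from the dataset. Then I need to count the number of months of data in the remaining records. In other words, I need each animal ID to have >= 40 locations per year, spanning at least 6 months. Example: Animal ID 2 has > 40 rows of data in 2001 and so meets the first criterion above, but those 40 rows cover only 3 months; this individual therefore needs to be removed from the dataset as well. I can't seem to find a quick way in R to subset my dataset so that it satisfies both of these criteria.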

Preliminary code I have started working on:

newdata<-data[as.character(ave(data$Animal_ID, data$Animal_ID, FUN=length)) >= 40, ]

But I know this is not quite right.

Sample data set

dput(dataset)
structure(list(Animal_ID = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L), Date = structure(c(1L, 
2L, 39L, 46L, 43L, 53L, 55L, 57L, 62L, 72L, 77L, 77L, 78L, 79L, 
80L, 81L, 81L, 81L, 82L, 83L, 84L, 84L, 84L, 85L, 86L, 87L, 87L, 
88L, 92L, 102L, 102L, 103L, 104L, 104L, 104L, 104L, 104L, 104L, 
104L, 104L, 104L, 105L, 89L, 89L, 90L, 90L, 90L, 91L, 93L, 93L, 
94L, 95L, 96L, 96L, 97L, 97L, 98L, 98L, 98L, 98L, 98L, 98L, 98L, 
98L, 99L, 100L, 117L, 118L, 120L, 106L, 108L, 109L, 111L, 115L, 
116L, 3L, 3L, 8L, 13L, 15L, 16L, 17L, 18L, 19L, 4L, 45L, 47L, 
51L, 48L, 52L, 52L, 61L, 63L, 63L, 64L, 54L, 56L, 58L, 58L, 59L, 
60L, 60L, 60L, 71L, 73L, 74L, 75L, 76L, 76L, 65L, 66L, 66L, 67L, 
68L, 69L, 70L, 40L, 41L, 42L, 44L, 45L, 47L, 49L, 49L, 49L, 49L, 
49L, 49L, 49L, 49L, 50L, 50L, 51L, 89L, 90L, 91L, 93L, 94L, 94L, 
94L, 94L, 94L, 94L, 94L, 96L, 97L, 99L, 100L, 100L, 101L, 117L, 
118L, 118L, 119L, 120L, 121L, 106L, 107L, 107L, 108L, 109L, 110L, 
111L, 112L, 113L, 114L, 114L, 115L, 115L, 116L, 3L, 3L, 8L, 13L, 
17L, 18L, 18L, 19L, 4L, 5L, 5L, 6L, 7L, 9L, 9L, 10L, 11L, 12L, 
14L, 14L, 26L, 27L, 28L, 29L, 30L, 20L, 20L, 21L, 21L, 22L, 23L, 
24L, 25L, 34L, 35L, 37L, 38L, 31L, 32L, 33L, 36L), .Label = c("1/23/2001", 
"1/30/2001", "10/1/2002", "10/10/2002", "10/14/2002", "10/17/2002", 
"10/18/2002", "10/2/2002", "10/21/2002", "10/23/2002", "10/25/2002", 
"10/28/2002", "10/3/2002", "10/30/2002", "10/4/2002", "10/6/2002", 
"10/7/2002", "10/8/2002", "10/9/2002", "11/12/2002", "11/13/2002", 
"11/15/2002", "11/21/2002", "11/25/2002", "11/27/2002", "11/4/2002", 
"11/5/2002", "11/6/2002", "11/7/2002", "11/8/2002", "12/11/2002", 
"12/13/2002", "12/17/2002", "12/2/2002", "12/3/2002", "12/30/2002", 
"12/6/2002", "12/9/2002", "2/21/2001", "3/11/2002", "3/13/2002", 
"3/22/2002", "3/23/2001", "3/23/2002", "3/25/2002", "3/8/2001", 
"4/1/2002", "4/10/2002", "4/2/2002", "4/5/2002", "4/7/2002", 
"5/1/2002", "5/13/2001", "5/14/2002", "5/15/2001", "5/15/2002", 
"5/17/2001", "5/20/2002", "5/28/2002", "5/29/2002", "5/3/2002", 
"5/30/2001", "5/8/2002", "5/9/2002", "6/10/2002", "6/12/2002", 
"6/13/2002", "6/17/2002", "6/19/2002", "6/20/2002", "6/3/2002", 
"6/4/2001", "6/4/2002", "6/5/2002", "6/6/2002", "6/7/2002", "7/11/2002", 
"7/12/2002", "7/15/2002", "7/16/2002", "7/17/2002", "7/18/2002", 
"7/24/2002", "7/25/2002", "7/27/2002", "7/29/2002", "7/31/2002", 
"8/1/2002", "8/12/2002", "8/14/2002", "8/19/2002", "8/2/2002", 
"8/20/2002", "8/21/2002", "8/22/2002", "8/23/2002", "8/26/2002", 
"8/27/2002", "8/28/2002", "8/29/2002", "8/30/2002", "8/5/2002", 
"8/7/2002", "8/8/2002", "8/9/2002", "9/10/2002", "9/11/2002", 
"9/13/2002", "9/16/2002", "9/17/2002", "9/18/2002", "9/19/2002", 
"9/20/2002", "9/23/2002", "9/25/2002", "9/26/2002", "9/3/2002", 
"9/4/2002", "9/5/2002", "9/6/2002", "9/9/2002"), class = "factor"), 
    Year = c(2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 2001L, 
    2001L, 2001L, 2001L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 2002L, 
    2002L, 2002L, 2002L, 2002L, 2002L, 2002L), Month = c(1L, 
    1L, 2L, 3L, 3L, 5L, 5L, 5L, 5L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 
    7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 
    8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 
    8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 
    8L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L, 
    10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 3L, 4L, 4L, 
    4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 
    6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 3L, 3L, 
    3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 
    8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 
    8L, 8L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 9L, 
    9L, 9L, 9L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 10L, 10L, 
    10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 
    10L, 10L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 11L, 
    11L, 11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L), Day = c(23L, 
    30L, 21L, 8L, 23L, 13L, 15L, 17L, 30L, 4L, 11L, 11L, 12L, 
    15L, 16L, 17L, 17L, 17L, 18L, 24L, 25L, 25L, 25L, 27L, 29L, 
    31L, 31L, 1L, 2L, 5L, 5L, 7L, 8L, 8L, 8L, 8L, 8L, 8L, 8L, 
    8L, 8L, 9L, 12L, 12L, 14L, 14L, 14L, 19L, 20L, 20L, 21L, 
    22L, 23L, 23L, 26L, 26L, 27L, 27L, 27L, 27L, 27L, 27L, 27L, 
    27L, 28L, 29L, 3L, 4L, 6L, 10L, 13L, 16L, 18L, 25L, 26L, 
    1L, 1L, 2L, 3L, 4L, 6L, 7L, 8L, 9L, 10L, 25L, 1L, 7L, 10L, 
    1L, 1L, 3L, 8L, 8L, 9L, 14L, 15L, 20L, 20L, 28L, 29L, 29L, 
    29L, 3L, 4L, 5L, 6L, 7L, 7L, 10L, 12L, 12L, 13L, 17L, 19L, 
    20L, 11L, 13L, 22L, 23L, 25L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L, 2L, 5L, 5L, 7L, 12L, 14L, 19L, 20L, 21L, 21L, 21L, 21L, 
    21L, 21L, 21L, 23L, 26L, 28L, 29L, 29L, 30L, 3L, 4L, 4L, 
    5L, 6L, 9L, 10L, 11L, 11L, 13L, 16L, 17L, 18L, 19L, 20L, 
    23L, 23L, 25L, 25L, 26L, 1L, 1L, 2L, 3L, 7L, 8L, 8L, 9L, 
    10L, 14L, 14L, 17L, 18L, 21L, 21L, 23L, 25L, 28L, 30L, 30L, 
    4L, 5L, 6L, 7L, 8L, 12L, 12L, 13L, 13L, 15L, 21L, 25L, 27L, 
    2L, 3L, 6L, 9L, 11L, 13L, 17L, 30L)), .Names = c("Animal_ID", 
"Date", "Year", "Month", "Day"), class = "data.frame", row.names = c(NA, 
-211L))

2 answers:

Answer 0 (score: 0)

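One way to do this is to tabulate the values in the ID column. Then step through the table entries and remove all rows whose ID meets the criterion.

I made up some data: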

df = data.frame(ID = c(rep('otter',5),rep('beaver',3),rep('muskrat',4)),
      locations=sample(1:12))
# create the table of row counts per ID
table.ID = table(df$ID)
for (i in seq_along(table.ID)) {
    # if the number of occurrences matches the criterion
    if (table.ID[i] > 4) {
      # remove those rows by keeping only the rows whose ID
      # differs from the tabled name
      df = df[ df$ID != names(table.ID)[i], ]
    }
}

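This removes any ID that appears in more than 4 rows (locations). For the question's case, flip the test so that IDs with too few rows are the ones removed.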
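The same idea can be applied per animal and year without a loop. The following is only a sketch in base R under some assumptions: the `toy` data frame and the thresholds `min_rows` and `min_months` are made up for illustration (they stand in for the question's 40 rows and 6 months), and the column names follow the question. `ave()` accepts several grouping vectors, so it can count rows and distinct months within each Animal_ID/Year group. Note also that the attempt in the question wraps the count in `as.character()`, which makes `>= 40` a string comparison rather than a numeric one.

```r
# Hypothetical toy data: animal 1 has 3 rows, animal 2 has 5 rows over
# 3 months, animal 3 has 5 rows all in one month (all in 2001).
toy <- data.frame(
  Animal_ID = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
  Year      = 2001,
  Month     = c(1, 2, 3, 1, 1, 2, 3, 3, 5, 5, 5, 5, 5)
)

min_rows   <- 4  # stands in for the question's 40
min_months <- 3  # stands in for the question's 6

# Number of rows in each Animal_ID/Year group, repeated on every row
n_rows <- ave(seq_len(nrow(toy)), toy$Animal_ID, toy$Year, FUN = length)

# Number of distinct months in each Animal_ID/Year group
n_months <- ave(toy$Month, toy$Animal_ID, toy$Year,
                FUN = function(m) length(unique(m)))

# Keep only rows whose group satisfies both criteria
kept <- toy[n_rows >= min_rows & n_months >= min_months, ]
```

With this toy data only animal 2 passes both tests; animal 1 fails on row count and animal 3 fails on the number of months.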

Answer 1 (score: 0)

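You can do this easily with the dplyr package. Assuming the dataset is named animal_data, this is how I would run it.

Update: I admit I was careless earlier and made a big mistake. The new code below should get you the intended result, though I believe it can still be improved.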

library(dplyr)

animal_data_by_n <- animal_data %>% 
  group_by(Animal_ID, Year) %>% 
  filter(n() >= 40) # Keep only animal/year groups that have at least 40 records

animal_data_by_n_Month <- animal_data_by_n %>% 
  group_by(Animal_ID, Year) %>%  
  summarise(n_Month = n_distinct(Month))

new_output <- merge(animal_data_by_n, animal_data_by_n_Month, by=c("Animal_ID","Year"), all.x=TRUE)
Final_subset <- subset(new_output, n_Month >= 6)

You can drop the n_Month column from the final data frame afterwards.
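As one possible simplification, both criteria can be checked in a single grouped `filter()`, which avoids the merge and the helper column. This is a sketch rather than the answer's own code: the `toy` data frame and the small thresholds (4 rows, 3 months) are invented for illustration and stand in for 40 and 6.

```r
library(dplyr)

# Hypothetical toy data: animal 1 has 3 rows, animal 2 has 5 rows over
# 3 months, animal 3 has 5 rows all in one month (all in 2001).
toy <- data.frame(
  Animal_ID = c(1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3),
  Year      = 2001,
  Month     = c(1, 2, 3, 1, 1, 2, 3, 3, 5, 5, 5, 5, 5)
)

final_subset <- toy %>%
  group_by(Animal_ID, Year) %>%
  # n() is the group's row count; n_distinct(Month) counts its unique months
  filter(n() >= 4, n_distinct(Month) >= 3) %>%
  ungroup()
```

Comma-separated conditions inside `filter()` are combined with AND, so a group survives only if it meets both thresholds.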