Check whether a column 1 value appeared earlier in the dataset with a different column 2 value

Time: 2016-12-03 15:32:37

Tags: r

I have the following dataset:

df <- data.frame(ID = c("m1","m2","m3","m4","m5","m6","m2","m3","m5","m6","m1","m4","m5"),
                 Year = c(1,1,1,1,1,1,2,2,2,2,3,3,3))

and I want to check, for each row, whether its ID also appeared in the previous year. Right now I have code that seems to work:

df$Check <- apply(df, 1, function(x) x["ID"] %in% df[df$Year == (as.numeric(x["Year"]) - 1), "ID"])

But my dataset has 3 million rows, and this takes far too long to run. Is there a better alternative?

3 answers:

Answer 0: (score: 4)

Try:

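library(dplyr)
# Split the IDs by year; this relies on df being sorted by Year with consecutive years
dfs <- split(df$ID, df$Year)
# Compare each year's IDs with the previous year's IDs (lag shifts the list by one)
df$check <- unlist(mapply(`%in%`, dfs, lag(dfs)))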

Answer 1: (score: 1)

k = length(unique(df$Year))        # how many years in the data
q = unique(df$Year)                # which are the years present

func <- function(x){  
  kk = df$ID[df$Year == q[x]]      # get the current year's ID which are present
  kk %in% df$ID[df$Year == q[x-1]] # compare that to the previous year's ID
}

x <- sum(df$Year==unique(df$Year)[1]) #to know how many FALSE to be added initially
df$check <- c(rep(FALSE, x),unlist(lapply(2:k, func)))

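Answer 2: (score: 1)

You can use ave: for each ID, calculate the difference between the current Year and the previous Year (diff). Pad with a leading zero. Check whether the result is 1 to create a logical vector: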

df$check2 <- with(df, ave(Year, ID, FUN = function(x) c(0, diff(x))) == 1)
#    ID Year check check2
# 1  m1    1 FALSE  FALSE
# 2  m2    1 FALSE  FALSE
# 3  m3    1 FALSE  FALSE
# 4  m4    1 FALSE  FALSE
# 5  m5    1 FALSE  FALSE
# 6  m6    1 FALSE  FALSE
# 7  m2    2  TRUE   TRUE
# 8  m3    2  TRUE   TRUE
# 9  m5    2  TRUE   TRUE
# 10 m6    2  TRUE   TRUE
# 11 m1    3 FALSE  FALSE
# 12 m4    3 FALSE  FALSE
# 13 m5    3  TRUE   TRUE
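
To make the pad-and-diff step concrete, here is a minimal sketch of what happens inside a single ID group. The helper name step_check is hypothetical and not part of the answer; it assumes the years within an ID are already sorted in increasing order.

```r
# Hypothetical helper: the per-ID computation that ave() applies to each group.
# Assumes `years` is sorted in increasing order.
step_check <- function(years) c(0, diff(years)) == 1

step_check(c(1, 2, 3))  # m5 (years 1, 2, 3): FALSE TRUE TRUE
step_check(c(1, 3))     # m1 (years 1 and 3): FALSE FALSE
```

The leading zero keeps the result the same length as the input and guarantees that the first year of every ID is FALSE.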

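Similarly with data.table:

For each ID (by = ID), create a new variable Check2: check whether the difference between the current Year and the previous Year in the data is 1 (diff(Year) == 1), i.e. whether the preceding year in the data is the immediately previous year.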

library(data.table)
setDT(df)[ , Check2 := c(FALSE, diff(Year) == 1), by = ID]

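Edit following a comment from the OP. If there are "multiple entries within the same ID", perform the calculation on the data with duplicate rows removed (unique). Then join the result back to the original data.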

df2 <- unique(df)
df2[ , Check2 := c(FALSE, diff(Year) == 1), by = ID]
df[df2, on = c("ID", "Year")]
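
The effect of duplicates, and the deduplicate-then-look-up idea, can be sketched in base R without data.table. The toy data frame dup, the key helper, and the "_" separator are illustrative assumptions, not taken from the answer; the sketch also assumes rows are sorted by Year within each ID.

```r
# Toy data with a repeated (ID, Year) row; the values are illustrative only.
dup <- data.frame(ID   = c("m2", "m2", "m2", "m1", "m1"),
                  Year = c(1, 2, 2, 1, 3))

# Naive per-ID diff: the repeated row gets a difference of 0 and is flagged FALSE
naive <- with(dup, ave(Year, ID, FUN = function(x) c(0, diff(x))) == 1)

# Deduplicate, compute the check once per (ID, Year) pair ...
u <- unique(dup)
u$check <- with(u, ave(Year, ID, FUN = function(x) c(0, diff(x))) == 1)

# ... then look the result up again for every original row
key <- function(d) paste(d$ID, d$Year, sep = "_")  # hypothetical helper
dup$check <- u$check[match(key(dup), key(u))]
```

Here naive marks the second m2 row for year 2 as FALSE, while dup$check marks both year-2 rows of m2 as TRUE, which is the behaviour the edit above is after.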