Question

编辑

感谢您的答复。但是，我仍然无法解决我的问题，因为我的数据集包含700,000个观测值，并且以下所有方法都会导致错误，或者只是继续运行数小时而没有完成操作（我可以说Rstudio R Session正在运行，并且消耗了我大量的RAM，但根本没有任何作用。

您可以想象，将数据集分割成小块不是一个选择，因为这样做会破坏练习的目的：我需要仔细检查每个先前的观察结果才能得出所需的结果。

有什么想法吗？我暂时不会回答这个问题，但是如果你们认为我应该发布一个新问题，我会（老实说，我不了解有关这些事情的礼节，因此可以提出建议）。

原始帖子

正如标题所示，我正在寻找一个虚拟变量，该变量以分组观察中的重复为条件。

请考虑以下数据框：

   id name year
1   c   af 2000
2   c   el 2000
3   c   in 2000
4   c   ud 2000
5   d   ot 2000
6   d   an 2000
7   d   el 2000
8   d   un 2000
9   f   yt 2002
10  f   ip 2002
11  f   ot 2002
12  f   el 2002
13  g   yt 2003
14  g   af 2003
15  g   ol 2003
16  g   in 2003
17  h   in 2003
18  h   eg 2003
19  h   yt 2003
20  h   af 2003
21  j   ot 2004
22  j   el 2004
23  j   ip 2004
24  j   yt 2004

我正在寻找一个函数，该函数将允许我按id对数据进行分组，如果一个id至少包含个先前的id中的三个名称，则返回值“ 1”。上一个ID的意思是，上一个ID的年份要小于当前ID的年份。

所需的输出应如下所示：

id name year dummy 1 c af 2000 0 2 c el 2000 0 3 c in 2000 0 4 c ud 2000 0 5 d ot 2000 0 6 d an 2000 0 7 d el 2000 0 8 d un 2000 0 9 f yt 2002 0 10 f ip 2002 0 11 f ot 2002 0 12 f el 2002 0 13 g yt 2003 0 14 g af 2003 0 15 g ol 2003 0 16 g in 2003 0 17 h in 2003 0 18 h eg 2003 0 19 h yt 2003 0 20 h af 2003 0 21 j ot 2004 1 22 j el 2004 1 23 j ip 2004 1 24 j yt 2004 1

id =“ j”具有值dummy =“ 1”，因为在id =“ f”中至少出现三个名称，分别是“ yt”，“ ip”和“ ot”。在这种情况下，还会出现第四个名称“ el”，但这不会影响结果。

请注意，即使id =“ g”中也出现了三个名称，id =“ h”的值仍为dummy =“ 0”。这是因为这两次发生都发生在2003年，因此不能满足单独年份的条件。

数据：

DF = structure(list(id = c("c", "c", "c", "c", "d", "d", "d", "d", "f", "f", "f", "f", "g", "g", "g", "g", "h", "h", "h", "h", "j", "j", "j", "j"), name = c("af", "el", "in", "ud", "ot", "an", "el", "un", "yt", "ip", "ot", "el", "yt", "af", "ol", "in", "in", "eg", "yt", "af", "ot", "el", "ip", "yt"), year = c(2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2000L, 2002L, 2002L, 2002L, 2002L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2003L, 2004L, 2004L, 2004L, 2004L), dummy = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L)), .Names = c("id", "name", "year", "dummy"), row.names = c(NA, -24L), class = "data.frame")

Answer 1

基于R的方法：

n <- split(DF$name, DF$id)
m1 <- sapply(n, function(s1) sapply(n, function(s2) sum(s1 %in% s2) ))
diag(m1) <- 0
m1[upper.tri(m1)] <- 0
r1 <- rownames(m1)[!!rowSums(m1 > 2)]

y <- sapply(split(DF$year, DF$id), unique)
m2 <- sapply(y, function(s1) sapply(y, function(s2) +(s1 == s2) ))
diag(m2) <- 0
m2[upper.tri(m2)] <- 0
r2 <- rownames(m2)[!rowSums(m2)]

DF$dummy2 <- as.integer(DF$id %in% intersect(r1,r2))

给出：

> DF
   id name year dummy dummy2
1   c   af 2000     0      0
2   c   el 2000     0      0
3   c   in 2000     0      0
4   c   ud 2000     0      0
5   d   ot 2000     0      0
6   d   an 2000     0      0
7   d   el 2000     0      0
8   d   un 2000     0      0
9   f   yt 2002     0      0
10  f   ip 2002     0      0
11  f   ot 2002     0      0
12  f   el 2002     0      0
13  g   yt 2003     0      0
14  g   af 2003     0      0
15  g   ol 2003     0      0
16  g   in 2003     0      0
17  h   in 2003     0      0
18  h   eg 2003     0      0
19  h   yt 2003     0      0
20  h   af 2003     0      0
21  j   ot 2004     1      1
22  j   el 2004     1      1
23  j   ip 2004     1      1
24  j   yt 2004     1      1

Answer 2

类似于Jaap和see24，但使用length(intersect(x,y)) / ==和%in% / rowSums代替sum / library(data.table) setDT(DF) idDT = unique(DF[, .(id, year)]) setkey(idDT, id) s = split(DF$name, DF$id) # identify pairs of ids, where id1 appears before id2 in the table pairsDT = idDT[, CJ(id1 = id, id2 = id)[id1 < id2]] # record whether it's strictly before pairsDT[, earlier := idDT[id1, x.year] < idDT[id2, x.year]] # if it's strictly before, compare number of matching elements pairsDT[earlier == TRUE, matched := mapply(function(x, y) length(intersect(x, y)), s[id1], s[id2]) >= 3 ] dum_ids = pairsDT[matched == TRUE, unique(id2)]：

idDT[, dum := id %in% dum_ids]
DF[, dum := id %in% dum_ids]

然后您可以将标准记录在idDT（更有意义的地方）或DF中：

combn

在基数R中，可以使用zoo <- matrix (c("zoo1", "bee", "honeybee", "alligator", "tiger", 0, "zoo2", "tiger", "honeybee", "lion", 0, 0, "zoo3", "alligator", "alligator", "alligator", "bee", "wasp", "zoo4", "wasp", "honeybee", 0, 0, 0, "zoo5", "alligator", "lion", "tiger", "bear", 0), nrow = 5, ncol = 6, byrow = TRUE) colnames(zoo) <- c("zoo", "A", "B", "C","D","E") zoo <- data.frame(zoo)完成类似的操作。与仅将数据存储在图形中（例如，使用igraph包）并从那里开始工作相比，我认为这仍然效率很低。

Answer 3

这是我使用dplyr和tidyr的解决方案，以及一个使用3个或更多匹配名称标识ID的函数：

library(dplyr)
library(tidyr)

test <- function(x){
  out2 <- sapply(1:length(x), function(j){
    out <- sapply(1:j, function(i){
      sum(x[[j]] %in% x[[i]])
    })
    out[j]<-NA
    which(out >= 3) %>% min() %>% {ifelse(is.infinite(.),NA,.)}

  })
  out2
}

DF2 <-  DF %>% group_by(id, year) %>% 
  summarise(names = list(name)) %>% ungroup() %>% 
  mutate(dummy2 = test(names)) %>% 
  mutate(year_mch = year[dummy2], 
         dummy = year_mch < year) %>% 
  unnest() 
DF2

由于无穷大，它会给出一系列警告，但这不会影响结果。

Answer 4

因此，此解决方案是纯R的基础。我曾经读过一篇文章，声称使用. <-是%>%的有效替代品。这是我第一次尝试。我想我喜欢

. <- DF[c('id', 'name', 'year')]
. <- merge(., ., by = 'name')
. <- .[.["id.x"] != .["id.y"] & .["year.x"] < .["year.y"],]
. <- .[c('id.x', 'id.y', 'year.x', 'year.y', "name")] 
.$n <- 1
. <- aggregate(n ~ id.x + id.y, data = ., sum) 
. <- .[.['n'] >= 3, 'id.y']

DF$dummy2 <- . == DF$id

Answer 5

在OP对速度和内存问题进行编辑之后，如何使用Rcpp方法：

#create a integer column out of id for non-equi join later
setDT(DF)[, nid := rleid(id)]

#convert name into an integer code
DF[DF[,.(name=unique(name))][, IntCode := .I], iname := IntCode, on=.(name)]

library(inline)
library(Rcpp)
cppFunction('
NumericVector hasOccur(NumericVector nid, NumericVector year, List iname) {
    List namelist(iname);
    int sz = namelist.size(), i, j, m, n, nPrev, nCurr, count;
    NumericVector res(sz);

    for(i=0; i<sz; i++) {
        for(j=0; j<i; j++) {
            if (nid[j] < nid[i] && year[j] < year[i]) {
                SEXP prevList = namelist[j];
                SEXP currList = namelist[i];

                NumericVector cl(currList);
                NumericVector pl(prevList);
                nPrev = pl.size();
                nCurr = cl.size();

                res[i] = 0;
                count = 0;
                for(m=0; m<nCurr; m++) {
                    for (n=0; n<nPrev; n++) {
                        if (cl[m] == pl[n]) {
                            count++;
                            break;
                        }
                    }
                }

                if (count >= 3) {
                    res[i] = 1;
                    break;
                }
            }
        }
    }

    return(res);
}')

d <- DF[, .(.(nm=iname)), by=.(nid, year)]
DF[d[, dummy := hasOccur(d$nid, d$year, d$V1)], dummy := dummy, on=.(nid, year)]

HTH。

另一种可能的data.table方法：

#create a integer column out of id for non-equi join later
setDT(DF)[, nid := rleid(id)]

          #self non-equi join
check3 <- DF[DF, .(x.id, x.name, x.year, x.nid, i.id, i.name, i.year, i.nid), on=.(nid<nid, year<year, name=name)][,
    #count the number of occurrence in previous id and year
    uniqueN(x.name, na.rm=TRUE), by=.(i.id, i.year, x.id, x.year)][,
        #check if more than 3
        any(V1 >= 3L), by=.(i.id, i.year)]

#update join to add result to original DF
DF[check3, dummy := as.integer(V1), on=c("id"="i.id", "year"="i.year")]

Answer 6

我将以任何借口将数据问题转换为图形问题，因此请坦率地提出。这是一个igraph解决方案。本质上，它将数据转换为有向树。仅将所有节点与层次结构中较高级别的节点进行比较。因此， C 是树的顶部，不与其他任何事物进行比较，而 J 是终端，并且与链中位于其上方的所有节点进行比较。要提取层次结构中较高的所有节点，您需要做的就是使用（深度优先搜索）dfs函数

library(tidyverse)
library(igraph)

#node list containing data specific to the group
nodelist <- DF %>%
  group_by(id, year) %>%
  nest()

#edge list containing connections. A group directly before a node points toward a future group
edgelist <- data.frame(
  from = nodelist$id %>% .[1:(length(.)-1)],
  to = nodelist$id %>% .[2:length(.)]
)

#create the data frame
g <- graph_from_data_frame(edgelist, T, nodelist)

#let's iterate through the nodes
dummy <- map_lgl(V(g)$name, function(vertex){

  #depth first search to pull out all nodes higher up on the tree 
  full_path <- dfs(g, vertex, 'in', unreachable = F) %>%
    .$order %>% 
    .[!is.na(.)] 

  #if there is no node higher up, then we're done
  if(length(full_path) <= 1) return(F)

  #The first node returned is the node we're iterating over
  this_vertex <- full_path[1]
  other_vertices <- full_path[full_path != this_vertex]

  #this is the logic for the dummy variable
  similar_groups <- map_lgl(other_vertices, function(other_vertex){
    (sum(this_vertex$data[[1]]$name %in% other_vertex$data$name) >= 3) & 
      (this_vertex$year[[1]] != other_vertex$year)
  })

  return(T %in% similar_groups)
})

V(g)$dummy2 <- dummy

as_data_frame(g, 'vertices') %>%
  unnest()

   name year dummy2 name1 dummy
1     c 2000  FALSE    af     0
2     c 2000  FALSE    el     0
3     c 2000  FALSE    in     0
4     c 2000  FALSE    ud     0
5     d 2000  FALSE    ot     0
6     d 2000  FALSE    an     0
7     d 2000  FALSE    el     0
8     d 2000  FALSE    un     0
9     f 2002  FALSE    yt     0
10    f 2002  FALSE    ip     0
11    f 2002  FALSE    ot     0
12    f 2002  FALSE    el     0
13    g 2003  FALSE    yt     0
14    g 2003  FALSE    af     0
15    g 2003  FALSE    ol     0
16    g 2003  FALSE    in     0
17    h 2003  FALSE    in     0
18    h 2003  FALSE    eg     0
19    h 2003  FALSE    yt     0
20    h 2003  FALSE    af     0
21    j 2004   TRUE    ot     1
22    j 2004   TRUE    el     1
23    j 2004   TRUE    ip     1
24    j 2004   TRUE    yt     1

虚拟变量取决于分组观察中的重复次数

编辑

原始帖子

6 个答案: