Vectorized way to find neighbors of data frame rows

Time: 2018-09-12 09:14:40

Tags: r dataframe vectorization

I have two data frames. Here is a small sample of each:

df1 <- data.frame(a1= c(3,4), a2 = c(8, 8), a3 = c(4, 18), a4 = c(9,9), a5 = c(17, 30))

df2 <- data.frame(a1 = c(2,2,2,3,3,3,4,4,4), a2 = c(7,7,7,7,7,7,7,7,7), 
                 a3 = c(4,4,4,4,4,4,4,4,4), a4 = c(10,10,10, 10, 10, 10, 10,10,10), 
                 a5 = c(15,16,17, 15, 16, 17, 15, 16, 17))

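For each row of df1, I want to check whether it has a "neighbor" in df2. By neighbor I mean an observation that differs by at most 1 in every column (in absolute value). For example, row 2 of df2 is a neighbor of row 1 of df1.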

This is how I currently do it:

sweep(as.matrix(df2), 2, as.matrix(df1[1,]), "-")

That handles row 1 of df1, and I have to repeat it for every row of df1. Note that df2 and df1 have different numbers of rows.
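To turn those differences into a yes/no answer for a single row, one possibility is to check whether any row of the swept matrix stays within 1 in every column. A minimal sketch on the sample data (the names `d` and `is_neighbor` are introduced here for illustration only):

```r
df1 <- data.frame(a1 = c(3, 4), a2 = c(8, 8), a3 = c(4, 18),
                  a4 = c(9, 9), a5 = c(17, 30))
df2 <- data.frame(a1 = c(2, 2, 2, 3, 3, 3, 4, 4, 4), a2 = rep(7, 9),
                  a3 = rep(4, 9), a4 = rep(10, 9),
                  a5 = rep(c(15, 16, 17), 3))

# Differences between every row of df2 and row 1 of df1
d <- sweep(as.matrix(df2), 2, unlist(df1[1, ]), "-")

# A df2 row is a neighbor when no column differs by more than 1
is_neighbor <- rowSums(abs(d) > 1) == 0

which(is_neighbor)  # indices of the df2 rows that are neighbors of df1 row 1
any(is_neighbor)    # TRUE when df1 row 1 has at least one neighbor
```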

What I really want, though, is to avoid doing this row by row, because my data frames have many rows. Is there a vectorized way?

4 answers:

Answer 0 (score: 2)

You can split the rows of df1 into a list and then use lapply to apply the operation to every row in one call:

my_list=lapply(as.list(data.frame(t(df1))),function(x) sweep(as.matrix(df2), 2, as.matrix(x), "-"))

Each element of my_list holds the result for one row of df1:

my_list[[1]]
      a1 a2 a3 a4 a5
 [1,] -1 -1  0  1 -2
 [2,] -1 -1  0  1 -1
 [3,] -1 -1  0  1  0
 [4,]  0 -1  0  1 -2
 [5,]  0 -1  0  1 -1
 [6,]  0 -1  0  1  0
 [7,]  1 -1  0  1 -2
 [8,]  1 -1  0  1 -1
 [9,]  1 -1  0  1  0
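
If only a yes/no answer per row of df1 is needed, each difference matrix can be collapsed further. The sketch below rebuilds the sample data so that it runs on its own; `has_neighbor` is a name introduced here, not part of the original answer.

```r
df1 <- data.frame(a1 = c(3, 4), a2 = c(8, 8), a3 = c(4, 18),
                  a4 = c(9, 9), a5 = c(17, 30))
df2 <- data.frame(a1 = c(2, 2, 2, 3, 3, 3, 4, 4, 4), a2 = rep(7, 9),
                  a3 = rep(4, 9), a4 = rep(10, 9),
                  a5 = rep(c(15, 16, 17), 3))

# One difference matrix per row of df1, as in the answer
my_list <- lapply(as.list(data.frame(t(df1))),
                  function(x) sweep(as.matrix(df2), 2, x, "-"))

# TRUE when at least one row of the difference matrix is within 1 in every column
has_neighbor <- vapply(my_list,
                       function(d) any(rowSums(abs(d) > 1) == 0),
                       logical(1))
has_neighbor
```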

Additionally, you can use parallel::mclapply, which can be faster than plain lapply.
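
A sketch of the parallel variant follows. It assumes a Unix-like system, because `parallel::mclapply` relies on forking and only supports `mc.cores = 1` on Windows. The core count of 2 is an arbitrary choice for illustration.

```r
library(parallel)

df1 <- data.frame(a1 = c(3, 4), a2 = c(8, 8), a3 = c(4, 18),
                  a4 = c(9, 9), a5 = c(17, 30))
df2 <- data.frame(a1 = c(2, 2, 2, 3, 3, 3, 4, 4, 4), a2 = rep(7, 9),
                  a3 = rep(4, 9), a4 = rep(10, 9),
                  a5 = rep(c(15, 16, 17), 3))

# Fall back to a single core where forking is unavailable (Windows)
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Same computation as the lapply version, spread over n_cores processes
my_list <- mclapply(as.list(data.frame(t(df1))),
                    function(x) sweep(as.matrix(df2), 2, x, "-"),
                    mc.cores = n_cores)
```

For inputs as small as the sample, the cost of starting worker processes will probably outweigh any gain, so it is worth timing both versions on the real data.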

Answer 1 (score: 2)

Here is a data.table approach using a non-equi join:

library(data.table)
cols <- names(df2)

#convert into data.table and add row index for clarity
setDT(df1)[, rn1 := .I]
setDT(df2)[, rn2 := .I]

#create a lower (-1) and upper (+1) bound on each column
bandsNames <- paste0(rep(cols, each=2L), "_", rep(c("lower", "upper"), length(cols)))
df2Bands <- df2[, 
    {
        ans <- do.call(cbind, lapply(.SD, function(x) outer(x, c(-1L, 1L), `+`)))
        setnames(data.table(ans), bandsNames)
    }, by=.(rn2)]

#create the non-equi join conditions
lowerLimits <- paste0(cols, "_lower<=", cols)
upperLimits <- paste0(cols, "_upper>=", cols)

#perform the non-equi join on lower and upper limits and return the count
#`:=` add a new column in df1 by reference
df1[, Count := 
        df2Bands[df1, .N, by=.EACHI, on=c(lowerLimits, upperLimits)]$N
    ]

Output:

   a1 a2 a3 a4 a5 rn1 Count
1:  3  8  4  9 17   1     6
2:  4  8 18  9 30   2     0

If you also want to find the matching rows:

df2Bands[df1, .(rn1=i.rn1, rn2=x.rn2), by=.EACHI, on=c(lowerLimits, upperLimits)][, 
    -(1L:length(bandsNames))]

Matching rows:

   rn1 rn2
1:   1   2
2:   1   3
3:   1   5
4:   1   6
5:   1   8
6:   1   9
7:   2  NA

Answer 2 (score: 2)

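I don't think there is a good way to fully vectorize this problem (the apply family is really just loops wearing a bow tie). You can, however, work column by column instead of row by row. If you need a further improvement, you can shrink the problem after each column by dropping rows that can no longer match (this causes indexing headaches, but is reasonably doable).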

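My attempt below uses a for loop (which could be replaced with lapply). It returns a truth matrix with one row per row of df1 and one column per row of df2. A 1 in row i, column j means that row i of df1 and row j of df2 are neighbors.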

col_comp = function(x,y)
{
    lx = length(x)
    ly = length(y)
    return(abs(rep(x,ly) - rep(y,each = lx) )<=1)
}

full_comp=function(df1,df2)
{
    rows1 = seq_len(nrow(df1))
    rows2 = seq_len(nrow(df2))
    M = matrix(1L, nrow=nrow(df1),ncol=nrow(df2))
    for(i in seq_len(ncol(df1)) )
    {
        matches = col_comp(df1[rows1,i],df2[rows2,i])

        M = M*matches
    }
    return(M)
}
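
For reference, the same column-by-column idea can also be written compactly with `outer`, and the resulting matrix converted into (df1 row, df2 row) pairs with `which(..., arr.ind = TRUE)`. This is a sketch on the question's sample data, not the code above; `neighbor_matrix` and `neighbor_pairs` are names introduced here.

```r
df1 <- data.frame(a1 = c(3, 4), a2 = c(8, 8), a3 = c(4, 18),
                  a4 = c(9, 9), a5 = c(17, 30))
df2 <- data.frame(a1 = c(2, 2, 2, 3, 3, 3, 4, 4, 4), a2 = rep(7, 9),
                  a3 = rep(4, 9), a4 = rep(10, 9),
                  a5 = rep(c(15, 16, 17), 3))

# One logical nrow(df1) x nrow(df2) matrix per column, combined with AND
neighbor_matrix <- Reduce(`&`,
                          Map(function(x, y) abs(outer(x, y, "-")) <= 1,
                              df1, df2))

# Indices of the TRUE cells: "row" is the df1 row, "col" is the df2 row
neighbor_pairs <- which(neighbor_matrix, arr.ind = TRUE)
neighbor_pairs

# Which df1 rows have at least one neighbor
rowSums(neighbor_matrix) > 0
```

Note that this builds a full nrow(df1) by nrow(df2) matrix for every column, so memory use grows with the product of the two row counts.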

Answer 3 (score: 2)

A solution using library(sqldf):

library(sqldf)

sqldf( "select df2.*, df1.rowid as df1_idx
        from df2 left join df1
           on df2.a1 between df1.a1-1 and df1.a1+1
          and df2.a2 between df1.a2-1 and df1.a2+1
          and df2.a3 between df1.a3-1 and df1.a3+1
          and df2.a4 between df1.a4-1 and df1.a4+1
          and df2.a5 between df1.a5-1 and df1.a5+1")

  a1 a2 a3 a4 a5 df1_idx
1  2  7  4 10 15     NA
2  2  7  4 10 16      1
3  2  7  4 10 17      1
4  3  7  4 10 15     NA
5  3  7  4 10 16      1
6  3  7  4 10 17      1
7  4  7  4 10 15     NA
8  4  7  4 10 16      1
9  4  7  4 10 17      1

Edited to show a solution for an arbitrary number of columns:

library(sqldf)

cnames <- colnames(df1)

# main body of your sql
sql_main <- "select df2.*, df1.rowid as df1_idx
            from df2 left join df1
            on 1=1"

# join conditions (which will be added to above)
join_conditions <- 
  paste0( ' and df2.', cnames, ' BETWEEN df1.', cnames, '-1',
                                   ' AND df1.', cnames, '+1',
          collapse = '')

sql <- paste(sql_main, join_conditions)

sqldf(sql)
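
To see what the generated join conditions look like before running the query, the string can be inspected directly. A sketch with two hypothetical column names, using base R only:

```r
cnames <- c("a1", "a2")  # hypothetical column names, for illustration only

# Same construction as above: one BETWEEN clause per column, glued together
join_conditions <-
  paste0(" and df2.", cnames, " BETWEEN df1.", cnames, "-1",
         " AND df1.", cnames, "+1",
         collapse = "")

cat(join_conditions, "\n")
```

The column names are pasted into the SQL unquoted, so this presumably only works as-is for names that are valid SQL identifiers.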