Finding similar rows by performing a conditional join with sqldf

Time: 2016-05-17 08:14:36

Tags: r sqldf

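Say I have a data.table (it could also be a data.frame, it does not matter to me) with numeric columns a, b, c, d and e. Each row of the table represents an article, and a to e are numeric features of that article.

Based on columns a, b and c, I want to know which articles are similar to each other. I define "similar" as allowing a, b and c to each vary by at most +/- 1. That is, article x is similar to article y if none of a, b and c differs by more than 1 between them. The values of d and e do not matter and may differ greatly.

I have tried several approaches but did not get the desired result. What I want is a result table containing only those rows that are similar to at least one other row. In addition, duplicates must be excluded.

In particular, I would like to know whether this is possible with the sqldf library. My idea was to somehow join the table with itself under the given conditions, but I cannot put it together correctly. Any ideas (not necessarily using sqldf)?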

2 answers:

Answer 0 (score: 1)

Suppose our input data frame is the built-in 11x8 anscombe data frame. Its first three column names are x1, x2 and x3. Then here are some solutions.

1) sqldf This returns the pairs of row numbers of similar rows:

library(sqldf)

ans <- anscombe
ans$id <- 1:nrow(ans)

sqldf("select a.id, b.id 
       from ans a 
       join ans b on abs(a.x1 - b.x1) <= 1 and 
                     abs(a.x2 - b.x2) <= 1 and 
                     abs(a.x3 - b.x3) <= 1")

Add the further condition "and a.id < b.id" to the on clause if rows should not be paired with themselves and the reverse of each pair should be excluded, or add "and not a.id = b.id" to exclude only self pairs.
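As a sketch of how those extra conditions might look in full (this is not part of the original answer): the first query lists each similar pair once, and the second returns the rows themselves, which seems closer to what the question asks for. The aliases id1 and id2 and the names pair_ids and similar_rows are made up for illustration.

```r
library(sqldf)

# Same setup as above: anscombe plus a row id column
ans <- anscombe
ans$id <- 1:nrow(ans)

# Each similar pair once, no self pairs (id1/id2 are illustrative aliases)
pair_ids <- sqldf("select a.id as id1, b.id as id2
                   from ans a
                   join ans b on abs(a.x1 - b.x1) <= 1 and
                                 abs(a.x2 - b.x2) <= 1 and
                                 abs(a.x3 - b.x3) <= 1 and
                                 a.id < b.id")

# Rows that are similar to at least one OTHER row, each listed once
similar_rows <- sqldf("select distinct a.*
                       from ans a
                       join ans b on abs(a.x1 - b.x1) <= 1 and
                                     abs(a.x2 - b.x2) <= 1 and
                                     abs(a.x3 - b.x3) <= 1 and
                                     a.id <> b.id
                       order by a.id")
```

The distinct keyword is what removes the duplicates that would otherwise appear when a row has more than one similar partner.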

2) dist This returns a matrix m whose i,j-th element is 1 if rows i and j are similar and 0 if not based on columns 1, 2 and 3.

# matrix of pairs (1 = similar, 0 = not)
m <- (as.matrix(dist(anscombe[1:3], method = "maximum")) <= 1) + 0

giving:

   1 2 3 4 5 6 7 8 9 10 11
1  1 0 0 1 1 0 0 0 0  0  0
2  0 1 0 1 0 0 0 0 0  1  0
3  0 0 1 0 0 1 0 0 1  0  0
4  1 1 0 1 0 0 0 0 0  0  0
5  1 0 0 0 1 0 0 0 1  0  0
6  0 0 1 0 0 1 0 0 0  0  0
7  0 0 0 0 0 0 1 0 0  1  1
8  0 0 0 0 0 0 0 1 0  0  1
9  0 0 1 0 1 0 0 0 1  0  0
10 0 1 0 0 0 0 1 0 0  1  0
11 0 0 0 0 0 0 1 1 0  0  1

We could add m[lower.tri(m, diag = TRUE)] <- 0 to exclude self pairs and the reverse of each pair if desired or diag(m) <- 0 to just exclude self pairs.
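If the goal is the rows themselves rather than the pairs, one possible follow-up (a sketch, not part of the original answer) is to zero the diagonal on a copy of m and keep every row that still has at least one 1. The names m0, has_match and similar_rows are illustrative only.

```r
# Similarity matrix as above (1 = similar, 0 = not), based on columns 1 to 3
m <- (as.matrix(dist(anscombe[1:3], method = "maximum")) <= 1) + 0

# Work on a copy so that m itself stays unchanged;
# drop self pairs so a row does not count as its own match
m0 <- m
diag(m0) <- 0

# Rows with at least one similar other row, each appearing exactly once
has_match <- rowSums(m0) > 0
similar_rows <- anscombe[has_match, ]
```

Because the selection is done with a logical index, no row can appear twice, however many partners it has.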

We can create a data frame of similar row number pairs like this. To keep the output short we have excluded self pairs and the reverse of each pair.

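# two-column data.frame of pairs excluding self pairs and reverses
# (Var1 and Var2 are factors, so compare their integer codes;
#  c() on a factor returns a factor since R 4.1.0)
subset(as.data.frame.table(m), as.integer(Var1) < as.integer(Var2) & Freq == 1)[1:2]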

giving:

    Var1 Var2
34     1    4
35     2    4
45     1    5
58     3    6
91     3    9
93     5    9
101    2   10
106    7   10
117    7   11
118    8   11

Here is a network graph of the above. Note that the answer continues after the graph:

# network graph
library(igraph)
g <- graph_from_adjacency_matrix(m)  # replaces the deprecated graph.adjacency()
plot(g)

screenshot

# raster plot
library(ggplot2)
ggplot(as.data.frame.table(m), aes(Var1, Var2, fill = factor(Freq))) + 
       geom_raster()

screenshot

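Answer 1 (score: 0)

I am quite new to R, so don't expect too much.

What if you took the values of one column (basically a vector) and built a matrix of the distances between every two values? From that you can find the combinations that differ from each other by less than 1. This gives you the matching pairs for (a). Repeat the step for (b) and (c), and keep the pairs that appear in all three.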
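A rough sketch of that idea, under some assumptions: anscombe stands in for the asker's table, with x1, x2 and x3 playing the roles of a, b and c, and the threshold is the question's "at most 1" rather than "less than 1". The helper close_pairs and the names sim and pair_idx are made up for illustration.

```r
# Illustration only: first three columns of anscombe stand in for a, b, c
df <- anscombe[1:3]

# For one numeric vector: TRUE where two values differ by at most 1
close_pairs <- function(x) abs(outer(x, x, "-")) <= 1

# One logical matrix per column, combined with elementwise AND,
# so a pair survives only if it matches in every column
sim <- Reduce(`&`, lapply(df, close_pairs))
diag(sim) <- FALSE  # ignore self pairs

# Row-number pairs, each listed once (upper triangle only)
pair_idx <- which(sim & upper.tri(sim), arr.ind = TRUE)
```

This should agree with the dist(..., method = "maximum") approach in the other answer, since the maximum distance is at most 1 exactly when every column differs by at most 1.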

Alternatively, this could also be done as a cube.

Just an idea, offered as a hint.