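Say I have a data.table (it could also be a data.frame, it doesn't matter to me) with numeric columns a, b, c, d and e. Each row of the table represents an article, and a to e are numeric features of the articles.
Based on columns a, b and c, I want to find out which articles are similar to each other. I define "similar" as allowing a, b and c to each vary by at most +/- 1. That is, article x is similar to article y if their values of a, b and c each differ by no more than 1. The values of d and e don't matter for this and may differ greatly.
I have tried several approaches but haven't gotten the desired result. What I want is a result table containing only those rows that are similar to at least one other row. In addition, duplicates must be excluded.
In particular, I would like to know whether this is possible with the sqldf library. My idea is to somehow join the table with itself under the given conditions, but I can't put it together correctly. Any ideas (not necessarily using sqldf)?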
Answer 0 (score: 1)
Suppose our input data frame is the built-in 11x8 anscombe data frame. Its first three column names are x1, x2 and x3. Then here are some solutions.
1) sqldf This returns the pairs of row numbers of similar rows:
library(sqldf)
ans <- anscombe
ans$id <- 1:nrow(ans)
sqldf("select a.id, b.id
from ans a
join ans b on abs(a.x1 - b.x1) <= 1 and
abs(a.x2 - b.x2) <= 1 and
abs(a.x3 - b.x3) <= 1")
Add the further condition "and a.id < b.id" to exclude both self pairs and the reverse of each pair, or add "and not a.id = b.id" to exclude only self pairs.
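The query above returns pairs of ids, while the question asks for the similar rows themselves. A minimal sketch of one way to get them, assuming "duplicates excluded" means each row should appear only once however many partners it has (the name similar_rows is made up for this illustration):

```r
library(sqldf)

ans <- anscombe
ans$id <- 1:nrow(ans)

# Rows of ans that have at least one *other* row within +/- 1 on x1, x2 and x3.
# "distinct" keeps a row once even if it matches several partners, and
# "a.id <> b.id" stops a row from qualifying just by matching itself.
similar_rows <- sqldf("select distinct a.*
  from ans a
  join ans b on abs(a.x1 - b.x1) <= 1 and
                abs(a.x2 - b.x2) <= 1 and
                abs(a.x3 - b.x3) <= 1 and
                a.id <> b.id
  order by a.id")
similar_rows
```

With anscombe every row has such a partner, so all 11 rows come back; on real data, rows without any similar partner would be dropped.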
2) dist This returns a matrix m whose i,j-th element is 1 if rows i and j are similar and 0 if not, based on columns 1, 2 and 3.
# matrix of pairs (1 = similar, 0 = not)
m <- (as.matrix(dist(anscombe[1:3], method = "maximum")) <= 1) + 0
giving:
   1 2 3 4 5 6 7 8 9 10 11
1  1 0 0 1 1 0 0 0 0  0  0
2  0 1 0 1 0 0 0 0 0  1  0
3  0 0 1 0 0 1 0 0 1  0  0
4  1 1 0 1 0 0 0 0 0  0  0
5  1 0 0 0 1 0 0 0 1  0  0
6  0 0 1 0 0 1 0 0 0  0  0
7  0 0 0 0 0 0 1 0 0  1  1
8  0 0 0 0 0 0 0 1 0  0  1
9  0 0 1 0 1 0 0 0 1  0  0
10 0 1 0 0 0 0 1 0 0  1  0
11 0 0 0 0 0 0 1 1 0  0  1
We could add m[lower.tri(m, diag = TRUE)] <- 0 to exclude self pairs and the reverse of each pair if desired, or diag(m) <- 0 to exclude only self pairs.
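Once self pairs are zeroed out, the row sums of the matrix say how many other rows each row is similar to. A minimal sketch of using that to get the table the question asks for, working on a copy (the names m0 and similar_rows are made up here) so that m itself is left unchanged:

```r
# 1 where two rows differ by at most 1 in each of columns 1 to 3, else 0
m0 <- (as.matrix(dist(anscombe[1:3], method = "maximum")) <= 1) + 0
diag(m0) <- 0  # a row being similar to itself does not count

# keep each row that is similar to at least one other row; logical indexing
# selects every row at most once, so no row position is repeated
similar_rows <- anscombe[rowSums(m0) > 0, ]
similar_rows
```

This assumes "excluding duplicates" means not listing a row more than once. If it instead means dropping rows with identical values, wrapping the result in unique() would be one option.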
We can create a data frame of similar row number pairs like this. To keep the output short we have excluded self pairs and the reverse of each pair.
# two-column data.frame of pairs excluding self pairs and reverses
subset(as.data.frame.table(m), as.integer(Var1) < as.integer(Var2) & Freq == 1)[1:2]
giving:
    Var1 Var2
34     1    4
35     2    4
45     1    5
58     3    6
91     3    9
93     5    9
101    2   10
106    7   10
117    7   11
118    8   11
Here is code for a network graph of the above, followed by a raster plot of the same matrix:
# network graph
library(igraph)
g <- graph_from_adjacency_matrix(m)  # graph.adjacency() is the older, deprecated name
plot(g)
# raster plot
library(ggplot2)
ggplot(as.data.frame.table(m), aes(Var1, Var2, fill = factor(Freq))) +
geom_raster()
Answer 1 (score: 0)
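I'm quite new to R, so don't expect too much.
What if you take the values of one column (basically a vector) and build a matrix of the distances between every two values? From that you can find the combinations that differ by no more than 1. This gives you the matching pairs for (a). Repeat this step for (b) and (c), and keep the pairs that appear in all three.
Alternatively, this could also be done as a cube.
Just a hint of an idea.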
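A minimal sketch of this per-column idea in base R, using the anscombe columns x1, x2 and x3 from the other answer as stand-ins for a, b and c (the names close_in, per_column, similar and pairs are made up for this illustration):

```r
# TRUE where two values of one column differ by at most 1
close_in <- function(x) abs(outer(x, x, "-")) <= 1

# one logical matrix per column, then keep only pairs that match in all columns
per_column <- lapply(anscombe[c("x1", "x2", "x3")], close_in)
similar <- Reduce("&", per_column)
diag(similar) <- FALSE  # ignore self pairs

# (row, col) index pairs, each unordered pair listed once
pairs <- which(similar & upper.tri(similar), arr.ind = TRUE)
pairs
```

Combining the per-column matrices with "&" amounts to the same test as a maximum-distance threshold over the three columns, so this should agree with the dist-based matrix in the other answer.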