Compare each row with all other rows in a data.frame

Time: 2018-08-21 14:05:44

Tags: r tidyverse

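I have a tibble in which each row corresponds to a sample with an ID, and each sample has several categories with values. I want to list the pairs of IDs for which the values of two or more categories match each other.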

# A tibble: 3 x 2
ID    data            
<chr> <list>          
1 ID1   <tibble [1 x 3]>
2 ID2   <tibble [1 x 3]>
3 ID3   <tibble [1 x 3]>

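I can use sum(match(x$data[[i]], x$data[[j]], nomatch=0)>0) to get the number of matches between two rows.

I can get the answer by putting match() inside a loop over i and j. Is there a better way (for example, a tidier one using the tidyverse) to get the answer?

Thanks.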

For example, this is the original data.frame:

   ID category value
1 ID1   length   100
2 ID1     type     L
3 ID1    color  Blue
4 ID2   length   100
5 ID2     type     M
6 ID2    color  Blue
7 ID3   length   150
8 ID3     type     M
9 ID3    color  Blue

The output would look like this:

  ID.a ID.b matches
1  ID1  ID2       2
2  ID1  ID3       1
3  ID2  ID3       2

I used the tidyverse to convert the input data.frame into the nested tibble shown at the beginning, and then used match in a loop.

# A tibble: 3 x 2
ID    data            
<chr> <list>          
1 ID1   <tibble [1 x 3]>
2 ID2   <tibble [1 x 3]>
3 ID3   <tibble [1 x 3]>
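
The loop described in the question can be sketched roughly as follows. This is a base R approximation, not the asker's actual code: instead of the nested tibble it builds one vector of category=value keys per ID (the names `keys` and `res` are made up for illustration), which also keeps a value from matching across different categories.

```r
# Example data from the question
df <- data.frame(
  ID = rep(c("ID1", "ID2", "ID3"), each = 3),
  category = rep(c("length", "type", "color"), times = 3),
  value = c("100", "L", "Blue", "100", "M", "Blue", "150", "M", "Blue"),
  stringsAsFactors = FALSE
)

# One character vector of "category=value" keys per ID (hypothetical helper object)
keys <- split(paste(df$category, df$value, sep = "="), df$ID)
ids <- names(keys)

# Nested i/j loop over all unordered pairs of IDs
res <- data.frame(ID.a = character(), ID.b = character(), matches = integer())
for (i in seq_len(length(ids) - 1)) {
  for (j in (i + 1):length(ids)) {
    n <- sum(match(keys[[i]], keys[[j]], nomatch = 0) > 0)
    res <- rbind(res, data.frame(ID.a = ids[i], ID.b = ids[j], matches = n))
  }
}
res
```

This serves only as a baseline; the answers below replace the explicit loop with vectorised or join-based alternatives.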

4 answers:

Answer 0: (score: 2)

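Here is an option using crossprod on a table, all in base R. Set the lower-triangular values (and the diagonal) of the matrix returned by crossprod to NA, convert it to 'long' format with as.data.frame.table, and then subset the rows whose 'Freq' column is not NA.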

out <- with(df, crossprod(table(paste(category, value), ID)))
out[lower.tri(out, diag = TRUE)] <- NA
subset(as.data.frame.table(out), !is.na(Freq))
#    ID ID.1 Freq
#4 ID1  ID2    2
#7 ID1  ID3    1
#8 ID2  ID3    2

Data

df <- structure(list(ID = c("ID1", "ID1", "ID1", "ID2", "ID2", "ID2", 
"ID3", "ID3", "ID3"), category = c("length", "type", "color", 
 "length", "type", "color", "length", "type", "color"), 
 value = c("100", 
 "L", "Blue", "100", "M", "Blue", "150", "M", "Blue")), 
 class = "data.frame", row.names = c(NA, -9L))
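
To see why the crossprod approach counts matches, it may help to look at the intermediate objects. The sketch below is only illustrative (the names `inc` and `co` are not from the answer) and assumes each ID has at most one row per category/value combination, so that the table contains only 0s and 1s.

```r
df <- data.frame(
  ID = rep(c("ID1", "ID2", "ID3"), each = 3),
  category = rep(c("length", "type", "color"), times = 3),
  value = c("100", "L", "Blue", "100", "M", "Blue", "150", "M", "Blue"),
  stringsAsFactors = FALSE
)

# Incidence table: one row per category/value key, one column per ID
inc <- table(paste(df$category, df$value), df$ID)

# crossprod(inc) computes t(inc) %*% inc, so entry [a, b] is the number of
# keys that ID a and ID b have in common
co <- crossprod(inc)

co["ID1", "ID2"]  # shared keys between ID1 and ID2
diag(co)          # number of keys each ID has on its own
```

The matrix is symmetric, which is why the answer blanks out the lower triangle and the diagonal before reshaping.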

Answer 1: (score: 1)

df = read.table(text="
ID category value
ID1   length   100
ID1     type     L
ID1    color  Blue
ID2   length   100
ID2     type     M
ID2    color  Blue
ID3   length   150
ID3     type     M
ID3    color  Blue
", header=T, stringsAsFactors = F)

library(tidyverse)

# create a new column that combines category and value
df = df %>% unite(cat_val, category, value, remove = F)

# create vectorised function that counts matches (on that new value)
f = function(x,y) sum(df$cat_val[df$ID == x] == df$cat_val[df$ID == y])
f = Vectorize(f)

data.frame(t(combn(unique(df$ID), 2))) %>%  # create combinations of IDs (as a dataframe)
  mutate(matches = f(X1, X2))               # apply function

#    X1  X2 matches
# 1 ID1 ID2       2
# 2 ID1 ID3       1
# 3 ID2 ID3       2
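
Note that f compares the two vectors position by position, so it relies on every ID listing its categories in the same order. If that cannot be guaranteed, one possible variant (a sketch in base R; `f_in`, `id_pairs` and `out` are hypothetical names, and each category is assumed to appear once per ID) uses %in% instead of ==:

```r
df <- data.frame(
  ID = rep(c("ID1", "ID2", "ID3"), each = 3),
  category = rep(c("length", "type", "color"), times = 3),
  value = c("100", "L", "Blue", "100", "M", "Blue", "150", "M", "Blue"),
  stringsAsFactors = FALSE
)

# Shuffle rows so the categories are no longer in the same order for every ID
df <- df[c(3, 1, 2, 4, 6, 5, 8, 9, 7), ]
df$cat_val <- paste(df$category, df$value, sep = "_")

# Order-independent match count between two IDs
f_in <- function(x, y) sum(df$cat_val[df$ID == x] %in% df$cat_val[df$ID == y])

# All unordered pairs of IDs, one pair per row
id_pairs <- t(combn(unique(df$ID), 2))
out <- data.frame(
  ID.a = id_pairs[, 1],
  ID.b = id_pairs[, 2],
  matches = mapply(f_in, id_pairs[, 1], id_pairs[, 2], USE.NAMES = FALSE)
)
out
```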

Answer 2: (score: 1)

For the sake of completeness, here is also a solution using a self-join:

library(data.table)
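# self-join on category and value, keep each unordered pair once, count rows per pair
setDT(x)[x, on = .(category, value), allow.cartesian = TRUE][
  ID < i.ID, .N, by = .(ID1 = ID, ID2 = i.ID)]
#    ID1 ID2 N
# 1: ID1 ID2 2
# 2: ID2 ID3 2
# 3: ID1 ID3 1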

Data

x <- readr::read_table(
"i  ID category value
1 ID1   length   100
2 ID1     type     L
3 ID1    color  Blue
4 ID2   length   100
5 ID2     type     M
6 ID2    color  Blue
7 ID3   length   150
8 ID3     type     M
9 ID3    color  Blue")[, -1L]
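
The same self-join idea can also be sketched without data.table, using base R's merge(). This is an alternative illustration rather than part of the answer above; the column suffixes and the names `joined` and `out` are chosen here for the example.

```r
df <- data.frame(
  ID = rep(c("ID1", "ID2", "ID3"), each = 3),
  category = rep(c("length", "type", "color"), times = 3),
  value = c("100", "L", "Blue", "100", "M", "Blue", "150", "M", "Blue"),
  stringsAsFactors = FALSE
)

# Join the data to itself on category and value: one row per shared key and ID pair
joined <- merge(df, df, by = c("category", "value"), suffixes = c(".a", ".b"))

# Keep each unordered pair once and drop self-matches
joined <- joined[joined$ID.a < joined$ID.b, ]

# Count the shared keys per pair
joined$matches <- 1L
out <- aggregate(matches ~ ID.a + ID.b, data = joined, FUN = sum)
out
```

As with the data.table version, pairs of IDs that share no category/value at all do not appear in the result, because the join produces no rows for them.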

Answer 3: (score: 0)

Not sure whether this is any tidier, but we can do the following:

library(tidyverse)

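combn_join <- function(x) {
  # all index pairs of the per-ID data frames (instead of a hard-coded 1:3)
  idx <- combn(seq_along(x), 2)
  # left-join each pair on category and value; unmatched rows get NA in ID.y
  map2(idx[1, ], idx[2, ],
       ~ left_join(x[[.x]], x[[.y]], by = c("category", "value")) %>% 
         select(ID.x, ID.y))
}  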

df %>%
  split(.$ID) %>%
  combn_join(.) %>%
  do.call(rbind, .) %>%
  filter(!is.na(ID.y)) %>%
  group_by(ID.x, ID.y) %>%
  summarize(matches = n())

Result:

# A tibble: 3 x 3
# Groups:   ID.x [?]
  ID.x  ID.y  matches
  <fct> <fct>   <int>
1 ID1   ID2         2
2 ID1   ID3         1
3 ID2   ID3         2

Data:

df <- structure(list(ID = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 
3L), .Label = c("ID1", "ID2", "ID3"), class = "factor"), category = structure(c(2L, 
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L), .Label = c("color", "length", 
"type"), class = "factor"), value = structure(c(1L, 4L, 3L, 1L, 
5L, 3L, 2L, 5L, 3L), .Label = c("100", "150", "Blue", "L", "M"
), class = "factor")), .Names = c("ID", "category", "value"), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9"))