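I have a tibble in which each row corresponds to a sample with an ID, and each sample has several categories with values. I want to list the pairs of IDs for which the values of two or more categories match each other.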
# A tibble: 3 x 2
ID data
<chr> <list>
1 ID1 <tibble [1 x 3]>
2 ID2 <tibble [1 x 3]>
3 ID3 <tibble [1 x 3]>
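I can use sum(match(x$data[[i]], x$data[[j]], nomatch=0)>0) to get the number of matches between two rows, and I can get the answer by putting match() inside loops over i and j. Is there a better way (for example, a tidier one using tidyverse) to get the answer?
Thank you.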
For example, this is the original data.frame:
ID category value
1 ID1 length 100
2 ID1 type L
3 ID1 color Blue
4 ID2 length 100
5 ID2 type M
6 ID2 color Blue
7 ID3 length 150
8 ID3 type M
9 ID3 color Blue
The output would look like this:
ID.a ID.b matches
1 ID1 ID2 2
2 ID1 ID3 1
3 ID2 ID3 2
I used tidyverse to convert the input data.frame into the tibble shown at the beginning, and then used match in a loop.
# A tibble: 3 x 2
ID data
<chr> <list>
1 ID1 <tibble [1 x 3]>
2 ID2 <tibble [1 x 3]>
3 ID3 <tibble [1 x 3]>
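The loop-based baseline described in the question could be sketched roughly as follows. This is only an illustration under stated assumptions: it works on the long data.frame rather than the nested tibble, it compares values category by category instead of calling match(), and the names `vals`, `ids` and `res` are made up for the example.

```r
# Sample data from the question (values kept as character).
df <- data.frame(
  ID = rep(c("ID1", "ID2", "ID3"), each = 3),
  category = rep(c("length", "type", "color"), times = 3),
  value = c("100", "L", "Blue", "100", "M", "Blue", "150", "M", "Blue"),
  stringsAsFactors = FALSE
)

# One named vector per ID: names are categories, elements are values.
vals <- lapply(split(df, df$ID), function(d) setNames(d$value, d$category))
ids <- names(vals)

res <- data.frame(ID.a = character(), ID.b = character(),
                  matches = integer(), stringsAsFactors = FALSE)

# Visit every unordered pair (i < j) and count categories with equal values.
for (i in seq_len(length(ids) - 1)) {
  for (j in (i + 1):length(ids)) {
    a <- vals[[ids[i]]]
    b <- vals[[ids[j]]]
    common <- intersect(names(a), names(b))  # categories present in both IDs
    res <- rbind(res, data.frame(ID.a = ids[i], ID.b = ids[j],
                                 matches = sum(a[common] == b[common]),
                                 stringsAsFactors = FALSE))
  }
}
print(res)
```

Growing `res` with rbind() inside the loop is fine for a handful of IDs but scales poorly, which is part of why the vectorised answers below are attractive.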
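Answer 0 (score: 2)
Here is an option using base R with table and crossprod. Set the lower-triangular values (including the diagonal) of the matrix returned by crossprod to NA, convert it to 'long' format with as.data.frame.table, and then subset the rows in which the 'Freq' column is not NA.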
out <- with(df, crossprod(table(paste(category, value), ID)))
out[lower.tri(out, diag = TRUE)] <- NA
subset(as.data.frame.table(out), !is.na(Freq))
# ID ID.1 Freq
#4 ID1 ID2 2
#7 ID1 ID3 1
#8 ID2 ID3 2
Data:
df <- structure(list(ID = c("ID1", "ID1", "ID1", "ID2", "ID2", "ID2",
"ID3", "ID3", "ID3"), category = c("length", "type", "color",
"length", "type", "color", "length", "type", "color"),
value = c("100",
"L", "Blue", "100", "M", "Blue", "150", "M", "Blue")),
class = "data.frame", row.names = c(NA, -9L))
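To see why crossprod gives the match counts, it may help to look at the intermediate objects. The sketch below rebuilds the same sample data and prints each step; the names `inc` and `shared` are made up for illustration, and it assumes each ID has at most one row per category/value pair (duplicates would inflate the counts).

```r
# Sample data from the question (values kept as character).
df <- data.frame(
  ID = rep(c("ID1", "ID2", "ID3"), each = 3),
  category = rep(c("length", "type", "color"), times = 3),
  value = c("100", "L", "Blue", "100", "M", "Blue", "150", "M", "Blue"),
  stringsAsFactors = FALSE
)

# Step 1: incidence table with one row per category/value pair and one
# column per ID; a cell is 1 if that ID has that pair, otherwise 0.
inc <- with(df, table(paste(category, value), ID))
print(inc)

# Step 2: crossprod(inc) computes t(inc) %*% inc, so entry [a, b] sums the
# products of the 0/1 indicators, i.e. the number of pairs shared by a and b.
shared <- crossprod(inc)
print(shared)
```

The diagonal of `shared` is just each ID's own number of category/value pairs, and the matrix is symmetric, which is why the answer blanks out the diagonal and the lower triangle before reshaping.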
Answer 1 (score: 1)
df = read.table(text="
ID category value
ID1 length 100
ID1 type L
ID1 color Blue
ID2 length 100
ID2 type M
ID2 color Blue
ID3 length 150
ID3 type M
ID3 color Blue
", header=T, stringsAsFactors = F)
library(tidyverse)
# create a new column that combines category and value
df = df %>% unite(cat_val, category, value, remove = F)
# create vectorised function that counts matches (on that new value)
f = function(x,y) sum(df$cat_val[df$ID == x] == df$cat_val[df$ID == y])
f = Vectorize(f)
data.frame(t(combn(unique(df$ID), 2))) %>% # create combinations of IDs (as a dataframe)
mutate(matches = f(X1, X2)) # apply function
# X1 X2 matches
# 1 ID1 ID2 2
# 2 ID1 ID3 1
# 3 ID2 ID3 2
Answer 2 (score: 1)
For the sake of completeness, here is also a solution that uses a self-join:
library(data.table)
setDT(x)[x, on = .(category, value), allow.cartesian = TRUE][
ID < i.ID, .N, by = .(ID1 = ID, ID2 = i.ID)]
   ID1 ID2 N
1: ID1 ID2 2
2: ID2 ID3 2
3: ID1 ID3 1
Data:
x <- readr::read_table(
"i ID category value
1 ID1 length 100
2 ID1 type L
3 ID1 color Blue
4 ID2 length 100
5 ID2 type M
6 ID2 color Blue
7 ID3 length 150
8 ID3 type M
9 ID3 color Blue")[, -1L]
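The same self-join idea can also be sketched in base R with merge(), for readers who do not use data.table. This is an illustrative equivalent, not part of the original answer; the names `pairs` and `res` and the `.a`/`.b` suffixes are chosen for the example.

```r
# Sample data from the question (values kept as character).
df <- data.frame(
  ID = rep(c("ID1", "ID2", "ID3"), each = 3),
  category = rep(c("length", "type", "color"), times = 3),
  value = c("100", "L", "Blue", "100", "M", "Blue", "150", "M", "Blue"),
  stringsAsFactors = FALSE
)

# Self-join on category and value: one row per pair of rows that agree.
pairs <- merge(df, df, by = c("category", "value"), suffixes = c(".a", ".b"))

# Keep each unordered ID pair once and drop an ID joined with itself.
pairs <- pairs[pairs$ID.a < pairs$ID.b, ]

# Count the surviving rows per ID pair.
pairs$matches <- 1L
res <- aggregate(matches ~ ID.a + ID.b, data = pairs, FUN = sum)
print(res)
```

As with the data.table version, ID pairs that share no category/value combination simply do not appear in the result.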
Answer 3 (score: 0)
Not sure if this is any tidier, but we can do the following:
library(tidyverse)
combn_join <- function(x) {
  # all index pairs (i < j) over the per-ID data frames in x
  idx <- combn(seq_along(x), 2)
  map2(idx[1, ], idx[2, ],
       ~ left_join(x[[.x]], x[[.y]], by = c("category", "value")) %>%
         select(ID.x, ID.y))
}
df %>%
split(.$ID) %>%
combn_join(.) %>%
do.call(rbind, .) %>%
filter(!is.na(ID.y)) %>%
group_by(ID.x, ID.y) %>%
summarize(matches = n())
Result:
# A tibble: 3 x 3
# Groups: ID.x [?]
ID.x ID.y matches
<fct> <fct> <int>
1 ID1 ID2 2
2 ID1 ID3 1
3 ID2 ID3 2
Data:
df <- structure(list(ID = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L), .Label = c("ID1", "ID2", "ID3"), class = "factor"), category = structure(c(2L,
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L), .Label = c("color", "length",
"type"), class = "factor"), value = structure(c(1L, 4L, 3L, 1L,
5L, 3L, 2L, 5L, 3L), .Label = c("100", "150", "Blue", "L", "M"
), class = "factor")), .Names = c("ID", "category", "value"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9"))