I need to compute the Jaccard similarity for the following data frame:
df = data.frame(
a=c("1", "1", "1", "1", "2", "2", "2", "3", "3", "4", "4", "4", "4"),
b=c("100", "101", "111", "25841", "111", "101", "106", "101", "108", "100", "30256", "108", "112"))
Do I need to convert the data into a binary (presence/absence) table first? Something like this?
    100  101  111  25841  106  108  30256  112
1     1    1    1      1    0    0      0    0
2     0    1    1      0    1    0      0    0
3     0    1    0      0    0    1      0    0
4     1    0    0      0    0    1      1    1
And then use jaccard <- vegdist(df, method = "jaccard")?
Answer 0 (score: 0)
The Jaccard index can be derived from a binary table. See the Wikipedia article on the Jaccard index.
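For reference, the binary-table route asked about in the question can be sketched in base R as follows. This is a minimal sketch, not part of the original answer: the names tab, m, inter, sizes, uni and jaccard_matrix are made up for illustration. It builds the presence/absence matrix from the long data frame, counts shared items per pair of rows with a cross product, and divides by the size of the union. Base R's dist(m, method = "binary") returns the matching Jaccard distance and serves as a cross-check.

```r
df <- data.frame(
  a = c("1", "1", "1", "1", "2", "2", "2", "3", "3", "4", "4", "4", "4"),
  b = c("100", "101", "111", "25841", "111", "101", "106", "101", "108",
        "100", "30256", "108", "112"),
  stringsAsFactors = FALSE)

# Presence/absence matrix: one row per value of `a`, one column per value of `b`
tab <- table(df$a, df$b)
m <- matrix(as.integer(tab > 0), nrow = nrow(tab),
            dimnames = unname(dimnames(tab)))

# |x n y| for every pair of rows: number of columns where both rows are 1
inter <- tcrossprod(m)

# |x u y| = |x| + |y| - |x n y|
sizes <- rowSums(m)
uni <- outer(sizes, sizes, "+") - inter

# Jaccard index for every pair of rows (diagonal is 1)
jaccard_matrix <- inter / uni

# Cross-check: the "binary" method of dist() is the Jaccard distance
jd_base <- dist(m, method = "binary")

# With the vegan package installed (assumption, not run here), the call from
# the question would take the binary matrix rather than the long data frame:
# vegan::vegdist(m, method = "jaccard", binary = TRUE)
```

Note that the matrix m, not the original two-column data frame, is what a distance function needs as input.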
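Here I show another way to get the Jaccard index, computing it directly from the sets of b values in each group rather than from a binary table.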
# Data
df = data.frame(
a=c("1", "1", "1", "1", "2", "2", "2", "3", "3", "4", "4", "4", "4"),
b=c("100", "101", "111", "25841", "111", "101", "106", "101", "108", "100", "30256", "108", "112"),
stringsAsFactors = FALSE)
library('data.table')
setDT(df)
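# Jaccard index of two sets: size of the intersection divided by size of the union
jaccard_index <- function(x, y)
{
  x_int <- intersect(x, y)   # elements in both x and y
  x_union <- union(x, y)     # elements in x or y
  return(length(x_int) / length(x_union))
}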
ji <- combn(unique(df$a), 2, FUN = function(z){
x <- df[ a %in% z[1], b]
y <- df[ a %in% z[2], b]
jaccard_index(x,y)
})
ji <- setNames( ji, combn(unique(df$a), 2, FUN = paste0, collapse = ""))
ji
#        12        13        14        23        24        34
# 0.4000000 0.2000000 0.1428571 0.2500000 0.0000000 0.2000000
# jaccard distance
jd <- 1- ji
jd
#        12        13        14        23        24        34
# 0.6000000 0.8000000 0.8571429 0.7500000 1.0000000 0.8000000
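The pairwise computation above does not strictly need data.table. As a hedged alternative, the sketch below uses base R's split() to get one set of b values per group and nested sapply() calls to fill a full symmetric matrix. The names sets, ji_matrix and jd_dist are illustrative and not from the original answer.

```r
df <- data.frame(
  a = c("1", "1", "1", "1", "2", "2", "2", "3", "3", "4", "4", "4", "4"),
  b = c("100", "101", "111", "25841", "111", "101", "106", "101", "108",
        "100", "30256", "108", "112"),
  stringsAsFactors = FALSE)

# Same definition as in the answer: |x n y| / |x u y|
jaccard_index <- function(x, y) {
  length(intersect(x, y)) / length(union(x, y))
}

# One character vector of `b` values per group in `a`
sets <- split(df$b, df$a)

# Jaccard index for every ordered pair of groups (symmetric, diagonal is 1)
ji_matrix <- sapply(sets, function(x) {
  sapply(sets, function(y) jaccard_index(x, y))
})

# Jaccard distance as a "dist" object, e.g. as input for hclust()
jd_dist <- as.dist(1 - ji_matrix)
```

Indexing by group label, for example ji_matrix["1", "2"], avoids the ambiguity that pasted names such as "12" would have if group labels were longer than one character.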
The same function applied to the test data from the example here, which also gives the expected output as a reference point:
# test data
test <- data.frame( a = c(rep("A",5), rep("B", 7)),
b = c(0,1,2,5,6,0,2,3,4,5,7,9),
stringsAsFactors = FALSE)
setDT(test)
# jaccard index
ji_test <- combn(unique(test$a), 2, FUN = function(z){
x <- test[ a %in% z[1], b]
y <- test[ a %in% z[2], b]
jaccard_index(x,y)
})
ji_test <- setNames( ji_test, combn(unique(test$a), 2, FUN = paste0, collapse = ""))
ji_test
#        AB
# 0.3333333
# jaccard distance
jd_test <- 1- ji_test
jd_test
#        AB
# 0.6666667