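Applying a (cosine) similarity measure on a data.table

Date: 2017-05-03 02:31:34

Tags: r data.table apply similarity cosine-similarity

I am looking for a sensible way to determine the similarity between the members of project teams, where each member is scored on four dimensions.

An excerpt of the data is shown below, and a slightly larger sample is added at the end of the question as dput() output.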

       pnum invid dom_st prim_st pat_st net_st
 1: 7265873 24104      0       1      1      0
 2: 7266757 38775      1       2      2      3
 3: 7266757 38776      1       2      2      3
 4: 7268524 34281      1       3      2      2
 5: 7268524 34282      1       3      2      2
 6: 7272620 20002      0       1      2      0
 7: 7272620 22284      0       1      2      0
 8: 7273253 31921      1       1      1      4
 9: 7273253 31922      1       1      1      4
10: 7283628 26841      1       1      1      2
11: 7283628 26843      1       1      1      2
12: 7289442 17763      2      11     48     10
13: 7289442 17764      2      11     63      9
14: 7289525 38087      0       1      1      0
15: 7289525 38088      0       2      1      0
16: 7289525 38089      0       3      1      1

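The goal is to create a similarity measure for each 'pnum' that compares the values of the last four columns across all of its 'invid's. The number of 'invid's per 'pnum' varies between 2 and 26.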

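Edit 1: Specifically, for 'pnum' 7266757 (rows 2 and 3), I want the similarity between invid 38775 (1,2,2,3) and invid 38776 (1,2,2,3), so this should give 1. For 'pnum' 7289525 (rows 14-16), I want the similarity among the three row vectors (0,1,1,0), (0,2,1,0) and (0,3,1,1). This gives the following: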

simil(matrix(c(0,1,1,0,0,2,1,0,0,3,1,1), nrow = 3, byrow = TRUE), method = "cosine")
          1         2
2 0.9486833          
3 0.8528029 0.9438798
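
As a cross-check on the values above, the cosine similarity can also be computed by hand in base R. This is only a minimal sketch: `cosine_sim` is a hypothetical helper name introduced here for illustration and is not part of proxy.

```r
# Hypothetical helper (not part of proxy): cosine similarity of two numeric vectors.
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# The three members of pnum 7289525, one row per invid.
m <- matrix(c(0, 1, 1, 0,
              0, 2, 1, 0,
              0, 3, 1, 1), nrow = 3, byrow = TRUE)

# All unordered pairs of rows: (1,2), (1,3), (2,3).
pairs <- combn(nrow(m), 2)
sims <- apply(pairs, 2, function(p) cosine_sim(m[p[1], ], m[p[2], ]))
round(sims, 7)
```

The three values correspond to the lower triangle printed by simil(), read column by column.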

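In a final step (possibly a separate formula), I would like to "reduce" the matrix (for teams with n > 2) to a single value, ideally bounded between 0 and 1. A simple way to do this is to take the mean of the matrix entries, but perhaps there is a smarter way?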

I tried the following (the data is stored in the data.table 'dt'), but got this error:

library('proxy')    
sim <- dt[, simil(dt, method="cosine"), by = pnum]
    Error in .Call("R_cosine", c(4262069, 4262069, 4262069, 4273567, 4273567, : negative length vectors are not allowed
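
One plausible reading of this error, which is an assumption about what happens inside proxy and is not confirmed here: `simil(dt, ...)` is handed the whole table `dt` for every group instead of the rows of the current group, so it attempts to compare all rows with each other. For a table of the stated size, the number of row pairs no longer fits into a 32-bit integer, as this small sketch shows.

```r
# Assumption: the full table has about 150,000 rows, as stated in the question.
n_rows <- 150000

# Number of unordered row pairs that a full pairwise comparison would need.
n_pairs <- n_rows * (n_rows - 1) / 2
n_pairs

# Largest value an R integer can hold.
.Machine$integer.max

# TRUE means the pair count exceeds the 32-bit integer range.
n_pairs > .Machine$integer.max
```

Inside `j`, `.SD` refers only to the rows of the current group, which is what a comparison per 'pnum' needs.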

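Any suggestions on how to apply this or a similar function to a data.table more successfully, as well as creative ideas on how to reduce the similarity matrix to a single point value, would be very welcome.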

The full data set has about 150,000 rows and about 92,000 projects ('pnum').

structure(list(pnum = c(7265873, 7266757, 7266757, 7268524, 7268524, 
7272620, 7272620, 7273253, 7273253, 7283628, 7283628, 7289442, 
7289442, 7289525, 7289525, 7289525, 7301987, 7301987, 7305259, 
7305259, 7307986, 7307986, 7310332, 7310332, 7333490, 7333490, 
7333502, 7333502, 7414991, 7414991), invid = c(24104, 38775, 
38776, 34281, 34282, 20002, 22284, 31921, 31922, 26841, 26843, 
17763, 17764, 38087, 38088, 38089, 34843, 38412, 32514, 33946, 
28587, 28588, 17204, 17205, 28587, 28588, 28587, 28588, 37008, 
37009), dom_st = c(0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 0, 
0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0), prim_st = c(1, 
2, 2, 3, 3, 1, 1, 1, 1, 1, 1, 11, 11, 1, 2, 3, 3, 3, 1, 1, 5, 
5, 3, 3, 5, 5, 5, 5, 3, 3), pat_st = c(1, 2, 2, 2, 2, 2, 2, 1, 
1, 1, 1, 48, 63, 1, 1, 1, 1, 1, 1, 1, 5, 5, 14, 14, 5, 5, 5, 
5, 1, 1), net_st = c(0, 3, 3, 2, 2, 0, 0, 4, 4, 2, 2, 10, 9, 
0, 0, 1, 2, 2, 0, 0, 2, 2, 4, 4, 2, 2, 2, 2, 0, 0)), .Names = c("pnum", 
"invid", "dom_st", "prim_st", "pat_st", "net_st"), class = c("data.table", 
"data.frame"), row.names = c(NA, -30L))

1 answer:

Answer 0 (score: 2)

This worked for me:

library(data.table)
setDT(DT)
# find relevant columns for call to simil
cols <- stringr::str_subset(names(DT), "_st$")
cols
#[1] "dom_st" "prim_st" "pat_st" "net_st"
DT[, (mean(proxy::simil(.SD, method="cosine"))), .SDcols = cols, by = pnum]
#       pnum        V1
# 1: 7265873       NaN
# 2: 7266757 1.0000000
# 3: 7268524 1.0000000
# 4: 7272620 1.0000000
# 5: 7273253 1.0000000
# 6: 7283628 1.0000000
# 7: 7289442 0.9968006
# 8: 7289525 0.9151220
# 9: 7301987 1.0000000
#10: 7305259 1.0000000
#11: 7307986 1.0000000
#12: 7310332 1.0000000
#13: 7333490 1.0000000
#14: 7333502 1.0000000
#15: 7414991 1.0000000

Note: I needed to wrap the j expression in parentheses. Without them, I got an error message which I don't understand:

DT[, mean(proxy::simil(.SD, method="cosine")), .SDcols = cols, by = pnum]

Error in FUN(X[[i]], ...) :
    Invalid column: it has dimensions. Can't format it. If it's the result of data.table(table()), use as.data.table(table()) instead.

Edit 1

If you want to get the similarity matrices for each pnum (before averaging them), I suggest using lapply(), which returns a list:

pnums <- DT[, unique(pnum)]
results <- lapply(pnums, function(x) {
  proxy::simil(DT[pnum == x, cols, with = FALSE], method="cosine")
})
setNames(results, pnums)
#$`7265873`
#simil(0)
#
#$`7266757`
#  1
#2 1
#
#$`7268524`
#  1
#2 1
#
#$`7272620`
#  1
#2 1
#
#$`7273253`
#  1
#2 1
#
#$`7283628`
#  1
#2 1
#
#$`7289442`
#          1
#2 0.9968006
#
#$`7289525`
#          1         2
#2 0.9486833
#3 0.8528029 0.9438798
#
#$`7301987`
#  1
#2 1
#
#$`7305259`
#  1
#2 1
#
#$`7307986`
#  1
#2 1
#
#$`7310332`
#  1
#2 1
#
#$`7333490`
#  1
#2 1
#
#$`7333502`
#  1
#2 1
#
#$`7414991`
#  1
#2 1

Edit 2

The OP has added an additional requirement: he wants to compute some summary values for each pnum. This can be achieved by

DT[, {
  sim_mat <- proxy::simil(.SD, method="cosine")
  list(min = min(sim_mat), max = max(sim_mat), 
       mean = mean(sim_mat), sd = sd(sim_mat))
}, .SDcols = cols, by = pnum]
#       pnum       min       max      mean         sd
# 1: 7265873       Inf      -Inf       NaN         NA
# 2: 7266757 1.0000000 1.0000000 1.0000000         NA
# 3: 7268524 1.0000000 1.0000000 1.0000000         NA
# 4: 7272620 1.0000000 1.0000000 1.0000000         NA
# 5: 7273253 1.0000000 1.0000000 1.0000000         NA
# 6: 7283628 1.0000000 1.0000000 1.0000000         NA
# 7: 7289442 0.9968006 0.9968006 0.9968006         NA
# 8: 7289525 0.8528029 0.9486833 0.9151220 0.05402336
# 9: 7301987 1.0000000 1.0000000 1.0000000         NA
#10: 7305259 1.0000000 1.0000000 1.0000000         NA
#11: 7307986 1.0000000 1.0000000 1.0000000         NA
#12: 7310332 1.0000000 1.0000000 1.0000000         NA
#13: 7333490 1.0000000 1.0000000 1.0000000         NA
#14: 7333502 1.0000000 1.0000000 1.0000000         NA
#15: 7414991 1.0000000 1.0000000 1.0000000         NA
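
The first row shows what happens for a project with a single member: there are no pairs, so min() and max() return Inf and -Inf and mean() returns NaN. Below is a minimal sketch of one way to get NA instead, assuming it is acceptable to treat such teams as missing. `summarise_sims` is a hypothetical helper introduced here for illustration (it is not part of data.table or proxy) and takes the pairwise similarities as a plain numeric vector.

```r
# Hypothetical helper: summarise a vector of pairwise similarities.
# Returns NA for every statistic when there are no pairs (single-member team).
summarise_sims <- function(sims) {
  sims <- as.numeric(sims)
  if (length(sims) == 0L) {
    return(list(min = NA_real_, max = NA_real_, mean = NA_real_, sd = NA_real_))
  }
  list(min = min(sims), max = max(sims), mean = mean(sims), sd = sd(sims))
}

# Pairwise values for pnum 7289525, taken from the output above.
three_members <- summarise_sims(c(0.9486833, 0.8528029, 0.9438798))

# A single-member team has no pairs at all.
one_member <- summarise_sims(numeric(0))

str(three_members)
str(one_member)
```

Inside the grouped call, the helper could presumably stand in for the explicit list, with `summarise_sims(proxy::simil(.SD, method="cosine"))` as the j expression; that combination has not been run against the real data here. Every branch returns doubles, so the column types stay consistent across groups.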