N-gram hashing: document fingerprinting

Time: 2011-11-12 13:29:48

Tags: r hash text-mining fingerprinting

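I am trying to implement the winnowing algorithm for document fingerprinting in R.

For reference, see http://www.ida.liu.se/~TDDC03/oldprojects/2005/final-projects/prj10.pdf

My question:

How do I get a hash value for each n-gram, and how do I select which hashes to keep?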

nGrams <- c("adoru", "dorun", "orunr", "runru", "unrun", "nrunr" ,"runru",
  "unrun","nruna", "runad", "unado", "nador", "adoru", "dorun", "orunr" ,"runru" ,
  "unrun")

1 answer:

Answer 0: (score: 1)

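It seems that

library(digest)
# Hash each n-gram with CRC32; digest() returns the hash as a hex string
v <- sapply(nGrams, digest, algo = "crc32")
uv <- unique(v)
# Parse via as.numeric("0x..."): hashes of 0x80000000 or more do not fit in R's integer type
h <- as.numeric(paste0("0x", uv))
# TRUE for the hashes that are 0 mod 4 after the adjustment
(h - 1) %% 4 == 0

would be a good start. (The 1 is subtracted on the assumption that these CRC32 values come out odd. That is not a general property of CRC32, so check the parity of your own hashes before relying on it.)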
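For the selection step itself, winnowing is usually described as sliding a window of w consecutive hashes over the sequence and keeping the minimum of each window, taking the rightmost one on ties. Below is a minimal sketch of that idea in base R. The helper `hash_ngram` (a simple polynomial hash with an arbitrary base and modulus), the name `winnow`, and the window size `w = 4` are illustrative assumptions, not taken from the linked paper or from the digest package; the CRC32 values computed above could be passed to `winnow` instead.

```r
# Hypothetical stand-in hash: a polynomial rolling hash over character codes.
# base and mod are arbitrary illustrative choices.
hash_ngram <- function(s, base = 31, mod = 1000003) {
  h <- 0
  for (code in utf8ToInt(s)) h <- (h * base + code) %% mod
  h
}

# Winnowing sketch: for every window of w consecutive hashes, keep the
# minimum hash, choosing the rightmost occurrence when there are ties.
# Each selected position is recorded only once.
winnow <- function(hashes, w = 4) {
  stopifnot(length(hashes) >= w)
  picked <- numeric(0)
  for (start in seq_len(length(hashes) - w + 1)) {
    window <- hashes[start:(start + w - 1)]
    # Position of the rightmost minimum, expressed as an index into hashes
    pos <- start + max(which(window == min(window))) - 1
    if (!(pos %in% picked)) picked <- c(picked, pos)
  }
  data.frame(position = picked, hash = hashes[picked])
}

# Rebuild the 5-grams from the question out of the underlying text
text <- "adorunrunrunadorunrun"
k <- 5
nGrams <- substring(text, 1:(nchar(text) - k + 1), k:nchar(text))

hashes <- vapply(nGrams, hash_ngram, numeric(1), USE.NAMES = FALSE)
fingerprint <- winnow(hashes, w = 4)
print(fingerprint)
```

Keeping the position next to each selected hash makes it possible to locate the matching region in the document later. Unlike the 0 mod 4 filter, this guarantees at least one selected hash in every window of w n-grams.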