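I am currently experimenting with the MinHash algorithm to measure the similarity between two texts (or documents), based on the following module: https://github.com/sjhorn/node-minhash/blob/master/lib/minhash.js.
In that function, I split the text into tokens (or k-shingles) and then compute a hash value for each token/shingle: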
crc32.str(token) & 0xffffffff
So I get back a set of numbers for the different token parts. Example:
The quick brown fox jumps over the lazy dog. == -3677935418
The quick brown fox jumps over the lazy dog. == -3191969143
The quick brown fox jumps over the lazy dog. == -2264094193
The quick brown fox jumps over the lazy dog. == -4003895775
The quick brown fox jumps over the lazy dog. == -4077650760
The quick brown fox jumps over the lazy dog. == -3917776217
The quick brown fox jumps over the lazy. == -3677935418
The quick brown fox jumps over the lazy. == -3191969143
The quick brown fox jumps over the lazy. == -2264094193
The quick brown fox jumps over the lazy. == -4003895775
The quick brown fox jumps over the lazy. == -4077650760
The quick brown fox jumps over the lazy. == -2302592728
M1 length: 6
M2 length: 6
Minhash1: -3677935418 Minhash2: -3677935418
Minhash1: -3191969143 Minhash2: -3191969143
Minhash1: -2264094193 Minhash2: -2264094193
Minhash1: -4003895775 Minhash2: -4003895775
Minhash1: -4077650760 Minhash2: -4077650760
shared: 5 total: 6
Shared/Total: 0.8333333333333334
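Compared with each other, the two sets of hashes are very similar: 5 of the 6 values match. The number of permutations in this example is 6.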
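For reference, the shingle/hash/compare steps described above could be sketched roughly as follows. This is only a minimal illustration and not the code from node-minhash: `crc32` is a self-contained stand-in for the `crc32.str(token)` call (it returns an unsigned value), and `wordShingles`, `minhashSignature`, `similarity`, the `SEEDS` constants and the XOR-with-seed "permutation" are all assumptions made up for this example.

```javascript
// Table-based CRC-32 (reflected polynomial 0xEDB88320); stand-in for crc32.str().
const CRC_TABLE = Array.from({ length: 256 }, (_, n) => {
  let c = n;
  for (let k = 0; k < 8; k++) c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
  return c >>> 0;
});

function crc32(str) {
  let crc = 0xffffffff;
  for (const byte of new TextEncoder().encode(str)) {
    crc = CRC_TABLE[(crc ^ byte) & 0xff] ^ (crc >>> 8);
  }
  return (crc ^ 0xffffffff) >>> 0; // unsigned 32-bit result
}

// Split a text into overlapping shingles of k words each.
function wordShingles(text, k) {
  const words = text.toLowerCase().split(/\W+/).filter(Boolean);
  const shingles = [];
  for (let i = 0; i + k <= words.length; i++) {
    shingles.push(words.slice(i, i + k).join(' '));
  }
  return shingles;
}

// Hypothetical fixed seeds: one per permutation (6, as in the example above).
const SEEDS = [0x9e3779b9, 0x85ebca6b, 0xc2b2ae35, 0x27d4eb2f, 0x165667b1, 0xd3a2646c];

// For each seed, keep the minimum of (hash XOR seed) over all shingles.
// XOR is a cheap bijection on 32-bit values, used here only for illustration.
function minhashSignature(text, k = 3, seeds = SEEDS) {
  const hashes = wordShingles(text, k).map(crc32);
  return seeds.map((seed) =>
    hashes.reduce((min, h) => Math.min(min, (h ^ seed) >>> 0), 0xffffffff)
  );
}

// Fraction of signature positions that carry the same value (shared / total).
function similarity(sigA, sigB) {
  let shared = 0;
  for (let i = 0; i < sigA.length; i++) {
    if (sigA[i] === sigB[i]) shared++;
  }
  return shared / sigA.length;
}

console.log(
  similarity(
    minhashSignature('The quick brown fox jumps over the lazy dog.'),
    minhashSignature('The quick brown fox jumps over the lazy.')
  )
);
```

The signature always has one entry per seed, so two signatures can be compared position by position no matter how long the input texts are.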
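Now my problem: I would like to know how to create a single signature hash string for a text, because all the modules only compare texts directly with each other and output a similarity coefficient. For similar texts/documents, the hash strings should be similar as well. Example (something like this):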
The quick brown fox jumps over the lazy dog
Hash: KV5rsUfZpcZdVojpG8mHLA==
The quick brown fox jumps over the lazy
Hash: KV5rsUfZpcZdVojpG8hTPS==
Is it possible to build one identifying hash string out of the individual token hashes, and to encode it as a hex string or something similar?
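To make the question more concrete, this is roughly the kind of encoding I have in mind, as a sketch only. It assumes every token hash fits into 32 bits and packs each one into exactly 4 bytes before encoding; `packSignature`, `toHex`, `toBase64` and `unpackHex` are hypothetical helper names, and `Buffer` is only available in Node.js.

```javascript
// Pack 32-bit hashes into a byte array, 4 bytes per hash (big-endian).
function packSignature(signature) {
  const bytes = new Uint8Array(signature.length * 4);
  const view = new DataView(bytes.buffer);
  // ">>> 0" maps negative values into the unsigned 32-bit range.
  signature.forEach((h, i) => view.setUint32(i * 4, h >>> 0));
  return bytes;
}

// Encode bytes as a lowercase hex string, always 2 characters per byte.
function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join('');
}

// Encode bytes as Base64 (Node.js only, relies on Buffer).
function toBase64(bytes) {
  return Buffer.from(bytes).toString('base64');
}

// Recover the unsigned 32-bit hashes from a hex string (8 characters each).
function unpackHex(hex) {
  const out = [];
  for (let i = 0; i < hex.length; i += 8) {
    out.push(parseInt(hex.slice(i, i + 8), 16));
  }
  return out;
}

const packed = packSignature([-3677935418, -3191969143, -2264094193]);
console.log(toHex(packed));
console.log(toBase64(packed));
```

Because every hash occupies a fixed 8 hex characters, the string can be split back into the individual hashes without ambiguity. Base64 groups bits in units of 6, so its characters do not line up with the 32-bit hash boundaries in general.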
Edit: I know there is something like the MongoDB ObjectId, which is a unique hex string built from 3 fields:
a 4-byte value representing the seconds since the Unix epoch,
a 5-byte random value, and
a 3-byte counter, starting with a random value.
https://docs.mongodb.com/manual/reference/method/ObjectId/
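It would be nice to do something similar with the token array... but I don't know how :(
Edit: I converted the numeric tokens into hex string tokens and concatenated them: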
function convertToHex(numberArray) {
  if (Array.isArray(numberArray)) {
    return numberArray.map((number) => {
      // Shift negative (signed 32-bit) values into the unsigned range.
      if (number < 0) {
        number = 0xFFFFFFFF + number + 1;
      }
      return number.toString(16).toUpperCase();
      // number = number >>> 0;
      // return pa
    });
  } else {
    // Anything that is not an array is rejected.
    return null;
  }
}
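This way I get similar hex strings for similar documents, but as the number of permutations grows, the number of tokens grows with it, and the hex strings are no longer comparable.
So I tried splitting the long hex string into smaller tokens, but I ran into the same problem: the number of matching tokens drops, because larger tokens with only small differences are created, which lowers the overall similarity in the end...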
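One direction that might avoid the problem, as a rough sketch under my own assumptions: pad every hash to a fixed width of 8 hex characters, and compare two strings chunk by chunk at the same positions instead of as free-form tokens. `toFixedHex` and `hexSimilarity` are hypothetical names, and the input values are the ones from the example output above.

```javascript
// Convert each hash to exactly 8 uppercase hex characters and concatenate.
function toFixedHex(numberArray) {
  return numberArray
    .map((n) => (n >>> 0).toString(16).toUpperCase().padStart(8, '0'))
    .join('');
}

// Compare two equally long hex strings in aligned chunks of `width` characters.
function hexSimilarity(hexA, hexB, width = 8) {
  if (hexA.length === 0 || hexA.length !== hexB.length || hexA.length % width !== 0) {
    throw new Error('hex strings must be non-empty, equally long, and a multiple of width');
  }
  let shared = 0;
  for (let i = 0; i < hexA.length; i += width) {
    if (hexA.slice(i, i + width) === hexB.slice(i, i + width)) shared++;
  }
  return shared / (hexA.length / width);
}

const hexDog = toFixedHex([
  -3677935418, -3191969143, -2264094193, -4003895775, -4077650760, -3917776217,
]);
const hexLazy = toFixedHex([
  -3677935418, -3191969143, -2264094193, -4003895775, -4077650760, -2302592728,
]);

console.log(hexSimilarity(hexDog, hexLazy)); // 5 of 6 chunks match
```

With a fixed chunk width, the chunk boundaries always coincide with the hash boundaries, so a single differing hash only affects one chunk. The string still grows with the number of permutations, though, so this only helps as long as both strings were built with the same number of permutations.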