JavaScript MinHash function to generate a signature hash key for a text string

Date: 2019-03-20 09:37:01

Tags: javascript node.js hash sentence-similarity minhash

I'm currently experimenting with the MinHash algorithm to test the similarity between two texts (or documents), based on the following module: https://github.com/sjhorn/node-minhash/blob/master/lib/minhash.js

In that function I split the text into tokens (k-shingles) and then compute a hash value for each token/shingle:

crc32.str(token) & 0xffffffff
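
To make the hashing step concrete, here is a minimal sketch of it. The names `wordShingles` and `crc32Str` are made up for this illustration, the word-based shingling is an assumption about how the text is tokenized, and `crc32Str` is a plain table-based CRC-32 standing in for the `crc32.str` call above, so its values are not guaranteed to match what the module prints.

```javascript
// Lookup table for the standard CRC-32 polynomial (reflected form 0xEDB88320).
const CRC_TABLE = new Int32Array(256);
for (let n = 0; n < 256; n++) {
  let c = n;
  for (let bit = 0; bit < 8; bit++) {
    c = (c & 1) ? (0xEDB88320 ^ (c >>> 1)) : (c >>> 1);
  }
  CRC_TABLE[n] = c;
}

// CRC-32 of a string's UTF-8 bytes (hypothetical stand-in for crc32.str).
// `& 0xffffffff` leaves a signed 32-bit integer in JavaScript, so results
// can be negative; `>>> 0` would give the unsigned value instead.
function crc32Str(str) {
  let crc = -1;
  for (const byte of new TextEncoder().encode(str)) {
    crc = CRC_TABLE[(crc ^ byte) & 0xff] ^ (crc >>> 8);
  }
  return (crc ^ -1) & 0xffffffff;
}

// Split a text into overlapping shingles of k consecutive words.
// Assumption: lowercase, alphanumeric word tokens; other schemes are possible.
function wordShingles(text, k) {
  const words = text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  const shingles = new Set();
  for (let i = 0; i + k <= words.length; i++) {
    shingles.add(words.slice(i, i + k).join(' '));
  }
  return shingles;
}

const shingles = wordShingles('The quick brown fox jumps over the lazy dog.', 3);
const hashes = [...shingles].map((shingle) => crc32Str(shingle));
console.log(hashes); // one signed 32-bit number per shingle
```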

So I get back a set of numbers for the different token parts. Example:

The quick brown fox jumps over the lazy dog.  ==  -3677935418
The quick brown fox jumps over the lazy dog.  ==  -3191969143
The quick brown fox jumps over the lazy dog.  ==  -2264094193
The quick brown fox jumps over the lazy dog.  ==  -4003895775
The quick brown fox jumps over the lazy dog.  ==  -4077650760
The quick brown fox jumps over the lazy dog.  ==  -3917776217
The quick brown fox jumps over the lazy.  ==  -3677935418
The quick brown fox jumps over the lazy.  ==  -3191969143
The quick brown fox jumps over the lazy.  ==  -2264094193
The quick brown fox jumps over the lazy.  ==  -4003895775
The quick brown fox jumps over the lazy.  ==  -4077650760
The quick brown fox jumps over the lazy.  ==  -2302592728
M1 length: 6
M2 length: 6
Minhash1: -3677935418  Minhash2: -3677935418
Minhash1: -3191969143  Minhash2: -3191969143
Minhash1: -2264094193  Minhash2: -2264094193
Minhash1: -4003895775  Minhash2: -4003895775
Minhash1: -4077650760  Minhash2: -4077650760
shared: 5 total: 6
Shared/Total: 0.8333333333333334

Comparing the two signatures slot by slot, 5 of the 6 hashes match. In this example the number of permutations is 6.
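
For reference, the slot-by-slot comparison can be sketched as follows. This is only an illustration under stated assumptions: it uses a seeded FNV-1a hash as the family of hash functions rather than whatever the linked module does internally, and the names `fnv1a`, `minhashSignature`, and `estimateSimilarity` are hypothetical.

```javascript
// Seeded 32-bit FNV-1a hash (assumption: one seed per "permutation").
function fnv1a(str, seed) {
  // Mix the seed into the state first so each seed gives a different function.
  let h = Math.imul(0x811c9dc5 ^ seed, 0x01000193) >>> 0;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// For each of numHashes hash functions, keep the minimum hash over all shingles.
function minhashSignature(shingles, numHashes) {
  const signature = new Array(numHashes).fill(0xffffffff);
  for (const shingle of shingles) {
    for (let i = 0; i < numHashes; i++) {
      const h = fnv1a(shingle, i + 1);
      if (h < signature[i]) signature[i] = h;
    }
  }
  return signature;
}

// Fraction of slots where both signatures agree (the "shared / total" figure).
function estimateSimilarity(sigA, sigB) {
  if (sigA.length !== sigB.length) {
    throw new Error('signatures must use the same number of hash functions');
  }
  let shared = 0;
  for (let i = 0; i < sigA.length; i++) {
    if (sigA[i] === sigB[i]) shared++;
  }
  return shared / sigA.length;
}

// The exact Jaccard similarity of these two sets is 2 / 4 = 0.5;
// the estimate from only 6 slots can deviate noticeably from that.
const a = new Set(['the quick', 'quick brown', 'brown fox']);
const b = new Set(['the quick', 'quick brown', 'brown dog']);
const sigA = minhashSignature(a, 6);
const sigB = minhashSignature(b, 6);
console.log(estimateSimilarity(sigA, sigB));
```

With more hash functions the estimate gets less noisy, at the cost of a longer signature.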

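Now my question: how can I create a single signature hash string for a text? All the modules I found only compare texts directly with each other and output a similarity coefficient. For similar texts/documents the hash strings should be similar as well. Example (something like this):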

The quick brown fox jumps over the lazy dog

Hash: KV5rsUfZpcZdVojpG8mHLA==

The quick brown fox jumps over the lazy

Hash: KV5rsUfZpcZdVojpG8hTPS==

Is it possible to build one identifying hash string from the individual token hashes, and encode it as a hex string or something similar?
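
One possible direction, sketched here under assumptions rather than as a definitive answer: a MinHash signature is already a fixed-length list of numbers, so it can be packed into a single Base64 key with Node's `Buffer`. The function names are hypothetical, and the sample values are the ones printed above, wrapped into the unsigned 32-bit range. Because Base64 characters straddle the 32-bit boundaries, the sketch decodes the keys before comparing them instead of comparing characters.

```javascript
// Pack each slot as a big-endian unsigned 32-bit integer, then Base64-encode.
function signatureToKey(signature) {
  const buf = Buffer.alloc(signature.length * 4);
  signature.forEach((value, i) => buf.writeUInt32BE(value >>> 0, i * 4));
  return buf.toString('base64');
}

// Decode a key back into its unsigned 32-bit slots.
function keyToSignature(key) {
  const buf = Buffer.from(key, 'base64');
  const signature = [];
  for (let offset = 0; offset + 4 <= buf.length; offset += 4) {
    signature.push(buf.readUInt32BE(offset));
  }
  return signature;
}

// Fraction of slots on which two keys agree.
function keySimilarity(keyA, keyB) {
  const a = keyToSignature(keyA);
  const b = keyToSignature(keyB);
  if (a.length !== b.length) {
    throw new Error('keys were built with different signature lengths');
  }
  let shared = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] === b[i]) shared++;
  }
  return shared / a.length;
}

// Sample values taken from the output above; `>>> 0` wraps them modulo 2^32.
const m1 = [-3677935418, -3191969143, -2264094193, -4003895775, -4077650760, -3917776217];
const m2 = [-3677935418, -3191969143, -2264094193, -4003895775, -4077650760, -2302592728];
const key1 = signatureToKey(m1);
const key2 = signatureToKey(m2);
console.log(key1, key2, keySimilarity(key1, key2)); // 5 of 6 slots agree
```

The key length depends only on the number of permutations, not on the length of the text, so keys built with the same settings can be stored and compared later without the original documents.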

Edit: I know there is something like the MongoDB ObjectId, a unique hex string made up of three fields:

a 4-byte value representing the seconds since the Unix epoch,
a 5-byte random value, and
a 3-byte counter, starting with a random value.

https://docs.mongodb.com/manual/reference/method/ObjectId/

Doing something similar with the token array would be nice... but I don't know how :(

Edit: I converted the numeric tokens to hex string tokens and concatenated them:

function convertToHex(numberArray) {
    // Only arrays of numbers are supported; anything else yields null.
    if (!Array.isArray(numberArray)) {
        return null;
    }

    return numberArray.map((number) => {
        // Map a negative value onto its unsigned 32-bit equivalent.
        if (number < 0) {
            number = 0xFFFFFFFF + number + 1;
        }

        return number.toString(16).toUpperCase();
        // number = number >>> 0;
        // return pa
    });
}

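This gives me similar hex strings for similar documents, but the string gets longer as the number of permutations (and therefore tokens) grows, and the hex strings are no longer comparable.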
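
One thing that may contribute to the strings not lining up is that `toString(16)` drops leading zeros, so the hex tokens have varying widths. A minimal sketch of a fixed-width variant follows, with hypothetical function names: every slot is padded to 8 hex characters, so slot i always occupies characters 8i to 8i+7 and two strings built with the same number of permutations can be compared slot by slot.

```javascript
// Encode each slot as exactly 8 uppercase hex characters and concatenate.
function signatureToHex(signature) {
  return signature
    .map((n) => (n >>> 0).toString(16).toUpperCase().padStart(8, '0'))
    .join('');
}

// Compare two fixed-width hex strings in 8-character blocks.
function hexSimilarity(hexA, hexB) {
  if (hexA.length === 0 || hexA.length !== hexB.length || hexA.length % 8 !== 0) {
    throw new Error('expected two non-empty hex strings of equal length, 8 chars per slot');
  }
  let shared = 0;
  for (let i = 0; i < hexA.length; i += 8) {
    if (hexA.slice(i, i + 8) === hexB.slice(i, i + 8)) shared++;
  }
  return shared / (hexA.length / 8);
}

const hexA = signatureToHex([1, 2, 3, 4]);
const hexB = signatureToHex([1, 2, 3, 5]);
console.log(hexA);                      // 32 characters for 4 slots
console.log(hexSimilarity(hexA, hexB)); // 3 of 4 blocks agree
```

The string still grows by 8 characters per permutation, but for a fixed permutation count its length is constant regardless of the document.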

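So I tried breaking the long hex string into smaller tokens, but I run into the same problem: the number of matching tokens goes down, because larger tokens with smaller differences are being created, which in the end lowers the overall similarity...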
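
A different way to keep the key short, instead of re-splitting the hex string, is to truncate each slot: the idea known as b-bit minwise hashing keeps only the lowest b bits of every min-hash. The sketch below assumes b = 8 (two hex characters per slot) and uses hypothetical names. Truncated slots can match by chance with probability of roughly 1/2^b, so the raw match fraction overestimates the similarity; the correction applied here is a common approximation and should be treated as an assumption, not an exact formula.

```javascript
const BITS = 8;                      // bits kept per slot (assumption)
const HEX_PER_SLOT = BITS / 4;       // 2 hex characters per slot

// Keep only the lowest BITS bits of each slot and encode them as hex.
function compactKey(signature) {
  const mask = 2 ** BITS - 1;
  return signature
    .map((n) => ((n >>> 0) & mask).toString(16).padStart(HEX_PER_SLOT, '0'))
    .join('');
}

// Compare two compact keys slot by slot and subtract the chance-collision rate.
function compactSimilarity(keyA, keyB) {
  if (keyA.length === 0 || keyA.length !== keyB.length || keyA.length % HEX_PER_SLOT !== 0) {
    throw new Error('expected two non-empty keys of equal length');
  }
  let shared = 0;
  for (let i = 0; i < keyA.length; i += HEX_PER_SLOT) {
    if (keyA.slice(i, i + HEX_PER_SLOT) === keyB.slice(i, i + HEX_PER_SLOT)) shared++;
  }
  const raw = shared / (keyA.length / HEX_PER_SLOT);
  const chance = 1 / 2 ** BITS;      // approximate accidental match rate
  return Math.max(0, (raw - chance) / (1 - chance));
}

const shortKey = compactKey([0x12345678, 0xABCDEF01]);
console.log(shortKey); // 4 hex characters for 2 slots
```

With this layout the permutation count can be raised to reduce noise while the key grows by only two characters per extra permutation.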

0 answers:

No answers yet