large-scale string comparison

Date: 2017-10-30 15:33:41

Tags: python string-comparison large-data

I'm trying to find similar string pairs in a large string set. I have around 100 million strings, and string similarity is measured by edit distance. For example, "this is a sentence" and "this is also a sentence" are similar.

It is not practical to calculate the similarity between every pair of strings, which would result in 100M x 100M calculations. I'm considering a divide-and-conquer strategy: first group the strings into "roughly similar" subsets, then calculate the similarity of each string pair within a subset. For example, say I have the following 5 strings,

str1 = "this is a sentence"
str2 = "this is also a sentence"
str3 = "Mary loves elephants"
str4 = "Mary loves an elephant"
str5 = "Mark loves elephants"

I hope to get one subset [str1, str2] and another subset [str3, str4, str5]. Then I'll compare str1 and str2 to see if they are similar, and likewise compare str3, str4, and str5 to find similar pairs. The total number of calculations is reduced from C(5,2) = 10 to C(2,2) + C(3,2) = 1 + 3 = 4.
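For illustration, the within-subset pass would just be brute force, for example (a rough sketch using the python-Levenshtein package, with an arbitrary distance threshold of 5):

from itertools import combinations
import Levenshtein as L

def similar_pairs_in_subset(subset, max_dist=5):
    # brute force, but only inside one small subset
    return [(a, b) for a, b in combinations(subset, 2)
            if L.distance(a, b) <= max_dist]

similar_pairs_in_subset([str1, str2])        # C(2,2) = 1 comparison
similar_pairs_in_subset([str3, str4, str5])  # C(3,2) = 3 comparisons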

The dividing needs to be fast, so it does not need to be accurate. Subsets can overlap, and it is acceptable if a string's similar pair occasionally falls into a different subset; in that case I'll scan a nearby subset.

I was trying to find an order-preserving hash that roughly maps strings to integers (collisions do not matter), so that I could check each string only against candidate strings with nearby integers. But I have failed to find such an algorithm.

I'm using Python, but I'm willing to try a solution even if it is only applicable in another programming language.

Thank you very much.

1 Answer:

Answer 0: (score: 0)

You can use the Levenshtein distance as the key function when sorting.

import re
import requests
import Levenshtein as L

def download_moby_dick():
    moby_dick_url = 'https://www.gutenberg.org/files/2701/2701-0.txt'
    return requests.get(moby_dick_url).text

def sentences_in_book(book):
    # split on sentence-ending punctuation or on blank lines
    sentences = re.split(r'[.;?!]\s|\r\n\r\n', book)
    # collapse whitespace runs and strip leading/trailing spaces
    sentences = (re.sub(r'\s+', ' ', s).strip() for s in sentences)
    # drop very short fragments
    sentences = (s for s in sentences if len(s) > 10)
    return list(sentences)

sentences = sentences_in_book(download_moby_dick())

# sort by length
sentences.sort(key=len)

# median length sentence
target = sentences[len(sentences)//2]

# sort by levenshtein distance to target
def keyfunc(sentence):
    return L.distance(target, sentence) 

sentences.sort(key=keyfunc)

This gives a rough ordering in which similar sentences end up grouped together. To speed it up, you will probably have to break the task down further: for example, abbreviate the input sentences by using only some of the letters from each word, search only among sentences of roughly the same length, and so on.
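A minimal sketch of both of those ideas, assuming the sentences list built above; the helper names (abbreviate, length_buckets, candidate_pairs) and the thresholds are made up for illustration:

from collections import defaultdict
from itertools import combinations, product
import Levenshtein as L

def abbreviate(sentence, k=2):
    # keep only the first k letters of each word as a cheap signature
    return ' '.join(word[:k] for word in sentence.split())

def length_buckets(sentences, width=10):
    # group sentences whose lengths fall into the same width-character window
    buckets = defaultdict(list)
    for s in sentences:
        buckets[len(s) // width].append(s)
    return buckets

def candidate_pairs(buckets):
    # pairs within each bucket, plus cross pairs with the next bucket,
    # so pairs near a bucket boundary are not missed
    for i in sorted(buckets):
        yield from combinations(buckets[i], 2)
        yield from product(buckets[i], buckets.get(i + 1, []))

def similar_pairs(sentences, max_dist=5):
    for a, b in candidate_pairs(length_buckets(sentences)):
        # cheap pre-filter on the abbreviated signatures ...
        if L.distance(abbreviate(a), abbreviate(b)) <= max_dist:
            # ... before paying for the full comparison
            if L.distance(a, b) <= max_dist:
                yield a, b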