How to detect question similarity using Locality Sensitive Hashing?

Date: 2016-02-24 11:51:05

Tags: python algorithm information-retrieval locality-sensitive-hash

We are trying to implement question-similarity detection using a locality-sensitive hashing algorithm, with the lshash Python package.

Our goal is to implement something like the similar-question suggestions that Stack Overflow shows when you ask a question.

Here is our sample data text file:

    The food didn't taste very good, and actually I don't feel very well now
    He can pull strings for you
    I saw him???
    The blue SUV in front of the Honda
    I gave my seat to the old lady
    Susan spent the summer vacation at her grandmother's
    Do you want anything to eat?
    A water molecule has two hydrogen atoms and one oxygen atom
    He's away on business
    Are you here for work?
    I had a strange dream last night
    The boy began to cry
    She pointed her finger at him
    No matter who says so, it's not true
    May I have a receipt?
    She loves him
    Where is the nearest bank?
    Tired from the hard work, he went to bed earlier than usual
    He has not written to them for a long time
    Do you have any brothers?
    I have to buy a new pair of skis
    Winter is my favorite season
    Why did this happen?
    Tom seems very happy
    It was cold, so we lit a fire
    I look forward to my birthday
    She attacked him with a baseball bat
    You're a really good cook
    That's too much
    I expect a subway station will be here in the future
    what is photosynthesis?
    what is mathematics?
    do you know about photosynthesis?

Here is the Python code:

    from lshash import LSHash
    from nltk.corpus import stopwords

    # Constants
    HASH_SIZE = 16          # length of each binary hash
    INPUT_DIMENSION = 50    # fixed length of every indexed/queried vector
    NUM_HASHTABLES = 20
    INPUT_FILE = 'test-cases.txt'

    lsh = LSHash(HASH_SIZE, INPUT_DIMENSION, NUM_HASHTABLES)
    cachedStopWords = stopwords.words("english")
    dict_questions = {}          # index -> original question
    dict_no_stop_questions = {}  # index -> lowercased question without stop words
    dict_ascii_questions = {}    # index -> zero-padded list of character ordinals

    def remove_stop(text):
        return ' '.join(word for word in text.split() if word not in cachedStopWords)

    def remove_special_chars(text):
        return ''.join(c for c in text if c.isalnum() or c.isspace())

    def append_dummy(arr):
        # Zero-pad so every vector has at least INPUT_DIMENSION entries
        if len(arr) < INPUT_DIMENSION:
            arr.extend([0] * (INPUT_DIMENSION - len(arr)))

    def get_original_form(search_item):
        # Reverse lookup: match the returned vector against every stored one
        for key, value in dict_ascii_questions.iteritems():
            if value[:INPUT_DIMENSION] == list(search_item[0]):
                return dict_questions[key] + " # " + dict_no_stop_questions[key]
        return ""

    with open(INPUT_FILE) as f:
        questions = f.readlines()

    for index, question in enumerate(questions):
        dict_questions[index] = question.strip()
        dict_no_stop_questions[index] = remove_stop(remove_special_chars(question.lower()))
        value = [ord(c) for c in dict_no_stop_questions[index]]
        append_dummy(value)
        dict_ascii_questions[index] = value

    for value in dict_ascii_questions.itervalues():
        lsh.index(value[:INPUT_DIMENSION])

    query = raw_input("Type n to exit. Input query? = ")
    while query != "n":
        aq = [ord(c) for c in remove_stop(remove_special_chars(query.lower()))]
        append_dummy(aq)
        results = lsh.query(aq[:INPUT_DIMENSION], 5)
        print "Found: " + str(len(results))
        for result in results:
            # each result is ((vector), distance)
            print "Rank: " + str(result[1]) + "  " + get_original_form(result)
        query = raw_input("Type n to exit. Input query? = ")
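A side note on get_original_form: it recovers the original sentence by scanning every stored vector. If the installed lshash version supports an extra_data argument to index() (an assumption on our part, not something the post confirms), the original text can be stored alongside each point and come back with the query results, avoiding the reverse lookup. A minimal sketch under that assumption:

    # Sketch only -- assumes this lshash build accepts extra_data, in which
    # case query() yields ((point, extra_data), distance) tuples.
    from lshash import LSHash

    demo = LSHash(6, 8, 4)
    demo.index([1, 2, 3, 4, 5, 6, 7, 8], extra_data="first sentence")
    demo.index([2, 3, 4, 5, 6, 7, 8, 9], extra_data="second sentence")

    for (point, label), distance in demo.query([1, 2, 3, 4, 5, 6, 7, 7], num_results=2):
        print "%s (distance %.2f)" % (label, distance)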

However, this implementation gives poor results. Can someone guide us on which type of locality-sensitive hashing algorithm to use in this context? I am also confused about the INPUT_DIMENSION parameter.
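One likely cause of the poor results: with this encoding, the Euclidean distance that lshash ranks results by measures character overlap at aligned string positions, not shared vocabulary. A small illustration (ours, not from the original post):

    import math

    def char_vec(text, dim=50):
        # Same encoding as above: character ordinals, padded/truncated to dim
        v = [ord(c) for c in text.lower()][:dim]
        return v + [0] * (dim - len(v))

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    a = char_vec("what is photosynthesis")
    b = char_vec("do you know about photosynthesis")  # similar meaning
    c = char_vec("whet is photosynthesis")            # one-letter typo of a
    # c lands far closer to a than b does, even though b is the
    # semantically related question
    print "a-b: %.1f, a-c: %.1f" % (euclidean(a, b), euclidean(a, c))

As for INPUT_DIMENSION: it is simply the fixed length that every indexed and queried vector must share, since LSHash projects vectors of exactly that dimensionality. A word-level vector of that fixed length, e.g. a hashing-trick bag of words, is usually a better fit than truncated character codes; for set-style similarity between short questions, MinHash-based LSH (e.g. the datasketch package) is another common choice. A hedged sketch of the hashing-trick variant (the helper name hashed_bow is ours, not from any library):

    from lshash import LSHash

    DIM = 128  # fixed vector length == LSHash's input_dim

    def hashed_bow(text, dim=DIM):
        # Hashing trick: each word increments one of dim buckets, so every
        # sentence maps to a vector of exactly dim entries. (Python's built-in
        # hash is process-salted on Python 3; consistent within one run.)
        vec = [0] * dim
        for word in text.lower().split():
            vec[hash(word) % dim] += 1
        return vec

    bow_lsh = LSHash(16, DIM, 20)
    for sentence in ["what is photosynthesis?",
                     "do you know about photosynthesis?",
                     "where is the nearest bank?"]:
        bow_lsh.index(hashed_bow(sentence))

    print bow_lsh.query(hashed_bow("tell me about photosynthesis"), num_results=2)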

0 Answers:

No answers yet.