We are trying to implement question-similarity detection using locality-sensitive hashing, with the lshash Python package. Our goal is something like the "Related Questions" suggestions on Stack Overflow.
Below is our sample data text file.
The food didn't taste very good, and actually I don't feel very well now
He can pull strings for you
I saw him???
The blue SUV in front of the Honda
I gave my seat to the old lady
Susan spent the summer vacation at her grandmother's
Do you want anything to eat?
A water molecule has two hydrogen atoms and one oxygen atom
He's away on business
Are you here for work?
I had a strange dream last night
The boy began to cry
She pointed her finger at him
No matter who says so, it's not true
May I have a receipt?
She loves him
Where is the nearest bank?
Tired from the hard work, he went to bed earlier than usual
He has not written to them for a long time
Do you have any brothers?
I have to buy a new pair of skis
Winter is my favorite season
Why did this happen?
Tom seems very happy
It was cold, so we lit a fire
I look forward to my birthday
She attacked him with a baseball bat
You're a really good cook
That's too much
I expect a subway station will be here in the future
what is photosynthesis?
what is mathematics?
do you know about photosynthesis?
Here is the Python code:
from lshash import LSHash
from nltk.corpus import stopwords

# Constants
HASH_SIZE = 16          # bits per hash
INPUT_DIMENSION = 50    # length every indexed/queried vector must have
NUM_HASHTABLES = 20
INPUT_FILE = 'test-cases.txt'

lsh = LSHash(HASH_SIZE, INPUT_DIMENSION, NUM_HASHTABLES)
cachedStopWords = stopwords.words("english")

dict_questions = {}          # index -> original question text
dict_no_stop_questions = {}  # index -> lowercased text minus stop words
dict_ascii_questions = {}    # index -> zero-padded list of character codes

def remove_stop(text):
    return ' '.join(word for word in text.split() if word not in cachedStopWords)

def remove_special_chars(text):
    return ''.join(c for c in text if c.isalnum() or c.isspace())

def append_dummy(arr):
    """Zero-pad arr in place up to INPUT_DIMENSION."""
    if len(arr) < INPUT_DIMENSION:
        arr.extend([0] * (INPUT_DIMENSION - len(arr)))

def get_original_form(search_item):
    """Map a query result vector back to the original question text."""
    for key, value in dict_ascii_questions.items():
        if value[:INPUT_DIMENSION] == list(search_item[0]):
            return dict_questions[key] + " # " + dict_no_stop_questions[key]
    return ""

with open(INPUT_FILE) as f:
    questions = f.readlines()

for index, question in enumerate(questions):
    dict_questions[index] = question
    dict_no_stop_questions[index] = remove_stop(remove_special_chars(question.lower()))
    value = [ord(c) for c in dict_no_stop_questions[index]]
    append_dummy(value)
    dict_ascii_questions[index] = value

for value in dict_ascii_questions.values():
    lsh.index(value[:INPUT_DIMENSION])

query = input("Type n to exit. Query? = ")
while query != "n":
    aq = [ord(c) for c in remove_stop(remove_special_chars(query.lower()))]
    append_dummy(aq)
    results = lsh.query(aq[:INPUT_DIMENSION], 5)
    print("Found: " + str(len(results)))
    for result in results:
        # result[1] is the distance reported by lshash (smaller = closer)
        print("Distance: " + str(result[1]) + " " + get_original_form(result))
    query = input("Type n to exit. Query? = ")
However, this implementation gives poor results. Can someone advise which type of locality-sensitive hashing algorithm is suited to this context? We are also confused about the parameter INPUT_DIMENSION.
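To make the confusion concrete, here is our current mental model of INPUT_DIMENSION, written as a minimal, self-contained sketch of random-projection LSH. This is only an illustration with made-up constants, not the lshash implementation: as we understand it, input_dim is the exact number of components every indexed or queried vector must have (hence our zero-padding), and the hash is the sign pattern of the vector's dot products with random hyperplanes.

```python
import random

random.seed(0)        # deterministic for illustration
HASH_SIZE = 16        # bits per hash
INPUT_DIMENSION = 50  # every vector must have exactly this many components

# One random Gaussian hyperplane (normal vector) per output hash bit.
planes = [[random.gauss(0, 1) for _ in range(INPUT_DIMENSION)]
          for _ in range(HASH_SIZE)]

def lsh_hash(vec):
    """The sign of the dot product with each hyperplane gives one hash bit."""
    if len(vec) != INPUT_DIMENSION:
        raise ValueError("vector length must equal INPUT_DIMENSION")
    bits = []
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        bits.append('1' if dot >= 0 else '0')
    return ''.join(bits)

# Vectors at a small angle tend to share most hash bits (same bucket);
# a vector and its negation differ in every bit.
v = [1.0] * INPUT_DIMENSION
print(lsh_hash(v))
```

If this model is right, is padding character codes with zeros a sensible way to meet the fixed-dimension requirement, or does this family expect a different vectorization (e.g. bag-of-words counts)?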