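I want to match an input string against a list of tuples and find the top N closest matches from that list. The list has about 2,000 items. The problem is that I used fuzzywuzzy's process.extract method, but it returns a large number of tuples with the same confidence score. The quality of the matches is also poor. What I want is to get all matches for my input (order does not matter).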
Example:
input string: 'fruit apple'
List of tuples = [('apple fruit', 91), ('the fruit is an apple', 34), ('banana apple', 78), ('guava tree', 11), ('delicious apple', 88)]
Here, I want to find all strings in the list that contain the words 'fruit' and 'apple', in any order.
Expected output:
[('apple fruit', 91), ('the fruit is an apple', 34)]
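I know fuzzywuzzy makes this a one-liner, but the problem is that when the list of tuples to check is large, fuzzywuzzy assigns the same confidence score to unrelated items.
The code I have tried so far is attached for reference: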
import csv
import re

import nltk
from fuzzywuzzy import process
from nltk.corpus import stopwords

def preprocessing(fruit_string):
    # Remove slash-separated abbreviations, then punctuation,
    # then stop words and words of two characters or fewer.
    stop_words = stopwords.words('english')
    fruit_string = re.sub(r'[a-z][/][a-z][/]*[a-z]{0,1}', '', fruit_string)
    fruit_string = re.sub(r'[^A-Za-z0-9\s]+', '', fruit_string)
    return ' '.join(each_word for each_word in fruit_string.split() if each_word not in stop_words and len(each_word) > 2)

# All possible fruit combination list
nrows = []
with open("D:/fruits.csv", 'r') as csvfile:
    csvreader = csv.reader(csvfile)
    fields = next(csvreader)  # skip the header row
    for row in csvreader:
        nrows.append(row)
flat_list = [item for items in nrows for item in items]

def get_matching_fruits(input_raw_text):
    preprocessed_synonym = preprocessing(input_raw_text)
    text = nltk.word_tokenize(preprocessed_synonym)
    pos_tagged = nltk.pos_tag(text)
    # Keep only the nouns, grouped by tag: NN, then NNP, then NNS
    list_nn = [x for x in pos_tagged if x[1] == 'NN']
    list_nnp = [x for x in pos_tagged if x[1] == 'NNP']
    list_nns = [x for x in pos_tagged if x[1] == 'NNS']
    comb_nouns = list_nn + list_nnp + list_nns
    input_nouns = ' '.join(i[0] for i in comb_nouns)
    ratios = process.extract(input_nouns, flat_list, limit=1000)
    result = []
    for i in ratios:
        # Substring test: matches only if the nouns appear as one contiguous phrase
        if input_nouns in i[0]:
            result.append(i)
    return result

get_matching_fruits('blue shaped pear was found today')
So, in my code, I want the result list to contain all possible matches for any given input. Any help with this is appreciated.
Answer 0 (score: 1)
For me, the simplest solution is this:
foo = 'fruit apple'
bar = [('apple fruit', 91),
       ('the fruit is an apple', 34),
       ('banana apple', 78),
       ('guava tree', 11),
       ('delicious apple', 88)]
matches = []
for entry in bar:
    for word in foo.split():
        # break if we meet a point where the word isn't found
        if word not in entry[0]:
            break
    # the else is met if we didn't break from the for loop
    else:
        matches.append(entry)
print(matches)
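One caveat with the loop above: `word not in entry[0]` is a substring test, so 'apple' would also be found inside 'pineapple'. If that matters, a whole-word variant can be sketched as follows. It assumes that lowercasing and splitting on whitespace is good enough as tokenization, and the helper name `contains_all_words` is hypothetical, not something from the question.

```python
def contains_all_words(query, text):
    """Return True if every word of `query` occurs as a whole word in `text`."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    # Subset test: the order of the words does not matter
    return query_words <= text_words

bar = [('apple fruit', 91),
       ('the fruit is an apple', 34),
       ('banana apple', 78),
       ('guava tree', 11),
       ('delicious apple', 88)]

whole_word_matches = [entry for entry in bar
                      if contains_all_words('fruit apple', entry[0])]
print(whole_word_matches)
# [('apple fruit', 91), ('the fruit is an apple', 34)]
```

For the example data this gives the same result as the substring version, but `contains_all_words('apple', 'pineapple juice')` is False, whereas the substring test would accept it.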
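Answer 1 (score: 1)
Sorry if I have not understood the question correctly, but why do you even need the NLTK library for this? This is a simple list comprehension problem.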
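A minimal sketch of the list-comprehension approach this answer describes, reusing the example data from the question. The function name `find_matches` and the `top_n` parameter are assumptions made for illustration; the filter is a plain substring check, and the optional ranking simply uses the score already stored in each tuple.

```python
import heapq

def find_matches(query, entries, top_n=None):
    """Return the entries whose text contains every word of `query`, in any order."""
    words = query.split()
    matches = [entry for entry in entries
               if all(word in entry[0] for word in words)]
    if top_n is None:
        return matches
    # Optionally keep only the top_n matches, highest stored score first
    return heapq.nlargest(top_n, matches, key=lambda entry: entry[1])

entries = [('apple fruit', 91),
           ('the fruit is an apple', 34),
           ('banana apple', 78),
           ('guava tree', 11),
           ('delicious apple', 88)]

print(find_matches('fruit apple', entries))
# [('apple fruit', 91), ('the fruit is an apple', 34)]
print(find_matches('apple', entries, top_n=2))
# [('apple fruit', 91), ('delicious apple', 88)]
```

With about 2,000 items this linear scan is cheap, so no fuzzy scoring is needed when the goal is simply "contains all of these words".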