Find all possible matches of an input string against a list of tuples in Python (in any order)

Time: 2020-08-13 16:40:27

Tags: python python-3.x regex string-matching fuzzywuzzy

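I want to match an input string against a list of tuples and find the top N closest matches in that list. The list has about 2,000 items. The problem is that I used fuzzywuzzy's process.extract method, but it returns a large number of tuples with the same confidence score, and the quality of the matches is poor. What I want is to get all matches for my input, regardless of word order.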

Example: 
input string: 'fruit apple'
    
List of tuples = [('apple fruit', 91), ('the fruit is an apple', 34), ('banana apple', 78), ('guava tree', 11), ('delicious apple', 88)]

Here, I want to find all strings in the list that contain the words 'fruit' and 'apple' in any order.

Expected output:
[('apple fruit', 91), ('the fruit is an apple', 34)]

I know fuzzywuzzy is a one-liner, but the problem is that when the list of tuples to check is large, fuzzywuzzy assigns the same confidence score to unrelated items.
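
One way to avoid ties among unrelated items is to score each candidate by the fraction of the input's words it contains, ignoring order, and then keep the top N. The sketch below uses only the standard library rather than fuzzywuzzy; the function name `rank_by_word_coverage` and the scoring rule are assumptions made for illustration and are not part of the question's code.

```python
def rank_by_word_coverage(query, candidates, top_n=5):
    """Rank (text, score) tuples by the fraction of query words they contain.

    Hypothetical helper: word order is ignored and only whole words count.
    """
    query_words = set(query.lower().split())
    if not query_words:
        return []
    ranked = []
    for text, original_score in candidates:
        candidate_words = set(text.lower().split())
        coverage = len(query_words & candidate_words) / len(query_words)
        if coverage > 0:  # drop candidates that share no word with the query
            ranked.append((coverage, (text, original_score)))
    # Python's sort is stable, so ties keep their original list order
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked[:top_n]


candidates = [('apple fruit', 91), ('the fruit is an apple', 34),
              ('banana apple', 78), ('guava tree', 11), ('delicious apple', 88)]
top = rank_by_word_coverage('fruit apple', candidates, top_n=3)
print(top)
```

Entries with a coverage of 1.0 contain every input word, so filtering on `coverage == 1.0` yields the "all words, any order" subset, while lower values give partial matches for a top-N list.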

The code I have tried so far is attached for reference:

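import csv
import re

import nltk
from nltk.corpus import stopwords
from fuzzywuzzy import process

def preprocessing(fruit_string):
    # Strip slash-separated abbreviations and punctuation, then drop
    # English stop words and words of two characters or fewer
    stop_words = stopwords.words('english')
    fruit_string = re.sub(r'[a-z][/][a-z][/]*[a-z]{0,1}', '', fruit_string)
    fruit_string = re.sub(r'[^A-Za-z0-9\s]+', '', fruit_string)
    return ' '.join(each_word for each_word in fruit_string.split() if each_word not in stop_words and len(each_word) > 2)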
    

#All possible fruit combination list
nrows=[]
with open("D:/fruits.csv", 'r') as csvfile: 
    csvreader = csv.reader(csvfile)
    fields = next(csvreader)
    for row in csvreader: 
        nrows.append(row)
        
flat_list = [item for items in nrows for item in items]        



def get_matching_fruits(input_raw_text):
    preprocessed_synonym = preprocessing(input_raw_text)
    text = nltk.word_tokenize(preprocessed_synonym)
    pos_tagged = nltk.pos_tag(text)
    # Keep only the nouns, grouped by tag in the order NN, NNP, NNS
    input_nouns = ' '.join(
        word
        for tag in ('NN', 'NNP', 'NNS')
        for word, word_tag in pos_tagged
        if word_tag == tag
    )
    ratios = process.extract(input_nouns, flat_list, limit=1000)
    result = []    
    for i in ratios:
        if input_nouns in i[0]:
            result.append(i)
    return result    

get_matching_fruits('blue shaped pear was found today')

So in my code, I want the result list to contain all possible matches for any given input. Any help with this is appreciated.

2 answers:

Answer 0: (score: 1)

For me, the simplest solution is this.

foo = 'fruit apple'
bar = [('apple fruit', 91), 
       ('the fruit is an apple', 34), 
       ('banana apple', 78), 
       ('guava tree', 11), 
       ('delicious apple', 88)]

matches = []
for entry in bar:
    for word in foo.split():
        # break if we meet a point where the word isn't found
        if word not in entry[0]:
            break
    # the else is met if we didn't break from the for loop
    else:
        matches.append(entry)

print(matches)
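
Note that `word not in entry[0]` is a substring test, so 'apple' would also match an entry such as 'pineapple fruit'. If whole-word matching is wanted, a variant along these lines compares sets of words instead. The helper name `contains_all_words` and the extra 'pineapple fruit' entry are made up for illustration.

```python
def contains_all_words(query, text):
    """Return True if every word of query appears as a whole word in text."""
    # Set inclusion ignores both word order and repeated words
    return set(query.lower().split()) <= set(text.lower().split())


foo = 'fruit apple'
bar = [('apple fruit', 91),
       ('the fruit is an apple', 34),
       ('pineapple fruit', 50),   # hypothetical entry: a substring hit only
       ('banana apple', 78)]

matches = [entry for entry in bar if contains_all_words(foo, entry[0])]
print(matches)  # [('apple fruit', 91), ('the fruit is an apple', 34)]
```

Splitting on whitespace is the simplest tokenization; entries with punctuation attached to words would need extra cleaning first.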

Answer 1: (score: 1)

Sorry if I have not understood the question correctly, but why do you even need the NLTK library for this? It is a simple list comprehension problem.
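
A list comprehension along those lines might look like the sketch below. It is one possible reading of the suggestion, not necessarily the answerer's exact code; the variable names are chosen for illustration, and the check is a plain substring test on each word of the input.

```python
input_string = 'fruit apple'
list_of_tuples = [('apple fruit', 91), ('the fruit is an apple', 34),
                  ('banana apple', 78), ('guava tree', 11),
                  ('delicious apple', 88)]

words = input_string.split()
# Keep a tuple only if every input word occurs somewhere in its string
result = [item for item in list_of_tuples
          if all(word in item[0] for word in words)]
print(result)  # [('apple fruit', 91), ('the fruit is an apple', 34)]
```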
