Question

Thanks in advance for your help.

I have a list of strings:

full_name_list = ["hello all","cat for all","dog for all","cat dog","hello cat","cat hello"]

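I need to compute a percentage match between each element and every other element in the list. For example, I first split "hello all" into ["hello", "all"]; then I can see that "hello" appears in "hello cat", so that would be a 50% match. This is what I have so far: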

    hello all   [u'hello', u'hello all', u'hello cat', u'cat hello'] [u'all', u'hello all', u'cat for all', u'dog for all'] 
    cat for all [u'cat', u'cat for all', u'cat dog', u'hello cat', u'cat hello']    [u'for', u'cat for all', u'dog for all']    [u'all', u'hello all', u'cat for all', u'dog for all']
    dog for all [u'dog', u'dog for all', u'cat dog']    [u'for', u'cat for all', u'dog for all']    [u'all', u'hello all', u'cat for all', u'dog for all']
    cat dog     [u'cat', u'cat for all', u'cat dog', u'hello cat', u'cat hello']    [u'dog', u'dog for all', u'cat dog']    
    hello cat   [u'hello', u'hello all', u'hello cat', u'cat hello']    [u'cat', u'cat for all', u'cat dog', u'hello cat', u'cat hello']    
    cat hello   [u'cat', u'cat for all', u'cat dog', u'hello cat', u'cat hello']    [u'hello', u'hello all', u'hello cat', u'cat hello']

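As you can see, the first word in each sublist is the substring being searched for, followed by the elements that contain that substring. I can do this for single-word matches, and I realized I can continue the process by simply taking the intersection of the single-word results, for example: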

    cat for all [(cat,for)  [u'cat for all']]   [(for,all)  [u'cat for all', u'dog for all']]
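
The intersection idea can be sketched roughly as follows. This is only an illustration, not the asker's code: the helper names `build_index` and `pair_matches` are made up, and it assumes whole-word matching rather than the substring test used elsewhere in the question.

```python
from itertools import combinations

full_name_list = ["hello all", "cat for all", "dog for all",
                  "cat dog", "hello cat", "cat hello"]

def build_index(names):
    """Map each word to the set of names that contain it as a whole word."""
    index = {}
    for name in names:
        for word in name.split():
            index.setdefault(word, set()).add(name)
    return index

def pair_matches(name, index):
    """For every pair of words in `name`, intersect their match sets."""
    words = name.split()
    return {
        (a, b): sorted(index[a] & index[b])
        for a, b in combinations(words, 2)
    }

index = build_index(full_name_list)
for pair, names in pair_matches("cat for all", index).items():
    print(pair, names)
```

Building the index once means each word's matches are looked up rather than searched for again, and longer word groups can be handled the same way by intersecting more sets.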

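Since I don't know how long my longest string will be, I'm doing this recursively. Also, is there a better way to do this string search? Ultimately I want to find the strings that match 100%, because in effect "hello cat" == "cat hello". I also want to find the 50% matches, and so on.

One idea I had was to use a binary tree, but how would I do that in Python? Here is my code so far: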

logical_list = []
logical_list_2 = []
logical_list_3 = []
logical_list_4 = []
match_1 = []
match_2 = []
i = 0

logical_name_full = logical_df['Logical'].tolist()
for x in logical_name_full:
    logical_sublist = [x]+x.split()
    logical_list.append(logical_sublist)



for sublist in logical_list:
    logical_list_2.append(sublist[0])
    for split_words in  sublist[1:]:
        match_1.append(split_words)
        for logical_names in logical_name_full:
            if split_words in logical_names:
                match_1.append(logical_names)
        logical_list_2.append(match_1)
        match_1 = []
    logical_list_3.append(logical_list_2)
    logical_list_2 = []
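
For comparison, the nested loops above can be condensed into a single function. This is only a sketch under some assumptions: it takes a plain list instead of the `logical_df['Logical']` column, the name `word_matches` is illustrative, and it compares whole words, whereas `split_words in logical_names` is a substring test (so "cat" would also match "concatenate").

```python
full_name_list = ["hello all", "cat for all", "dog for all",
                  "cat dog", "hello cat", "cat hello"]

def word_matches(names):
    """Return one row per name: [name, [word, matching names...], ...]."""
    rows = []
    for name in names:
        row = [name]
        for word in name.split():
            # Whole-word comparison against every name in the list
            hits = [other for other in names if word in other.split()]
            row.append([word] + hits)
        rows.append(row)
    return rows

for row in word_matches(full_name_list):
    print(row)
```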

Answer 1

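If I understand the question correctly, you have a list of strings and want to find the percentage match of a word within each string, where the percentage is the number of words in the string that equal the word, out of the string's total word count. If so, this code sample should be enough: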

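word = "hello"  # example search word; not defined in the original snippet

for i in full_name_list:
    words = i.split(" ")
    if word in words:
        total_words = len(words)
        # Count how many words in this string equal the search word
        match_words = words.count(word)
        # Multiply by 100.0 so the division is not truncated on Python 2
        print(i + " Word match: " + str(100.0 * match_words / total_words) + "%")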

To match a multi-word string, where the order of the words in the matched string doesn't matter:

word = "test string"
full_name_list = ["test some", "test string", "test string", "string test", "string test"]
results = []

for i in full_name_list:
    # Keep only strings that share at least one whole word with the query
    if len([item for item in word.split(" ") if item in i.split(" ")]) > 0:
        total_words = len(i.split(" "))
        match_words = 0.0
        for single_word in word.split(" "):
            for w in i.split(" "):
                if single_word == w:
                    match_words += 1
        results.append(i + "," + str((match_words/total_words)*100) + "%")

with open("file.csv", "w") as f:
    for i in results:
        f.write(i+"\n")
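
As a possible variation, the scoring can be wrapped in a function and the output written with the standard `csv` module, which takes care of quoting if a name ever contains a comma. This is a sketch, not the code above: `match_percent` and `write_results` are made-up names, the sample names are arbitrary, and repeated words in the query are counted once because the query is turned into a set.

```python
import csv
import io

def match_percent(query, name):
    """Percentage of words in `name` that also appear in `query` (order ignored)."""
    query_words = set(query.split())
    name_words = name.split()
    if not name_words:
        return 0.0
    hits = sum(1 for w in name_words if w in query_words)
    return 100.0 * hits / len(name_words)

def write_results(query, names, handle):
    """Write one CSV row per name that shares at least one word with `query`."""
    writer = csv.writer(handle)
    writer.writerow(["name", "match_percent"])
    for name in names:
        pct = match_percent(query, name)
        if pct > 0:
            writer.writerow([name, "{:.1f}".format(pct)])

# An in-memory buffer stands in for a file here; with a real file,
# open it with newline="" as the csv documentation recommends.
buffer = io.StringIO()
write_results("test string", ["test some", "string test", "other words"], buffer)
print(buffer.getvalue())
```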

Answer 2

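I think I know what you're asking (if not, just comment on my answer and I'll do my best to help). I wrote a small program that does what I think you're asking for: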

full_name_list = ["hello all","cat for all","dog for all","cat dog","hello cat","cat hello"]

for i in range(len(full_name_list)):
    full_name_list[i] = full_name_list[i].split(' ')

def match(i, j):
    word = full_name_list[i][j]

    # enumerate gives each name's position directly, so duplicate names are handled
    for k, fullname in enumerate(full_name_list):
        if k == i: continue  # don't match a name against itself

        for name in fullname:
            if word == name:
                fullname_str = ' '.join(fullname)

                return '"{}" is a {}% match to "{}"'.format(name, int(100/len(fullname)), fullname_str)

print(match(0,1))

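You pass two arguments: i, the index of the name in the list, and j, the index of the word within that full name. The function then returns a string saying which name the word matched and how strong the match is. It also avoids matching a word against its own name. I ran the function once at the bottom. It looks for a match for the word "all" from "hello all", and it succeeds.

Again, please let me know if I haven't answered this well. It only returns the first match it finds, but it could easily be modified to return all matches.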
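
One way the "return all matches" change might look is sketched below. It is an assumption about what such a modification could be, not code from the answer: `all_matches` is a made-up name, and it takes the unsplit list as a parameter instead of modifying `full_name_list` in place.

```python
full_name_list = ["hello all", "cat for all", "dog for all",
                  "cat dog", "hello cat", "cat hello"]

def all_matches(i, j, names):
    """Return (word, percent, name) for every other name containing word j of name i."""
    split_names = [name.split() for name in names]
    word = split_names[i][j]
    results = []
    for k, words in enumerate(split_names):
        if k == i:
            continue  # skip the name the word came from
        if word in words:
            # One matching word out of len(words), as an integer percentage
            results.append((word, 100 // len(words), names[k]))
    return results

for word, pct, name in all_matches(0, 1, full_name_list):
    print('"{}" is a {}% match to "{}"'.format(word, pct, name))
```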

Answer 3

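I made the changes you asked for. Just so you know, I used the powerset function I got from here, which imports from itertools (built into Python). If that's a problem, let me know.

Here is the new code. I run it at the bottom so you can see it in action. You pass the index i into the matches function, where i is the index of the name in full_name_list. I believe that's everything you asked for.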

from itertools import chain, combinations

full_name_list = ["hello all","cat for all","dog for all","cat dog","hello cat","cat hello"]

for i in range(len(full_name_list)):
    full_name_list[i] = full_name_list[i].split(' ')

def powerset(iterable):
    s = list(iterable)
    return list(chain.from_iterable(combinations(s, r) for r in range(1, len(s)+1)))


def subset(string, container):
    # True if the tuple `string` is one of the ordered combinations of `container`
    return string in powerset(container)

def makestring(names):
    # Join the words back into a single space-separated string
    return ' '.join(names)

def matches(i):
    results = []

    fullnamePS = powerset(full_name_list[i])

    # Use enumerate so the skip test works even when the list has duplicate names
    for k, fullname in enumerate(full_name_list):
        if k == i: continue

        for names in fullnamePS:
            if subset(names, fullname):
                results.append((int(100 * len(names)/len(fullname)), makestring(names), makestring(fullname)))

    return results

for result in matches(1):
    print('"{}" is a {}% match to "{}"'.format(result[1],result[0],result[2]))
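
Because `combinations` keeps the words in their original order, the powerset test above treats ("hello", "cat") and "cat hello" as different. If word order should not matter, as the question suggests, a set-based comparison is one possible alternative. The sketch below is an assumption about how that could look, not the answer's method: `overlap_percent` and `ranked_matches` are made-up names, only the largest overlap per name is reported, and repeated words count once.

```python
full_name_list = ["hello all", "cat for all", "dog for all",
                  "cat dog", "hello cat", "cat hello"]

def overlap_percent(query, candidate):
    """Shared words as an integer percentage of the candidate's distinct words."""
    query_words = set(query.split())
    candidate_words = set(candidate.split())
    if not candidate_words:
        return 0
    return 100 * len(query_words & candidate_words) // len(candidate_words)

def ranked_matches(i, names):
    """Score every other name against names[i], best matches first."""
    results = []
    for k, other in enumerate(names):
        if k == i:
            continue  # skip the query itself
        pct = overlap_percent(names[i], other)
        if pct:
            results.append((pct, other))
    return sorted(results, reverse=True)

for pct, name in ranked_matches(4, full_name_list):
    print('"hello cat" is a {}% match to "{}"'.format(pct, name))
```

This avoids generating every subset, so the cost per comparison grows with the number of words rather than exponentially.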

How do I recursively search for substrings in Python?

3 answers: