Finding word sequences in a string

Time: 2021-05-06 13:44:18

Tags: python python-3.x string list

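I have not seen anyone solve a similar problem without using regular expressions. I have a text = "I have three apples because yesterday I bought three apples", a list of keywords words = ['I have', 'three', 'apples', 'yesterday'], and k = 2. I am writing a function that finds and returns every sequence of at least k "words" (a "word" here means a single element of the list, so 'I have' counts as one word).

In this case it should return ['I have three apples', 'three apples']. Even though 'yesterday' is in the string, it forms a sequence of fewer than k words, so it does not match.

I assume a stack is needed to keep track of the size of the sequence. I started writing the code, but 1) it does not work for this case, because I check 'I have', then 'I have three', and so on, so it never recognises 'three apples'; 2) I do not know how to continue. The code is as follows: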

text = "I have three apples because yesterday I bought three apples"
words = ['I have', 'three', 'apples', 'yesterday']
k = 2

check1 = []


def search(text, words, k):
    for i in words:
        finding = text.count(i)
        if finding != 0:
            check1.append(i)
            check2 = ' '.join(check1)
            
            occurrences = text.count(check2)
            if occurrences > 0:
                #i want to check if the previous number of occurrences was the same 
                #that's why I think I need a stack. if it is, i keep going
                #if it's not, i append the previous phrases to the list if they're >= k and keep 
                #checking
                pass
            else:
                #the next word doesn't belong to the sequence, so we finish the process
                #by adding the right number of word sequences >= k to the resulting list
                pass
        else:
            #the word is not in the list and I need to add check2 to the list 
            #considering all word sequences
            pass

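I would really appreciate different approaches to this problem, or any ideas, because I have kept trying to solve it this way and do not know how to implement it.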

1 answer:

Answer 0 (score: 2)

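I found a solution that walks through the text and records the words in the correct order. However, the complexity of this algorithm increases quickly as the word list and the text grow. Depending on the application, you may want to take a different approach: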

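def walk(t, w, k):
    t += ' '  # trailing space so the last word is also followed by a separator
    node = -1  # index in w of the last keyword matched in the current run
    current = []  # keywords matched in the current run
    collection = []  # finished runs of at least k keywords
    while len(t) > 1:
        elong = False  # did this pass extend the current run?
        for i in range(len(w)):
            # only keywords that come later in w than the last match may extend the run
            if i > node and t[:len(w[i])] == w[i]:
                node = i
                t = t[len(w[i]) + 1:]  # drop the keyword and the space after it
                current.append(w[i])
                elong = True
        if not elong or len(t) < 2:
            # the run ended or the text is used up: skip one word, keep the run if long enough
            t = t[t.find(' ') + 1:]
            if len(current) >= k:
                collection.append(' '.join(current))
            current = []
            node = -1
    return collection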

This function handles the request from your question as follows:

#Input:
print(walk("I have three apples because yesterday I bought three apples",
           ['I have', 'three', 'apples', 'yesterday'],
           2))

#Output:
['I have three apples', 'three apples']

#Input:
print(walk("Is three apples all I have three",
           ['I have', 'three', 'apples', 'yesterday'],
           2))

#Output:
['three apples', 'I have three']
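
Two behaviours of walk are worth knowing about. It compares raw character prefixes, so a text word such as "threefold" would match the keyword 'three'. It also skips the word at which a run ends without re-checking it as the start of a new run, so "yesterday I have three" yields [] rather than ['I have three']. The following is only a sketch of a token-based variant that avoids both, assuming whole-word matching is what you want; the name find_sequences is made up for illustration and it keeps the same "keywords must appear in list order" rule as walk.

```python
def find_sequences(text, words, k):
    """Token-based sketch: return runs of at least k keywords found back to back."""
    tokens = text.split()                    # whole words of the text
    keywords = [w.split() for w in words]    # each keyword as a list of tokens
    results = []
    pos = 0
    while pos < len(tokens):
        run = []
        # One left-to-right pass over the keyword list keeps list order,
        # like the `i > node` check in walk.
        for i, kw in enumerate(keywords):
            # `kw and` guards against an empty keyword matching forever
            if kw and tokens[pos:pos + len(kw)] == kw:
                run.append(words[i])
                pos += len(kw)
        if len(run) >= k:
            results.append(' '.join(run))
        if not run:
            pos += 1  # nothing started here, move on to the next token
        # otherwise retry at the same position, so the token that ended
        # the run can start a new one
    return results


words = ['I have', 'three', 'apples', 'yesterday']
print(find_sequences("I have three apples because yesterday I bought three apples", words, 2))
print(find_sequences("yesterday I have three", words, 2))
```

For the two inputs shown earlier this sketch returns the same lists as walk; the second call above is the case where the results differ.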

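The walk function relies heavily on spaces separating the words and does not handle punctuation well. You may want to include some preprocessing.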
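
As one possible form of that preprocessing, here is a minimal sketch that replaces ASCII punctuation with spaces and collapses repeated whitespace before the text is searched. The helper name normalize is hypothetical, and it uses str.translate rather than a regular expression, in keeping with the question. Note the assumption that all punctuation can be discarded: apostrophes are removed too, so "don't" becomes "don t".

```python
import string

# Map every ASCII punctuation character to a space (assumption: punctuation
# never belongs to a keyword).
_PUNCT_TO_SPACE = str.maketrans({ch: ' ' for ch in string.punctuation})


def normalize(text):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    return ' '.join(text.translate(_PUNCT_TO_SPACE).split())


print(normalize("I have three apples, because yesterday   I bought three apples!"))
```

The cleaned string can then be passed to the search function in place of the raw text.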
