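I haven't seen anyone solve a similar problem without using regular expressions. I have a text = "I have three apples because yesterday I bought three apples", a list of keywords words = ['I have', 'three', 'apples', 'yesterday'], and k = 2. I'm writing a function that finds and returns sequences of >= k "words" (by "word" I mean a combination of words that appears as a single element in the list, e.g. 'I have' counts as one word).
In this case it should return ['I have three apples', 'three apples']. Even though 'yesterday' is in the string, it forms a sequence of length < k, so it doesn't match.
I assume there needs to be a stack to keep track of the size of the sequence. I started writing the code, but 1) it doesn't work in this case, because I try to check 'I have', then 'I have three', and so on, so it never recognizes 'three apples'; and 2) I don't know how to continue. Here is the code: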
text = "I have three apples because yesterday I bought three apples"
words = ['I have', 'three', 'apples', 'yesterday']
k = 2
check1 = []

def search(text, words, k):
    for i in words:
        finding = text.count(i)
        if finding != 0:
            check1.append(i)
            check2 = ' '.join(check1)
            occurrences = text.count(check2)
            if occurrences > 0:
                # I want to check if the previous number of occurrences was the same;
                # that's why I think I need a stack. If it is, I keep going.
                # If it's not, I append the previous phrases to the list if they're >= k
                # and keep checking.
                pass
            else:
                # The next word doesn't belong to the sequence, so we finish the process
                # by adding the right number of word sequences >= k to the resulting list.
                pass
        else:
            # The word is not in the list and I need to add check2 to the list,
            # considering all word sequences.
            pass
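A different approach to this problem, or any ideas at all, would be much appreciated. I have been trying to solve it this way, but I don't know how to implement it.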
Answer 0 (score: 2)
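I found a solution that walks through the text and notes down the words in the right order. However, the cost of this algorithm grows quickly with the length of the word list and the length of the text, because every step scans the whole word list and re-slices the remaining text. Depending on the application, you may want to approach it differently: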
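def walk(t, w, k):
    t += ' '            # trailing space so the last word is handled like the others
    node = -1           # index in w of the last keyword matched in the current run
    current = []        # keywords collected for the current run
    collection = []     # finished runs with at least k keywords
    while len(t) > 1:
        elong = False   # did the current run grow in this pass?
        for i in range(len(w)):
            # only keywords that come later in w than the last match may extend the run
            if i > node and t[:len(w[i])] == w[i]:
                node = i
                t = t[len(w[i]) + 1:]   # consume the keyword and the space after it
                current.append(w[i])
                elong = True
        if not elong or len(t) < 2:
            # run ended (or text exhausted): skip one word, store the run, reset
            t = t[t.find(' ') + 1:]
            if len(current) >= k:
                collection.append(' '.join(current))
            current = []
            node = -1
    return collection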
This function handles the request from your question as follows:
#Input:
print(walk("I have three apples because yesterday I bought three apples",
           ['I have', 'three', 'apples', 'yesterday'],
           2))
#Output:
['I have three apples', 'three apples']

#Input:
print(walk("Is three apples all I have three",
           ['I have', 'three', 'apples', 'yesterday'],
           2))
#Output:
['three apples', 'I have three']
It relies heavily on spaces separating the words and does not cope well with punctuation. You may want to include some preprocessing.
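As one possible form of that preprocessing, here is a minimal sketch that avoids regular expressions. The names `normalize` and `find_runs` are hypothetical and are not part of the answer above. It makes three assumptions: ASCII punctuation can simply be replaced by spaces (so "don't" becomes "don t"), matching is case-sensitive, and keywords may follow each other in any order. That last point differs from `walk`, which only extends a run with keywords that appear later in the list. Matching is done on whole tokens rather than on string prefixes.

```python
import string

def normalize(text):
    """Replace ASCII punctuation with spaces and collapse whitespace runs."""
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    return ' '.join(text.translate(table).split())

def find_runs(text, words, k):
    """Return runs of at least k consecutive keywords, matched on whole tokens.

    Assumption: keywords may appear in any order within a run.
    """
    tokens = normalize(text).split()
    keys = [w.split() for w in words]   # 'I have' -> ['I', 'have']
    runs, current, i = [], [], 0
    while i < len(tokens):
        # first keyword whose tokens match exactly at position i, if any
        match = next((key for key in keys if tokens[i:i + len(key)] == key), None)
        if match:
            current.append(' '.join(match))
            i += len(match)
        else:
            # the run is broken: keep it only if it is long enough
            if len(current) >= k:
                runs.append(' '.join(current))
            current = []
            i += 1
    if len(current) >= k:               # run that reaches the end of the text
        runs.append(' '.join(current))
    return runs

words = ['I have', 'three', 'apples', 'yesterday']
print(find_runs("I have three apples, because yesterday I bought three apples.", words, 2))
# ['I have three apples', 'three apples']
```

Because the text is split into tokens once and scanned by index, this sketch does not re-slice the remaining string at every step. Keywords that themselves contain punctuation would need to be normalized the same way before matching.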