Is there a way to remove duplicated, consecutive words/phrases in a string? For example:
[in]: foo foo bar bar foo bar
[out]: foo bar foo bar
I tried this:
>>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
>>> [i for i,j in zip(s.split(),s.split()[1:]) if i!=j]
['this', 'is', 'a', 'foo', 'bar', 'black', 'sheep', ',', 'have', 'you', 'any', 'wool', 'woo', ',', 'yes', 'sir', 'yes', 'sir', 'three', 'bag', 'woo', 'wu']
>>> " ".join([i for i,j in zip(s.split(),s.split()[1:]) if i!=j]+[s.split()[-1]])
'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu wool'
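What happens when it gets more complicated and I want to remove phrases (let's say a phrase can consist of up to 5 words)? How can that be done? For example: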
[in]: foo bar foo bar foo bar
[out]: foo bar
Another example:
[in]: this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not prhases .
[out]: this is a sentence where phrases duplicate . sentence are not prhases .
Answer 0 (score: 13)
You can use the re module.
>>> import re
>>> s = 'foo foo bar bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
'foo bar'
>>> s = 'foo bar foo bar foo bar'
>>> re.sub(r'\b(.+)\s+\1\b', r'\1', s)
'foo bar foo bar'
If you want to match any number of consecutive occurrences:
>>> s = 'foo bar foo bar foo bar'
>>> re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
'foo bar'
Edit: an addition for the last example. To handle it, you have to keep calling re.sub for as long as repeated phrases remain. So:
>>> s = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate'
>>> while re.search(r'\b(.+)(\s+\1\b)+', s):
...     s = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', s)
...
>>> s
'this is a sentence where phrases duplicate'
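The same loop can be wrapped in a helper that also honors the question's limit of five words per phrase. This is only a sketch under a few assumptions: the name collapse_repeats and its max_words parameter are hypothetical and not part of the answer above, and the pattern uses whitespace lookarounds instead of \b so that punctuation tokens such as ',' and '.' are treated like words.

```python
import re

def collapse_repeats(text, max_words=5):
    """Collapse consecutive repeats of phrases up to max_words tokens long.

    Hypothetical helper for illustration; max_words must be at least 1.
    """
    # (?<!\S) and (?!\S) pin the match to whitespace-delimited token edges,
    # so a phrase can never start or end in the middle of a token.
    pattern = re.compile(
        r'(?<!\S)(\S+(?:\s+\S+){0,%d})(?:\s+\1)+(?!\S)' % (max_words - 1)
    )
    while True:
        # subn returns the new string and the number of replacements made.
        text, count = pattern.subn(r'\1', text)
        if count == 0:  # nothing left to collapse
            return text

print(collapse_repeats('foo bar foo bar foo bar'))  # foo bar
```

Repeating the substitution until nothing changes matters because collapsing short repeats (such as 'sentence sentence sentence') can expose longer ones that were not adjacent before.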
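Answer 1 (score: 6)
I love itertools. It seems like every time I want to write something, itertools already has it. In this case, groupby takes a list and groups repeated, consecutive items from that list into (item_value, iterator_of_those_values) tuples. Use it here like:
>>> from itertools import groupby
>>> s = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'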
>>> ' '.join(item[0] for item in groupby(s.split()))
'this is a foo bar black sheep , have you any wool woo , yes sir yes sir three bag woo wu wool'
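To make the grouping visible, here is a small sketch on the question's first example. The variable names runs and deduped are only illustrative.

```python
from itertools import groupby

words = 'foo foo bar bar foo bar'.split()

# Each group is (key, iterator over the run of equal consecutive items).
runs = [(key, len(list(group))) for key, group in groupby(words)]
print(runs)  # [('foo', 2), ('bar', 2), ('foo', 1), ('bar', 1)]

# Keeping only the keys drops the consecutive repeats.
deduped = ' '.join(key for key, _ in groupby(words))
print(deduped)  # foo bar foo bar
```

Note that groupby only merges adjacent equal items, which is why the later 'foo' and 'bar' survive as separate runs.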
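So let's extend that a little with a function that returns a list with consecutive repeated values removed:
from itertools import chain, groupby

def dedupe(lst):
    # groupby collapses runs of equal consecutive items; chain flattens them,
    # so each item in lst must itself be a list of words.
    return list(chain(*[item[0] for item in groupby(lst)]))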
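This works great for one-word phrases, but it doesn't help with longer phrases. What to do? Well, first, we want to check for longer phrases by striding over our original phrase:
def stride(lst, offset, length):
    # Yield the words before the offset as one leading chunk, if there are any.
    if offset:
        yield lst[:offset]

    # Then yield consecutive chunks of `length` words until the list is used up.
    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return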
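As a quick illustration of what the striding produces, the sketch below repeats the generator so that it runs on its own and prints the chunks for a made-up five-item list.

```python
def stride(lst, offset, length):
    # Same generator as above, repeated so this snippet is self-contained.
    if offset:
        yield lst[:offset]
    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return

words = ['a', 'b', 'c', 'd', 'e']  # illustrative input
print(list(stride(words, 0, 2)))  # [['a', 'b'], ['c', 'd'], ['e']]
print(list(stride(words, 1, 2)))  # [['a'], ['b', 'c'], ['d', 'e']]
```

Shifting the offset changes where the chunk boundaries fall, which is what lets a repeated phrase line up with two adjacent chunks no matter where it starts.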
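Now we're cooking! OK. So our strategy is to first remove all the single-word duplicates. Next, we remove the two-word duplicates, starting from offset 0 and then offset 1. After that, three-word duplicates starting at offsets 0, 1, and 2, and so on, until we reach five-word duplicates:
def cleanse(list_of_words, max_phrase_length):
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            list_of_words = dedupe(stride(list_of_words, offset, length))

    return list_of_words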
Putting it all together:
from itertools import chain, groupby

def stride(lst, offset, length):
    # Yield the words before the offset as one leading chunk, if there are any.
    if offset:
        yield lst[:offset]

    # Then yield consecutive chunks of `length` words until the list is used up.
    while True:
        yield lst[offset:offset + length]
        offset += length
        if offset >= len(lst):
            return

def dedupe(lst):
    # Collapse runs of equal consecutive chunks, then flatten back to words.
    return list(chain(*[item[0] for item in groupby(lst)]))

def cleanse(list_of_words, max_phrase_length):
    # Try every phrase length, and for each length every chunk alignment.
    for length in range(1, max_phrase_length + 1):
        for offset in range(length):
            list_of_words = dedupe(stride(list_of_words, offset, length))

    return list_of_words

a = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not prhases .'
b = 'this is a sentence where phrases duplicate . sentence are not prhases .'

print(' '.join(cleanse(a.split(), 5)) == b)
Answer 2 (score: 0)
Personally, I don't think we need any other modules (though I admit some of them are great). I managed this with simple loops, by first converting the string to a list. I tried it on all of the examples listed above, and it works fine.
sentence = input("Please enter your sentence:\n")  # Python 3; use raw_input() on Python 2
word_list = sentence.split()

def check_if_same(i, j):  # checks if two adjacent slices of word_list are the same
    end = (2 * j) - i  # end point of the second slice (it is essentially j + phrase_len)
    is_same = word_list[i:j] == word_list[j:end]
    # The line below is just for debugging. It prints the slices being compared and whether they are equal
    # print("Comparing: " + ' '.join(word_list[i:j]) + " " + ' '.join(word_list[j:end]) + " " + str(is_same))
    return is_same

phrase_len = 1
while phrase_len <= len(word_list) // 2:  # checks the sentence for different phrase lengths
    curr_word_index = 0
    while curr_word_index < len(word_list):  # checks every position for the current phrase length
        if check_if_same(curr_word_index, curr_word_index + phrase_len):
            del word_list[curr_word_index:curr_word_index + phrase_len]  # deletes the repeated phrase
        else:
            curr_word_index += 1
    phrase_len += 1

print("Answer: " + ' '.join(word_list))
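Answer 3 (score: 0)
Using a pattern similar to sharcashmo's, you can call subn, which also returns the number of replacements, inside a while loop:
import re

txt = r'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate . sentence are not phrases .'

pattern = re.compile(r'(\b\w+(?: \w+)*)(?: \1)+\b')
repl = r'\1'

res = txt

while True:
    res, nbr = pattern.subn(repl, res)  # nbr is the number of replacements made
    if nbr == 0:
        break

print(res)
The while loop stops when there are no more replacements.
With this method you catch all the overlapping matches (which is impossible in a single pass in a replacement context), without testing the same pattern twice.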
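For readers who have not used subn, here is a minimal sketch of its return value. The single-word pattern and the input string are made up for illustration and are simpler than the pattern in the answer.

```python
import re

# Illustrative single-word pattern, not the phrase pattern used above.
pattern = re.compile(r'\b(\w+)(?: \1\b)+')

# subn returns a (new_string, number_of_replacements) tuple.
result, count = pattern.subn(r'\1', 'a a b b c')
print(result, count)  # a b c 2

# A second call finds nothing to replace, which is the loop's stop condition.
result2, count2 = pattern.subn(r'\1', result)
print(count2)  # 0
```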
Answer 4 (score: -1)
txt1 = 'this is a foo bar bar black sheep , have you any any wool woo , yes sir yes sir three bag woo wu wool'
txt2 = 'this is a sentence sentence sentence this is a sentence where phrases phrases duplicate where phrases duplicate'

def remove_duplicate_words(txt):
    # Keeps only the first occurrence of each word, whether or not the repeats are adjacent.
    result = []
    for word in txt.split():
        if word not in result:
            result.append(word)
    return ' '.join(result)
Output:
In [7]: remove_duplicate_words(txt1)
Out[7]: 'this is a foo bar black sheep , have you any wool woo yes sir three bag wu'
In [8]: remove_duplicate_words(txt2)
Out[8]: 'this is a sentence where phrases duplicate'
Answer 5 (score: -1)
This should fix any number of adjacent duplicates, and works for both of your examples. I convert the string to a list, fix it, then convert it back to a string for output:
mywords = "foo foo bar bar foo bar"
words = mywords.split()

def remove_adjacent_dups(alist):
    result = []
    most_recent_elem = None
    for e in alist:
        # Keep a word only if it differs from the one just before it.
        if e != most_recent_elem:
            result.append(e)
            most_recent_elem = e
    return ' '.join(result)

print(remove_adjacent_dups(words))
Output:
foo bar foo bar