How to efficiently remove consecutive duplicate words or phrases from a string

Time: 2019-08-09 06:41:55

Tags: python python-3.x string

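I have a string that contains a phrase, or possibly just a single word, repeated several times in a row.

I have tried various approaches, but I could not find one that is efficient in both time and space.

Here is what I have tried: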

  1. groupby()
  2. re
import re
from itertools import groupby

String = "what type of people were most likely to be able to be able to be able to be able to be 1.35 ?"
s1 = " ".join([k for k,v in groupby(String.replace("</Sent>","").split())])
s2 = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', String)

Neither of these approaches works in my case.
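As a small illustration of why the `groupby()` attempt leaves the string untouched: `itertools.groupby` only merges runs of equal consecutive items, so it collapses a single word repeated back to back but not a repeated multi-word phrase. The helper name `collapse_adjacent_words` is made up for this sketch.

```python
from itertools import groupby

def collapse_adjacent_words(text):
    # groupby yields one key per run of equal *consecutive* items,
    # so only immediately repeated single words are merged
    return " ".join(word for word, _ in groupby(text.split()))

print(collapse_adjacent_words("the the cat sat sat down"))     # the cat sat down
print(collapse_adjacent_words("to be able to be able to be"))  # to be able to be able to be
```

In the second call no two neighbouring words are equal, so nothing is removed even though the phrase "to be able" repeats.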

My expected result:

what type of people were most likely to be able to be 1.35 ?

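These are some of the posts I referred to:

  1. Is there a way to remove duplicate and continuous words/phrases in a string? - does not work
  2. How can I remove duplicate words in a string with Python? - partially works, but I also need an approach that is optimal for large strings

Please do not mark my question as a duplicate of the posts above; I have tried most of the implementations there and could not find an efficient solution.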

2 answers:

Answer 0: (score: 4)

I would take this creative approach of looking for duplicates of increasing length:

sentence = "what type of people were most likely to be able to be able to be able to be able to be 1.35 ?"

def combine_words(sequences, length):
    """Extend every sequence of `length` words by one word."""
    if len(sequences) <= 1:
        # nothing left to combine; return the list unchanged so the main loop can stop
        return sequences, length + 1
    combined_inputs = []
    for i in range(len(sequences) - 1):
        # add the last word of the right-hand (overlapping) neighbour, before it has expanded,
        # which is the next word in the original sentence
        combined_inputs.append(sequences[i] + " " + last_word_of(sequences[i + 1], length))
    return combined_inputs, length + 1

def remove_duplicates(sequences, length):
    """Remove sequences that repeat the sequence `length` positions earlier."""
    found = True
    while found:  # after each removal, look for another duplicate of the same length
        found = False
        for i in range(len(sequences) - length):
            if sequences[i] == sequences[i + length]:  # found a duplicate piece of sentence!
                del sequences[i + 1:i + length + 1]  # remove the overlapping sequences
                found = True
                break  # the list got shorter, so restart the scan
    return sequences

def last_word_of(sequence, length):
    # every sequence holds exactly `length` words, so this is its last word
    return sequence.split(" ")[length - 1]

# start from single words, then make a list of strings which represent every pair of adjacent words
splitted_input = sentence.split(" ")
word_length = 1
splitted_input, word_length = combine_words(splitted_input, word_length)

intermediate_output = False  # set to True to print the list after every round

while len(splitted_input) > 1:
    # check whether two sequences of length n (n positions apart) are equal; if so, remove the n overlapping sequences
    splitted_input = remove_duplicates(splitted_input, word_length)
    # make even bigger sequences
    splitted_input, word_length = combine_words(splitted_input, word_length)
    if intermediate_output:
        print(splitted_input)
        print(word_length)
# in the end the list has length 1, with repeated word sequences of the checked lengths removed
output = splitted_input[0]
print(output)

This outputs the fluent sentence:

what type of people were most likely to be able to be 1.35 ?

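Even if this is not the desired output, I don't see how one could decide to remove the "to be" (of length 2) that occurs 3 positions earlier.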
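The same idea (compare each run of n words with the n words that follow it, for growing n) can also be sketched more compactly on a plain word list. This is an illustrative variant, not the code from this answer: `collapse_repeats` and its `max_len` parameter are names invented for the sketch. It is still far from linear in the number of words in the worst case, so `max_len` can be used to cap the phrase length on long inputs.

```python
def collapse_repeats(text, max_len=None):
    """Collapse immediately repeated runs of 1, 2, 3, ... words (hypothetical helper)."""
    words = text.split()
    n = 1
    while 2 * n <= len(words) and (max_len is None or n <= max_len):
        i = 0
        while i + 2 * n <= len(words):
            if words[i:i + n] == words[i + n:i + 2 * n]:
                # the next n words repeat the current n words: drop the repeat
                # and stay at i in case yet another copy follows
                del words[i + n:i + 2 * n]
            else:
                i += 1
        n += 1
    return " ".join(words)

sentence = "what type of people were most likely to be able to be able to be able to be able to be 1.35 ?"
print(collapse_repeats(sentence))
# what type of people were most likely to be able to be 1.35 ?
```

Because shorter phrase lengths are handled first, a repeat that only appears after a longer phrase has been collapsed is not revisited; running the function again until the result stops changing would be one way around that.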

Answer 1: (score: 2)

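I'm fairly sure this approach preserves the word order in Python 3.7; I'm not sure about older versions. (Dict insertion order is a language guarantee from Python 3.7 onward; in CPython 3.6 it was an implementation detail.)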

String = "what type of people were most likely to be able to be able to be able to be able to be 1.35 ?"
unique_words = dict.fromkeys(String.split())
print(' '.join(unique_words))
# Output: what type of people were most likely to be able 1.35 ?
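Note that in the output above the second "to be" is gone as well: `dict.fromkeys` keeps only the first occurrence of every word, whether or not the repeats are consecutive. A minimal sketch of that behaviour, using a made-up helper name `drop_all_duplicates`:

```python
def drop_all_duplicates(text):
    # dict.fromkeys keeps the first occurrence of each word, in insertion order,
    # so every later occurrence is dropped, consecutive or not
    return " ".join(dict.fromkeys(text.split()))

print(drop_all_duplicates("to be or not to be"))  # to be or not
```

This makes it suitable when every word should appear at most once, but not when only back-to-back repeats should be removed.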