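I have a string that contains a repeated phrase (it may even be a single word) occurring several times in a row.
I have tried various approaches, but could not find one that is efficient in both time and space.
Here is what I have tried: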
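import re
from itertools import groupby

String = "what type of people were most likely to be able to be able to be able to be able to be 1.35 ?"
# Attempt 1: collapse runs of identical adjacent words
s1 = " ".join([k for k,v in groupby(String.replace("</Sent>","").split())])
# Attempt 2: collapse a repeated group with a backreference
s2 = re.sub(r'\b(.+)(\s+\1\b)+', r'\1', String)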
As far as I can tell, neither of these works.
My expected result:
what type of people were most likely to be able to be 1.35 ?
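These are some of the posts I referred to.
Please do not mark my question as a duplicate of the posts above: I tried most of the implementations in them and could not find a working solution.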
Answer 0: (score: 4)
I would take this creative approach of looking for duplicates of increasing length:
input = "what type of people were most likely to be able to be able to be able to be able to be 1.35 ?"

def combine_words(input, length):
    combined_inputs = []
    if len(splitted_input) > 1:
        for i in range(len(input) - 1):
            # add the last word of the right-neighbour (overlapping) sequence (before it has expanded),
            # which is the next word in the original sentence
            combined_inputs.append(input[i] + " " + last_word_of(splitted_input[i + 1], length))
    return combined_inputs, length + 1

def remove_duplicates(input, length):
    bool_broke = False  # this means we didn't find any duplicates here
    for i in range(len(input) - length):
        if input[i] == input[i + length]:  # found a duplicate piece of sentence!
            for j in range(0, length):  # remove the overlapping sequences in reverse order
                del input[i + length - j]
            bool_broke = True
            # break the for loop: its range no longer matches the length of the list, as we removed elements
            break
    if bool_broke:
        # if we found a duplicate, look for another duplicate of the same length
        return remove_duplicates(input, length)
    return input

def last_word_of(input, length):
    splitted = input.split(" ")
    if len(splitted) == 0:
        return input
    else:
        return splitted[length - 1]

# make a list of strings which represent every sequence of word_length adjacent words
splitted_input = input.split(" ")
word_length = 1
splitted_input, word_length = combine_words(splitted_input, word_length)

intermediate_output = False

while len(splitted_input) > 1:
    # look whether two sequences of length n (with distance n apart) are equal;
    # if so, remove the n overlapping sequences
    splitted_input = remove_duplicates(splitted_input, word_length)
    # make even bigger sequences
    splitted_input, word_length = combine_words(splitted_input, word_length)
    if intermediate_output:
        print(splitted_input)
        print(word_length)

# In the end you have a list of length 1, with all possible lengths of repetitive words removed
output = splitted_input[0]
print(output)
This outputs the fluent sentence:
what type of people were most likely to be able to be 1.35 ?
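Even if this is not the desired output, I don't see how one would decide to remove the length-2 phrase "to be" that occurred three positions earlier.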
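For comparison, the same idea (drop a phrase when it is immediately followed by an identical copy, trying longer and longer phrases) can be sketched more compactly on a list of words. This is only a minimal sketch, not the code from the answer above: the function name `collapse_repeats` is made up for illustration, and it assumes words are separated by whitespace and that a single pass over increasing phrase lengths is enough (it does not go back to re-check shorter lengths after a longer phrase has been collapsed).

```python
def collapse_repeats(text):
    """Collapse immediately repeated word sequences, shortest phrases first.

    Hypothetical helper for illustration; not part of the original answer.
    """
    words = text.split()
    length = 1
    # A phrase of `length` words can only repeat if at least 2 * length words exist.
    while 2 * length <= len(words):
        i = 0
        while i + 2 * length <= len(words):
            if words[i:i + length] == words[i + length:i + 2 * length]:
                # Drop the second copy and re-check the same position,
                # so runs of three or more copies also collapse.
                del words[i + length:i + 2 * length]
            else:
                i += 1
        length += 1
    return " ".join(words)


sentence = ("what type of people were most likely to be able to be able "
            "to be able to be able to be 1.35 ?")
print(collapse_repeats(sentence))
```

Because it compares slices of the word list rather than building overlapping strings, it needs no recursion and no global state, at the cost of re-scanning the list once per phrase length.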
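Answer 1: (score: 2)
I'm pretty sure the order is preserved with this approach in Python 3.7, where dicts are guaranteed to keep insertion order; I'm not sure about older versions.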
String = "what type of people were most likely to be able to be able to be able to be able to be 1.35 ?"
unique_words = dict.fromkeys(String.split())
print(' '.join(unique_words))
>>> what type of people were most likely to be able 1.35 ?
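If relying on dict ordering in older Python versions is a concern, the same result can be had with an explicit `set` and `list`. This is a rough sketch with a made-up function name (`unique_words_in_order`), not something from the answer itself. Note that, like the `dict.fromkeys` version, it drops every later occurrence of a word, not only consecutive repeats, which is why the second "to be" disappears from the output.

```python
def unique_words_in_order(text):
    """Keep the first occurrence of each word, in original order.

    Hypothetical helper for illustration; does not depend on dict ordering.
    """
    seen = set()
    kept = []
    for word in text.split():
        if word not in seen:
            seen.add(word)
            kept.append(word)
    return " ".join(kept)


String = ("what type of people were most likely to be able to be able "
          "to be able to be able to be 1.35 ?")
print(unique_words_in_order(String))
```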