Question

I have the following sentence:

s = "Et puis j'obtiens : [voir écran] [voir écran] [voir écran] [voir écran] [voir écran] [voir écran] [voir écran] [voir écran] [voir écran] [voir écran] Donc, ça veut dire que la suite de nombres réels"

As you can see, [voir écran] appears many times. I want it to appear only once.

I tried the following (similar to https://datascience.stackexchange.com/questions/34039/regex-to-remove-repeating-words-in-a-sentence):

from itertools import groupby


no_dupes = [k for k, v in groupby(s.split())]


# Put the list back together into a sentence
groupby_output = ' '.join(no_dupes)
print('No duplicates:', groupby_output)

...but it doesn't work.

Answer 1

You'll need a slightly more complicated regex to identify repeating phrases in brackets:

import re

pat = re.compile(r'(\[[^\]]*\])(?:\s*\1)+')

print(pat.sub(r'\1', s))
# Et puis j'obtiens : [voir écran] Donc, ça veut dire que la suite de nombres réels

(\[[^\]]*\]) captures any number of non-] characters between two brackets, and (?:\s*\1)+ looks for one or more repetitions of that group right after it, optionally separated by whitespace. We then replace each such run with a single occurrence of the group.
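
To see which part of the pattern does what, here is a small sketch that runs it on a few made-up inputs. The sample strings and the helper name collapse are assumptions for illustration, not part of the question. It suggests that only adjacent repetitions of the same bracketed phrase are collapsed:

```python
import re

# Same pattern as above: a bracketed group followed by one or more
# copies of itself, each optionally preceded by whitespace.
pat = re.compile(r'(\[[^\]]*\])(?:\s*\1)+')

def collapse(text):
    """Replace each run of identical adjacent [..] phrases with one copy."""
    return pat.sub(r'\1', text)

# Hypothetical sample inputs
print(collapse("a [x] [x] [x] b"))   # a [x] b
print(collapse("a [x][x] b"))        # a [x] b  (\s* also matches no space)
print(collapse("a [x] [y] [x] b"))   # a [x] [y] [x] b  (not adjacent, unchanged)
print(collapse("a [x] b [x] c"))     # a [x] b [x] c    (not adjacent, unchanged)
```

If repeats that are separated by other text should go as well, the regex alone is not enough and something that remembers earlier phrases is needed.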

Answer 2

Using split() also splits '[voir ecran]' into two pieces - you can split manually instead:

An O(n) solution that traverses your string once:

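# uses line continuation \ ; each piece ends with a space so the words do not run together
s = "Et puis j'obtiens : [voir écran] [voir écran] [voir écran] " \
    "[voir écran] [voir écran] [voir écran] [voir écran] " \
    "[voir écran] [voir écran] [voir écran] Donc, ça veut " \
    "dire que la suite de nombres réels"

seen = set()    # bracketed phrases already emitted
result = []     # pieces of the output string
tmp = []        # characters of the bracketed phrase being collected
for c in s:
    if tmp and c == "]":
        # closing bracket: the phrase is complete
        tmp.append(c)
        phrase = ''.join(tmp)
        if phrase not in seen:
            result.append(phrase)
            seen.add(phrase)
        tmp = []
    elif tmp:
        # inside a bracketed phrase
        tmp.append(c)
    elif c == "[":
        # opening bracket: start collecting
        tmp.append(c)
    else:
        # ordinary character outside brackets
        result.append(c)

# leftover text from an unclosed "[": join it first, a list cannot be looked up in a set
if tmp:
    rest = ''.join(tmp)
    if rest not in seen:
        result.append(rest)

s_after = ''.join(result)
print(s_after)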

Output:

Et puis j'obtiens : [voir écran]          Donc, ça veut dire que la suite de nombres réels

This does not remove the multiple spaces from the result - you need to do that afterwards.
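
One way to do that cleanup, as a minimal sketch: the s_after value below is rebuilt by hand to match the output shown above (an assumption, so the snippet runs on its own), and the names cleaned and cleaned_alt are only for illustration.

```python
import re

# Output of the loop above: duplicates are gone, the spaces between them are not.
s_after = ("Et puis j'obtiens : [voir écran]" + " " * 10 +
           "Donc, ça veut dire que la suite de nombres réels")

# Replace every run of two or more spaces with a single space.
cleaned = re.sub(r' {2,}', ' ', s_after)
print(cleaned)
# Et puis j'obtiens : [voir écran] Donc, ça veut dire que la suite de nombres réels

# Alternative: split on any whitespace and re-join.
# This also trims both ends and normalizes tabs and newlines.
cleaned_alt = ' '.join(s_after.split())
```

The regex version touches only runs of spaces; the split/join version is shorter but changes all whitespace, which may or may not be wanted.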

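You walk through the string, adding each character to the result list until you hit a [. From then on you collect all characters into tmp until you hit a ]. You join them and check whether the phrase is already in your seen set: if it is, you do nothing and reset tmp; otherwise you add it to the result and to seen, then reset tmp. If you come across the same [...] later, it is not added again.

Continue until the end. If tmp is still filled, add it (it could hold something like '[some rest text no bracket').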
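
If only consecutive repeats need to go, as in the question, the groupby attempt can probably be kept by splitting in a way that leaves bracketed phrases whole. A minimal sketch, assuming brackets are never nested and bracketed phrases are separated from other words by whitespace; the token pattern is my own choice, not taken from the question or either answer, and the sentence is shortened to three repeats:

```python
import re
from itertools import groupby

s = ("Et puis j'obtiens : [voir écran] [voir écran] [voir écran] "
     "Donc, ça veut dire que la suite de nombres réels")

# A token is either a whole [..] phrase or a run of non-whitespace characters.
tokens = re.findall(r'\[[^\]]*\]|\S+', s)

# groupby groups equal neighbours, so taking each key keeps one token per run.
no_dupes = [key for key, _ in groupby(tokens)]
print(' '.join(no_dupes))
# Et puis j'obtiens : [voir écran] Donc, ça veut dire que la suite de nombres réels
```

Unlike the loop with the seen set, this keeps a bracketed phrase that shows up again later in the sentence, and it also collapses ordinary words that are repeated back to back.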

Python: remove consecutively repeated words in a sentence
