Question

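I need a regular expression that matches repeated (more than one) punctuation marks and symbols. Basically all repeated non-alphanumeric, non-whitespace characters, such as ..., ???, !!!, ###, @@@, +++ and so on. It must be the same character repeated, so not a sequence like "!?@".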

I have tried [^\s\w]+. While that covers all the !!!, ???, $$$ cases, it gives me more than I want, because it also matches "!?@".
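To illustrate the over-matching, here is a minimal sketch (the sample string is made up for this example):

```python
import re

text = "wow!!! really??? what!?@ ok"

# [^\s\w]+ matches any run of non-space, non-word characters,
# whether or not the characters in the run are identical.
matches = re.findall(r"[^\s\w]+", text)
print(matches)  # ['!!!', '???', '!?@']
```

The last item is the unwanted mixed sequence.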

Can someone enlighten me? Thanks.

Answer 1

I think you are looking for something like this:

[run for run, leadchar in re.findall(r'(([^\w\s])\2+)', yourstring)]

Example:

In : teststr = "4spaces    then(*(@^#$&&&&(2((((99999****"

In : [run for run, leadchar in re.findall(r'(([^\w\s])\2+)',teststr)]
Out: ['&&&&', '((((', '****']

This gives you a list of the runs, excluding the 4 spaces in that string as well as sequences like '*(@^'.

If that is not exactly what you want, you might edit your question with an example string and precisely the output you want to see.
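One detail that may matter: \w includes the underscore, so [^\w\s] never matches "_" and a run like "___" is skipped. If underscores should count as symbols, a possible variant (a sketch only; the name symbol_runs and the sample string are hypothetical) is:

```python
import re

# Hypothetical variant: "_" is part of \w, so it is added back explicitly.
RUN = re.compile(r"([^\w\s]|_)\1+")

def symbol_runs(text):
    # group(0) is the whole run; group(1) is the single repeated character.
    return [m.group(0) for m in RUN.finditer(text)]

print(symbol_runs("a___b &&& c!?@ d--"))  # ['___', '&&&', '--']
```

Using finditer() with group(0) also avoids unpacking the (run, leadchar) tuples that findall() returns when a pattern has two groups.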

Answer 2

Try this pattern:

([.\?#@+,<>%~`!$^&\(\):;])\1+

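\1 refers to the first matched group, which is the contents of the parentheses.

You will need to extend the list of punctuation and symbols as needed.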
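In Python, one way to avoid maintaining the list by hand is to build the character class from string.punctuation. This is a sketch under the assumption that ASCII punctuation is enough; the names PUNCT_RUN and punct_runs are made up for the example:

```python
import re
import string

# re.escape() backslash-escapes characters such as "-", "]" and "\" so
# they are taken literally inside the character class.
PUNCT_RUN = re.compile("([" + re.escape(string.punctuation) + r"])\1+")

def punct_runs(text):
    # Return every run of two or more identical punctuation characters.
    return [m.group(0) for m in PUNCT_RUN.finditer(text)]

print(punct_runs("a... b??? c!?@ d## e+"))  # ['...', '???', '##']
```

Note that string.punctuation covers ASCII only and includes the underscore.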

Answer 3

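EDIT: @Firoze Lafeer posted an answer that does everything with a single regular expression. I'll leave this up in case anyone is interested in combining a regular expression with a filtering function, but for this problem it would be simpler and faster to use Firoze Lafeer's answer.

The answer I wrote before I saw Firoze Lafeer's answer is below, unchanged.

A simple regular expression can't do this. The classic pithy summary is "regular expressions can't count". Discussion here: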

How to check that a string is a palindrome using regular expressions?

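For a Python solution, I recommend combining a regular expression with a bit of Python code. The regular expression throws out everything that is not some sort of run of punctuation, and then the Python code checks for and throws out false matches (matches that are punctuation but not all the same character).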

import re
import string

# Character class to match punctuation.  Characters such as the dash
# ('-') are special in character classes, so re.escape() puts a
# backslash in front of them to make them literal.
_char_class_punct = "[" + re.escape(string.punctuation) + "]"

# Pattern: a punctuation character followed by one or more punctuation characters.
# Thus, a run of two or more punctuation characters.
_pat_punct_run = re.compile(_char_class_punct + _char_class_punct + '+')

def all_same(seq, basis_case=True):
    itr = iter(seq)
    try:
        first = next(itr)
    except StopIteration:
        return basis_case
    return all(x == first for x in itr)

def find_all_punct_runs(text):
    return [s for s in _pat_punct_run.findall(text) if all_same(s, False)]


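# Alternate version of find_all_punct_runs() using re.finditer().
# Keep only one of the two definitions; if both are left in the same
# module, this one replaces the one above.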
def find_all_punct_runs(text):
    return (s for s in (m.group(0) for m in _pat_punct_run.finditer(text)) if all_same(s, False))

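I wrote all_same() the way I did so that it works just as well on an iterator as on a string. Python's built-in all() returns True for an empty sequence, which is not what we need for this particular use of all_same(), so I added an argument for the desired basis case, defaulting to True to match the behavior of all().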
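A quick check of that behavior, repeating the all_same() definition from the code so the snippet runs on its own (the inputs are made up):

```python
def all_same(seq, basis_case=True):
    itr = iter(seq)
    try:
        first = next(itr)
    except StopIteration:
        # Empty input: return whatever the caller chose as the basis case.
        return basis_case
    return all(x == first for x in itr)

print(all_same("!!!"))        # True
print(all_same(iter("!!?")))  # False, and it works on an iterator too
print(all_same(""))           # True, same as all([])
print(all_same("", False))    # False, the basis case used for filtering
```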

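This does as much of the work as possible inside Python's internals (the regular expression engine or all()), so it should be quite fast. For large input texts, you may want to rewrite find_all_punct_runs() to use re.finditer() instead of re.findall(). I gave an example. The example also returns a generator expression rather than a list. You can always force it to make a list: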

lst = list(find_all_punct_runs(text))
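One behavioral difference from the single-regex approach that seems worth knowing: when two different runs are adjacent, the filter sees them as one mixed run and drops both. The sketch below is self-contained and uses len(set(s)) == 1 as a stand-in for all_same(); the names any_run and same_run and the sample text are made up:

```python
import re
import string

PUNCT = "[" + re.escape(string.punctuation) + "]"
any_run = re.compile(PUNCT + PUNCT + "+")     # two or more punctuation characters
same_run = re.compile("(" + PUNCT + r")\1+")  # the same character repeated

text = "wait!!!??? ok..."

# Filter approach: "!!!???" is matched as one run, then rejected as mixed.
filtered = [s for s in any_run.findall(text) if len(set(s)) == 1]
# Backreference approach: each same-character run is matched separately.
backref = [m.group(0) for m in same_run.finditer(text)]

print(filtered)  # ['...']
print(backref)   # ['!!!', '???', '...']
```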

Answer 4

This is how I would do it:

>>> import re
>>> st = 'non-whitespace characters such as ..., ???, !!!, ###, @@@, +++ and'
>>> reg = r'(([.?#@+])\2{2,})'
>>> print([m.group(0) for m in re.finditer(reg, st)])

or

>>> print([g for g, l in re.findall(reg, st)])

Either one prints:

['...', '???', '###', '@@@', '+++']
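Note that \2{2,} asks for at least two repeats after the first character, so only runs of three or more match, and '!!!' is absent from the output because '!' is not in the character class. A small comparison with a plain + quantifier (a sketch; the sample string and variable names are made up, and the outer group is dropped so the backreference becomes \1):

```python
import re

text = "a.. b... c???? d++"

# {2,}: first character plus at least two repeats, so three or more in total.
three_plus = [m.group(0) for m in re.finditer(r"([.?#@+])\1{2,}", text)]
# +: first character plus at least one repeat, so two or more in total.
two_plus = [m.group(0) for m in re.finditer(r"([.?#@+])\1+", text)]

print(three_plus)  # ['...', '????']
print(two_plus)    # ['..', '...', '????', '++']
```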

Python regex for repeated punctuation and symbols
