Question

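Does anyone know how to use a regular expression in Python to get everything between quotes?

For example: text: "some text here".... text: "more text in here!"... text:"and some numbers - 2343- here too"

The text pieces differ in length, and some also contain punctuation and numbers. How can I write a regular expression that extracts all of them?

I would like the output to be:

some text here more text in here and some numbers - 2343 - here too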

Answer 1

This should work for you:

"(.*?)"

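Putting a ? after the * makes the quantifier lazy (non-greedy): it matches as few characters as possible, so each match stops at the first closing quote instead of running on to a later one.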

>>> r = '"(.*?)"'
>>> s =  'text: "some text here".... text: "more text in here!"... text:"and some numbers - 2343- here too"'
>>> import re
>>> re.findall(r, s)
['some text here', 'more text in here!', 'and some numbers - 2343- here too']
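
To see why the ? matters, here is a small sketch comparing the greedy and lazy forms on a shortened version of the sample string. The variable names are just for illustration, and the final join is one possible way to get the single line of output the question asks for.

```python
import re

s = 'text: "some text here".... text: "more text in here!"'

# Greedy: .* runs to the LAST quote in the string, so both quoted
# parts and everything between them end up in a single match.
greedy = re.findall(r'"(.*)"', s)

# Lazy: .*? stops at the first closing quote it can reach.
lazy = re.findall(r'"(.*?)"', s)

print(greedy)  # ['some text here".... text: "more text in here!']
print(lazy)    # ['some text here', 'more text in here!']

# Join the pieces with spaces to print them on one line.
joined = ' '.join(lazy)
print(joined)  # some text here more text in here!
```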

Answer 2

Try "[^"]*", i.e. a " followed by zero or more characters that are not ", followed by a ". So:

pat = re.compile(r'"[^"]*"')
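
A short usage sketch for this pattern, with a made-up sample string. Note that without a capturing group findall returns the quotes too, so add parentheses if you only want the contents. Also, [^"] matches newlines, while .*? does not unless re.DOTALL is set.

```python
import re

s = 'text: "some text here" and "two\nlines"'

# No group: each result includes the surrounding quotes.
with_quotes = re.compile(r'"[^"]*"').findall(s)

# With a capturing group: only the text between the quotes is returned.
contents = re.compile(r'"([^"]*)"').findall(s)

# For comparison: the lazy-dot pattern misses the quoted text that
# spans a newline, because . does not match '\n' by default.
lazy = re.findall(r'"(.*?)"', s)

print(with_quotes)  # ['"some text here"', '"two\nlines"']
print(contents)     # ['some text here', 'two\nlines']
print(lazy)         # ['some text here']
```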

Answer 3

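If the quoted substrings to be matched do not contain escaped characters, then both Karl Barker's and Pierce's answers will match correctly. However, of the two, Pierce's expression is more efficient: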

reobj = re.compile(r"""
    # Match double quoted substring (no escaped chars).
    "                   # Match opening quote.
    (                   # $1: Quoted substring contents.
      [^"]*             # Zero or more non-".
    )                   # End $1: Quoted substring contents.
    "                   # Match closing quote.
    """, re.VERBOSE)

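If you want to check the efficiency difference on your own machine, a rough sketch with timeit could look like the following. The test string and repeat counts are arbitrary choices for illustration, and the timings will vary with the input, the Python version, and the hardware, so treat the numbers as indicative only.

```python
import re
import timeit

lazy_pat = re.compile(r'"(.*?)"')        # lazy dot
negated_pat = re.compile(r'"([^"]*)"')   # negated character class

# Arbitrary test input: the sample text repeated many times (no newlines).
s = 'text: "some text here".... text: "more text in here!"... ' * 1000

# Both patterns should extract exactly the same substrings here.
same = lazy_pat.findall(s) == negated_pat.findall(s)

lazy_time = timeit.timeit(lambda: lazy_pat.findall(s), number=100)
negated_time = timeit.timeit(lambda: negated_pat.findall(s), number=100)

print(f"same results: {same}")
print(f"lazy: {lazy_time:.4f}s  negated class: {negated_time:.4f}s")
```
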
However, if the quoted substrings to be matched do contain escaped characters (e.g. "She said: \"Hi\" to me.\n"), then you will need a different expression:

reobj = re.compile(r"""
    # Match double quoted substring (allow escaped chars).
    "                   # Match opening quote.
    (                   # $1: Quoted substring contents.
      [^"\\]*           # {normal} Zero or more non-", non-\.
      (?:               # Begin {(special normal*)*} construct.
        \\.             # {special} Escaped anything.
        [^"\\]*         # more {normal} Zero or more non-", non-\.
      )*                # End {(special normal*)*} construct.
    )                   # End $1: Quoted substring contents.
    "                   # Match closing quote.
    """, re.DOTALL | re.VERBOSE)

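I know of several expressions that would do the trick, but the one above (taken from MRE3) is the most efficient. See my answer to a similar question, where various functionally equivalent expressions are compared.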
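
As a quick illustration, here is the escape-aware pattern written on one line and applied to a made-up string containing escaped quotes, next to the simple negated-class pattern. The sample string and variable names are assumptions for the demo. The captured text still contains the backslashes; unescaping is a separate step.

```python
import re

# Compact form of the escape-aware pattern: "normal* (special normal*)*".
escape_aware_pat = re.compile(r'"([^"\\]*(?:\\.[^"\\]*)*)"', re.DOTALL)
simple_pat = re.compile(r'"([^"]*)"')

# Raw string, so the backslashes are literal characters in the input.
s = r'He wrote: "She said: \"Hi\" to me.\n" and then "bye"'

# The escape-aware pattern treats \" as part of the quoted text.
escape_aware = escape_aware_pat.findall(s)

# The simple pattern stops at every ", escaped or not, and splits the text.
naive = simple_pat.findall(s)

print(escape_aware)
print(naive)
```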
