Regex to Match String Syntax in Code

Time: 2018-04-18 18:00:25

Tags: python regex escaping syntax-highlighting

Let's say, for example, there is a Python source code file like:

def someStuff():
  return "blabla"

myThing = "Bob told me: \"Hello there!\""

twoStrings = "first part " + "second part"

How would I write a regular expression to match:

"blabla", "Bob told me: \"Hello there!\"", "first part ", & "second part"

including the surrounding quotes?


Originally, I figured this could be done simply with \"[^\"]*\", but this fails to account for strings that contain a \". I've also tried incorporating negative look-behinds:

(?<!\\)\"[^\"]*(?<!\\)\"

but have not had any success. What would be the recommended way to handle this?
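
A minimal sketch that reproduces the problem, assuming the sample file above is held in a Python string. The names SOURCE, naive, lookbehind and wanted are introduced here only for illustration:

```python
import re

# The sample file from the question, held in a raw string so the
# backslashes reach the regex engine unchanged.
SOURCE = r'''def someStuff():
  return "blabla"

myThing = "Bob told me: \"Hello there!\""

twoStrings = "first part " + "second part"
'''

# First attempt: everything between two double quotes.
naive = re.findall(r'"[^"]*"', SOURCE)

# Second attempt: the same, with negative look-behinds on both quotes.
lookbehind = re.findall(r'(?<!\\)"[^"]*(?<!\\)"', SOURCE)

# The literal that both attempts should return as one match.
wanted = r'"Bob told me: \"Hello there!\""'

print(naive)
print(lookbehind)
print(wanted in naive, wanted in lookbehind)
```

Neither list contains the escaped string as a single match: [^"]* can never step over an escaped quote, so the look-behind only rejects the wrong closing quote without letting the match continue past it.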

2 Answers:

Answer 0: (score: 1)

This regex (used with the single-line modifier s) should match all kinds of string literals:

([bruf]*)("""|'''|"(?!")|'(?!'))(?:(?!\2)(?:\\.|[^\\]))*\2

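It supports triple-quoted strings and escape sequences, and it also captures prefixes such as r, u, f and b. See the online demo.

The single-line modifier s is required to match multi-line strings correctly. In addition, enabling the i modifier makes it match uppercase prefixes such as R'nobody uses capitalized prefixes anyways'.

As far as I know, there are two caveats:

  1. It also matches byte literals.
  2. It matches string literals inside comments.

Explanation of the regex: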

    ([bruf]*)                # match and capture any prefix characters
    ("""|'''|"(?!")|'(?!'))  # match the opening quote
    (?:                      # as many times as possible...
        (?!\2)               # ...as long as there's no closing quote... 
        (?:                  # ...match either...
            \\.              # ...a backslash and the character after it
        |                    # ...or...
            [^\\]            # ...a single non-backslash character
        )        
    )*
    \2                       # match the closing quote
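
As a rough sketch of how the pattern might be used from Python, the s and i modifiers correspond to re.DOTALL and re.IGNORECASE. The names STRING_LITERAL, find_string_literals and SOURCE are illustrative and not part of the answer; the pattern is split into pieces only so that both quote characters can be written inside Python string literals:

```python
import re

# The answer's pattern, assembled from adjacent raw strings.
STRING_LITERAL = re.compile(
    r'([bruf]*)'                                 # group 1: prefix characters
    r'("""|' r"'''" r'|"(?!")|' r"'(?!')" r')'   # group 2: opening quote
    r'(?:(?!\2)(?:\\.|[^\\]))*'                  # body: escape pairs or plain chars
    r'\2',                                       # closing quote, same as group 2
    re.DOTALL | re.IGNORECASE,                   # the s and i modifiers
)


def find_string_literals(source):
    """Return every substring of source that the pattern matches."""
    return [m.group(0) for m in STRING_LITERAL.finditer(source)]


# The sample file from the question, as a raw string.
SOURCE = r'''def someStuff():
  return "blabla"

myThing = "Bob told me: \"Hello there!\""

twoStrings = "first part " + "second part"
'''

for literal in find_string_literals(SOURCE):
    print(literal)
```

On the question's sample this prints the four literals, one per line, each with its surrounding quotes. Group 1 of each match holds the prefix, if any.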
    

Answer 1: (score: 0)

Use a negative lookbehind:

".*?(?<!\\)"

This uses a lazy quantifier (*?) to match up to the next quote ("), as long as that quote is not escaped by a backslash (\"). Compare with the simpler (but erroneous) regex ".*?"
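
A small sketch comparing the two patterns on the question's sample. The names ESCAPE_AWARE, NAIVE, SOURCE and EDGE are illustrative, and the EDGE input is an assumed extra case that does not appear in the question:

```python
import re

# The sample file from the question, as a raw string.
SOURCE = r'''def someStuff():
  return "blabla"

myThing = "Bob told me: \"Hello there!\""

twoStrings = "first part " + "second part"
'''

ESCAPE_AWARE = r'".*?(?<!\\)"'   # the regex from this answer
NAIVE = r'".*?"'                 # the simpler, erroneous regex

print(re.findall(ESCAPE_AWARE, SOURCE))
print(re.findall(NAIVE, SOURCE))

# Assumed edge case: a literal whose last character is an escaped backslash.
EDGE = r'path = "C:\\" + "x"'
print(re.findall(ESCAPE_AWARE, EDGE))
```

On the sample, the lookbehind version returns the four expected literals, while the naive one splits the escaped string at its first \". The edge case suggests a limit of the lookbehind approach: the real closing quote of "C:\\" is preceded by a backslash, so it is rejected and the match runs on into the next literal. Consuming escape pairs explicitly, as the \\. alternative in Answer 0 does, avoids this.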