Matching only unquoted words with a regular expression in Python

Time: 2018-09-16 21:26:58

Tags: python regex python-3.x

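While trying to process some code, I need to find the places where variables from a specific list are used. The problem is that the code is obfuscated, and those variable names may also appear inside strings, for example, where I do not want to match them.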

However, I have not been able to find a regular expression that is valid in Python and matches only unquoted words.

1 answer:

Answer 0: (score: 0)

"[^\\\\]((\")|('))(?(2)([^\"]|\\\")*|([^']|\\')*)[^\\\\]\\1|(\w+)"

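This should match every unquoted word in the last group (group 6 in the pattern, which is index 5 in each 0-indexed tuple returned by re.findall). Because the pattern requires one character before the opening quote, a quoted string at the very start of the input is not recognized as a string; some minor modifications are needed to handle that case.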
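One way such a modification might look, offered as a sketch rather than as the answerer's own fix: let the opening quote be preceded either by the start of the input or by a non-backslash character. The `(?:^|[^\\])` prefix and the names `ORIGINAL`, `ANCHORED`, and `words` are assumptions introduced here for illustration; the non-capturing prefix leaves the group numbering unchanged.

```python
import re

# The pattern from the answer (with \w escaped as \\w for a normal string literal).
ORIGINAL = "[^\\\\]((\")|('))(?(2)([^\"]|\\\")*|([^']|\\')*)[^\\\\]\\1|(\\w+)"

# Hypothetical variant: the quote may also sit at the very start of the input.
ANCHORED = "(?:^|[^\\\\])((\")|('))(?(2)([^\"]|\\\")*|([^']|\\')*)[^\\\\]\\1|(\\w+)"


def words(pattern, text):
    """Return the non-empty values of the last group (index 5)."""
    return [groups[5] for groups in re.findall(pattern, text) if groups[5]]


text = "\"quoted words\" plain"
print(words(ORIGINAL, text))  # ['quoted', 'words', 'plain']
print(words(ANCHORED, text))  # ['plain']
```

With the original pattern the leading string is not consumed, so its words leak into the result; the anchored variant consumes it.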

Explanation:

[^\\\\] Match any character but an escape character. Escaped quotes do not start a string.
((\")|(')) Immediately after the non-escaped character, match either " or ', which starts a string. This is group 1, which contains groups 2 (\") and 3 (')
(?(2) if we matched group 2 (a double-quote)
    ([^\"]|\\\")*| match anything but double quotes, or match escaped double quotes. Otherwise:
    ([^']|\\')*) match anything but a single quote or match an escaped single quote.
        If you wish to retrieve the string inside the quotes, you will have to add another group: (([^\"]|\\\")*) will allow you to retrieve the whole consumed string, rather than just the last matched character.
        Note that the last character of a quoted string will actually be consumed by the last [^\\\\]. To retrieve it, you have to turn it into a group: ([^\\\\]). Additionally, the first character before the quote will also be consumed by [^\\\\], which might be meaningful in cases such as r"Raw\text".
[^\\\\]\\1 will match any non-escape character followed by what the first group matched again. That is, if ((\")|(')) matched a double quote, we require a double quote to end the string. Otherwise, it matched a single quote, which is what we require to end the string.
|(\w+) will match any word. This only matches unquoted words, because quoted strings are consumed by the first alternative.
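The two notes about retrieving the quoted text can be combined into one variant of the pattern. The following is a minimal sketch, assuming the extra groups are added exactly as described above; the names `WITH_CONTENT` and `quoted_strings` are hypothetical. The repetition is wrapped in a capturing group, and the final `[^\\\\]` becomes a group too, so the full content is the body plus that last character.

```python
import re

# Groups: 1 = opening quote, 2 = double quote, 3 = single quote,
#         4 = double-quoted body, 5 = single-quoted body,
#         6 = last character before the closing quote, 7 = unquoted word.
WITH_CONTENT = (
    "[^\\\\]((\")|('))"
    "(?(2)((?:[^\"]|\\\")*)|((?:[^']|\\')*))"
    "([^\\\\])\\1"
    "|(\\w+)"
)


def quoted_strings(text):
    """Return the content of each quoted string the pattern consumes."""
    found = []
    for match in re.finditer(WITH_CONTENT, text):
        if match.group(1):  # the string alternative matched, not a bare word
            body = match.group(4) if match.group(2) else match.group(5)
            found.append(body + match.group(6))
    return found


print(quoted_strings("x = \"hello world\" + y"))  # ['hello world']
```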

For example:

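import re
# The first alternative consumes quoted strings; the second captures a bare word in group 6.
non_quoted_words = "[^\\\\]((\")|('))(?(2)([^\"]|\\\")*|([^']|\\')*)[^\\\\]\\1|(\\w+)"
quote = "This \"is an example ' \\\" of \" some 'text \\\" like wtf' \\\" is what I said."
print(quote)
# Each result is a 6-tuple of groups; the unquoted word, if any, is at index 5.
print(re.findall(non_quoted_words, quote))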

This returns:

This "is an example ' \" of " some 'text \" like wtf' \" is what I said.
[('', '', '', '', '', 'This'), ('"', '"', '', 'f', '', ''), ('', '', '', '', '', 'some'), ("'", '', "'", '', 't', ''), ('', '', '', '', '', 'is'), ('', '', '', '', '', 'what'), ('', '', '', '', '', 'I'), ('', '', '', '', '', 'said')]
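To get from these tuples back to the original goal (finding uses of variables from a given list outside of strings), the last element of each tuple can be filtered. This is a rough sketch; `unquoted_words`, `find_variables`, and the sample name set are illustrative assumptions, not part of the answer.

```python
import re

NON_QUOTED_WORDS = "[^\\\\]((\")|('))(?(2)([^\"]|\\\")*|([^']|\\')*)[^\\\\]\\1|(\\w+)"


def unquoted_words(text):
    """Keep only the last group (index 5), skipping tuples where it is empty."""
    return [groups[5] for groups in re.findall(NON_QUOTED_WORDS, text) if groups[5]]


def find_variables(text, names):
    """Return the unquoted words that appear in the given collection of names."""
    return [word for word in unquoted_words(text) if word in names]


quote = "This \"is an example ' \\\" of \" some 'text \\\" like wtf' \\\" is what I said."
print(unquoted_words(quote))                 # ['This', 'some', 'is', 'what', 'I', 'said']
print(find_variables(quote, {"is", "wtf"}))  # ['is']
```

The quoted occurrences of "is" and "wtf" are skipped; only the unquoted "is" is reported.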