Python re.findall prints more output than expected

Time: 2015-07-14 04:10:34

Tags: python regex

I need some help with re.findall. My input is as follows:

# Python 3.4.2
import re
code = b'''
#include "..\..\src.h"\r
/********************************************//**
 *  ... text
 ***********************************************/
/*!< Detailed description after the member */
int inx = -1l
const char* = "hello, world";
'''
commonP = rb'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"'

The expected output of re.findall is only the C++ comments:

/********************************************//**
 *  ... text
 ***********************************************/
/*!< Detailed description after the member */

I checked the pattern with the following re.sub call, and it removes all of the comments:

def comment_remover(text):
    def replacer(match):
        s = match.group(0)
        if s.startswith(b'/'):
            return b' ' # note: a space and not an empty string
        else:
            return s
    pattern = re.compile(
        commonP,
        re.DOTALL | re.MULTILINE
    )
    return re.sub(pattern, replacer, text)

new_code = comment_remover(code)
print(new_code)

But if I change re.sub to re.findall:

print('=' * 100)
L = re.findall(commonP, code, flags = re.DOTALL | re.MULTILINE)
for item in L:
    print(item)

it gives me more output than I want.

What am I doing wrong here?

1 answer:

Answer 0: (score: 1)

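Your regular expression also matches quoted strings. That is exactly what the last alternative ("(?:\\.|[^\\"])*") does, and the single-quote alternative before it does the same for '...' literals. https://regex101.com/r/qM1oK5/1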
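To see this on a smaller input, here is a minimal sketch that runs the question's pattern over a made-up line of C++ (the `sample` bytes and the `kind` label are assumptions for illustration, not part of the original post) and labels each match by its first byte:

```python
import re

# Same pattern as in the question.
commonP = rb'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"'

# Hypothetical input, chosen so that every alternative matches once.
sample = b'const char* s = "hello"; // greeting\nchar c = \'x\'; /* note */\n'

matches = re.findall(commonP, sample, flags=re.DOTALL | re.MULTILINE)
for m in matches:
    # Comments start with '/', everything else is a quoted literal.
    kind = 'comment' if m.startswith(b'/') else 'string literal'
    print(kind, m)
```

Both the comments and the quoted literals come back from re.findall, which is why the list is longer than expected.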

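However, comment_remover deals with this inside the replacer function by checking whether the match starts with /: only those matches are replaced, and string literals are returned unchanged.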

So you need to either modify the expression or filter the re.findall results.

In [33]: L = re.findall(commonP, code, flags=re.DOTALL | re.MULTILINE)

In [34]: new_L = [s for s in L if s.startswith(b'/')]

In [35]: print(b'\n'.join(new_L).decode())
/********************************************/
/**
 *  ... text
 ***********************************************/
/*!< Detailed description after the member */
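For the other option, modifying the expression, one possible sketch is to wrap the two comment alternatives in a single capturing group and leave the string alternatives uncaptured. When a pattern has exactly one group, re.findall returns that group's text, so string literals show up as empty bytes and can be dropped. The names `comment_only`, `find_comments`, and the `sample` input are hypothetical and not from the original post:

```python
import re

# Variant of the question's pattern: comments are captured, strings are not.
# The string alternatives are kept so that "//" inside a literal is consumed
# as part of the string instead of being treated as a comment.
comment_only = re.compile(
    rb'(//.*?$|/\*.*?\*/)|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
    re.DOTALL | re.MULTILINE,
)

def find_comments(text):
    # With one capturing group, findall yields the group for every match;
    # matches that were string literals yield b'' and are filtered out.
    return [c for c in comment_only.findall(text) if c]

# Hypothetical input with a "//" inside a string literal.
sample = b'const char* url = "http://example.com"; // not inside the string\n/* block */\n'
comments = find_comments(sample)
print(comments)
```

Simply deleting the string alternatives would be shorter, but then the `//` inside "http://example.com" would be reported as a comment, which is presumably why the pattern matches strings in the first place.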