Question

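I'm writing a small Python script to gather some data from a database. The only problem is that when I export the data from MySQL as XML, the XML file contains a \b character. I wrote code to remove it, but then realized I don't need to do that processing every time, so I moved it into a method that I call only if I find a \b in the XML file. Now the regex doesn't match, even though I know the \b is there.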

Here is what I'm doing:

Main program:

'''Program should start here'''
#test the file to see if processing is needed before parsing
for line in xml_file:
    p = re.compile("\b")
    if(p.match(line)):
        print p.match(line)
        processing = True
        break #only one match needed

if(processing):
    print "preprocess"
    preprocess(xml_file)

Preprocess method:

def preprocess(file):
    #exporting from MySQL query browser adds a weird
    #character to the result set, remove it
    #so the XML parser can read the data
    print "in preprocess"
    lines = []
    for line in xml_file:
        lines.append(re.sub("\b", "", line))

    #go to the beginning of the file
    xml_file.seek(0);
    #overwrite with correct data
    for line in lines:
        xml_file.write(line);
    xml_file.truncate()

Any help would be great, thanks.

Answer 1

\b has a special meaning for the regular expression engine:

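Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character. Note that \b is defined as the boundary between \w and \W, so the precise set of characters deemed to be alphanumeric depends on the values of the UNICODE and LOCALE flags. Inside a character range, \b represents the backspace character, for compatibility with Python's string literals.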

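So when the regex engine sees \b outside a character range, it looks for a word boundary rather than a backspace. To find the backspace character itself with a regex, you need to escape it.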
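
To make the distinction concrete, here is a small sketch showing how the pattern behaves depending on whether the string literal is raw. It uses Python 3 syntax and a made-up sample string, neither of which comes from the question. One thing worth noticing: in a normal (non-raw) literal, Python itself turns "\b" into the backspace character before the regex engine ever sees it.

```python
import re

# Hypothetical sample line containing a real backspace character (0x08).
sample = "<row>\x08value</row>"

# Raw string: the regex engine receives backslash + 'b', i.e. a word boundary.
word_boundary = re.compile(r"\b")
# Normal string: Python converts "\b" to the single character 0x08 first.
backspace = re.compile("\b")

# Word boundaries are zero-width, so removing them changes nothing.
unchanged = word_boundary.sub("", sample)
# The backspace pattern removes the actual control character.
cleaned = backspace.sub("", sample)

print(repr(unchanged))
print(repr(cleaned))
```

If this holds for the exported file as well, the re.sub("\b", "", line) call in the question's preprocess method should already be removing the character.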

Answer 2

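Escape it with a backslash in the regex. Since backslashes in Python string literals also need to be escaped (unless you use a raw string, which you don't want here), you need 3 backslashes in total: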

p = re.compile("\\\b")

This produces a pattern that matches the \b (backspace) character.
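
As a quick check of what that literal contains, the sketch below prints the pattern's characters and compares it with two other spellings that should match the same backspace character. It is written in Python 3 syntax, and the sample line is invented for illustration. It also compares search with match: match only tries at the start of the string, which may be why the test in the question's loop never fires when the character sits in the middle of a line.

```python
import re

# Hypothetical line with a backspace (0x08) in the middle, not at the start.
line = "<field>\x08data</field>"

# The three-backslash literal is two characters: a backslash and a backspace.
three_backslashes = "\\\b"
print([hex(ord(c)) for c in three_backslashes])

patterns = {
    "three backslashes": re.compile(three_backslashes),
    "hex escape": re.compile(r"\x08"),
    "character class": re.compile(r"[\b]"),
}

# For each spelling, record (found by search, found by match).
results = {}
for name, pattern in patterns.items():
    # search() scans the whole line; match() only tries position 0.
    results[name] = (
        pattern.search(line) is not None,
        pattern.match(line) is not None,
    )
    print(name, results[name])
```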

Answer 3

Correct me if I'm wrong, but there is no need to use a regex to replace '\b'. You can simply use the replace method:

def preprocess(xml_file):
    #exporting from MySQL query browser adds a weird
    #character to the result set, remove it
    #so the XML parser can read the data
    print "in preprocess"
    lines = map(lambda line: line.replace("\b", ""), xml_file)
    #go to the beginning of the file
    xml_file.seek(0)
    #overwrite with correct data
    for line in lines:
        xml_file.write(line)
    # OR: xml_file.writelines(lines)
    xml_file.truncate()

Note that in Python you don't need a ';' at the end of a statement.
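
A Python 3 flavored sketch of the same idea follows. It checks for the character with the in operator and strips it with str.replace, so no regex is involved. The function name strip_backspaces and the in-memory io.StringIO standing in for the real XML file are assumptions made for illustration. A real file would need to be opened in a read/write mode such as "r+".

```python
import io

BACKSPACE = "\b"  # the single control character 0x08


def strip_backspaces(xml_file):
    """Remove backspace characters in place; return True if any were found."""
    xml_file.seek(0)
    content = xml_file.read()
    if BACKSPACE not in content:
        return False  # nothing to do, leave the file untouched
    xml_file.seek(0)
    xml_file.write(content.replace(BACKSPACE, ""))
    xml_file.truncate()  # drop leftover data past the shorter content
    return True


# Stand-in for the exported file (hypothetical contents).
fake_file = io.StringIO("<row>\bvalue</row>\n<row>other</row>\n")
changed = strip_backspaces(fake_file)
print(changed, repr(fake_file.getvalue()))
```

This version reads the whole file into memory, which is probably fine for a small export. For large files, the line-by-line loop above is the safer shape.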

Regular expression not matching
