Python regular expression does not match whitespace and a non-breaking space

Date: 2017-11-20 16:15:36

Tags: python

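I am trying to replace a date/time with * characters using a regular expression in Python. The difficulty is that the HTML source contains a &nbsp; character between the date and the time, and I don't know how to match it in Python.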

My HTML source:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="nl" lang="nl">
  <body leftmargin="15" marginwidth="0" marginheight="0">
    <table summary="" width="97%" cellspacing="0" cellpadding="0">
      <tbody>
        <tr>
          <td colspan="3">U heeft gezocht met</td>
        </tr>
        <tr>
          <td width="20%">Postcode:</td>
          <td colspan="2">9999 ZZ</td>
        </tr>
        <tr><td>Huisnummer:</td>
          <td colspan="2">1</td>
        </tr>
        <tr>
          <td colspan="2"> </td>
          <td class="r">20-11-2017&nbsp;                                 11:51:01</td>
        </tr>        
      </tbody>
    </table>
  </body>
</html>
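A quick way to see why the two forms behave differently in a regex is to compare the entity as it is written in the file with the character it decodes to. The snippet below is only a sketch: the sample string is made up to resemble the cell above, and it assumes Python 3, where `html.unescape` is available in the standard library.

```python
import html

# Hypothetical sample resembling the date/time cell in the HTML above
entity_form = '20-11-2017&nbsp; 11:51:01'

# Decoding turns the six-character entity into the single character U+00A0
decoded = html.unescape(entity_form)

# repr() makes the otherwise invisible difference visible
print(repr(entity_form))
print(repr(decoded))

# Only the decoded text contains the character that \xa0 matches
print('\xa0' in entity_form, '\xa0' in decoded)  # False True
```

Running the same `repr()` check on a line read from the real file would show which of the two forms the regex actually has to deal with.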

My Python regex code:

import os
import re

def _fixhtml(self, filename):
    # Date/time inside a right-aligned <div>, e.g. 20-11-2017 11:51
    regex_datetime = r'<div align="right">\d{1,2}-\d{1,2}-\d{4} \d{1,2}:\d{2}</div>'
    subst_datetime = '<div align="right">**-**-**** **:**</div>'
    # Date/time inside <td class="r">; this is the pattern that fails to match
    regex_datetime1 = r'<td class="r">\d{1,2}-\d{1,2}-\d{4}\xa0\s+\d{1,2}:\d{2}:\d{2}</td>'
    subst_datetime1 = '<td class="r">**-**-**** **:**:**</td>'

    # Write the masked copy to a temporary file, then swap it in for the original
    out_fname = filename + ".tmp"
    with open(filename) as f, open(out_fname, "w") as out:
        for line in f:
            line = re.sub(regex_datetime, subst_datetime, line)
            line = re.sub(regex_datetime1, subst_datetime1, line)
            out.write(line)
    os.remove(filename)
    os.rename(out_fname, filename)

I have tried several combinations, such as \S\s+, and the last one I found is supposed to capture the &nbsp; character, but it does not match.
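One possible explanation, offered as a guess and not a confirmed diagnosis: if the file holds the literal six-character entity &nbsp;, then \xa0 in the pattern has nothing to match, because \xa0 only matches the decoded character U+00A0. A pattern that accepts either form, mixed with ordinary whitespace, sidesteps the question. The sketch below assumes Python 3 and text (str) input, where \s also matches U+00A0. The names `SEP`, `DATETIME_CELL`, and `mask_datetime` are made up for illustration.

```python
import re

# Separator between date and time: the literal entity "&nbsp;", or any
# whitespace (in Python 3 str patterns, \s also covers U+00A0), repeated.
SEP = r'(?:&nbsp;|\s)+'

DATETIME_CELL = re.compile(
    r'(<td class="r">)\d{1,2}-\d{1,2}-\d{4}' + SEP + r'\d{1,2}:\d{2}:\d{2}(</td>)'
)

def mask_datetime(line):
    """Replace the date/time in a <td class="r"> cell with asterisks."""
    # \1 and \2 put the original opening and closing tags back
    return DATETIME_CELL.sub(r'\1**-**-**** **:**:**\2', line)

# Entity form, as it would appear in raw HTML source
print(mask_datetime('<td class="r">20-11-2017&nbsp;      11:51:01</td>'))
# Decoded form, with a real U+00A0 character
print(mask_datetime('<td class="r">20-11-2017\xa0      11:51:01</td>'))
```

Both calls print the masked cell. If the file is read as bytes, or under Python 2 without decoding, a UTF-8 non-breaking space is the two bytes \xc2\xa0, and the pattern would need to be adapted to that.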

0 answers:

No answers yet