re.sub not working - even though the regex pattern is found?

Asked: 2014-06-30 12:48:10

Tags: python html regex replace html-parsing

Consider this example, which I'm running on Python 2.7:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

tstr = r'''    <div class="thebibliography">
   <p class="bibitem" ><span class="biblabel">
 [1]<span class="bibsp">   </span></span><a
 id="Xtester"></a><span
class="cmcsc-10">A<span
class="small-caps">k</span><span
class="small-caps">e</span><span
class="small-caps">g</span><span
class="small-caps">c</span><span
class="small-caps">t</span><span
class="small-caps">o</span><span
class="small-caps">r</span>,</span>
   <span
class="cmcsc-10">P. D.</span><span
class="cmcsc-10"> H.  </span> testöng ... .  <span
class="cmti-10">Draftin:</span>
   <a
href="http://www.example.com/test.html" class="url" ><span
class="cmitt-10">http://www.example.com/test.html</span></a> (2001).
</p>
   </div>

'''

# remove <a id>
tout2 = re.sub(r'''<a[\s]*?id=['"].*?['"][\s]*?></a>''', " ", tstr, re.DOTALL)
# remove class= in <a
regstr = r'''(<a.*?)(class=['"].*?['"])([\s]*>)'''
print(  re.findall(regstr, tout2, re.DOTALL))             # finds
print("------") #
print(      re.sub(regstr, "AAAAAAA", tout2, re.DOTALL )) # does nothing?

When I run this, the first regex is replaced/sub'd as expected (the <a id> tag is gone); then in the output I get:

[('<a\nhref="http://www.example.com/test.html" ', 'class="url"', ' >')]

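...which means the second regex is written correctly (all three parts are found). However, when I try to replace that whole fragment with "AAAAAAA", nothing happens to that part of the output: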

------
    <div class="thebibliography">
   <p class="bibitem" ><span class="biblabel">
 [1]<span class="bibsp">   </span></span> <span
class="cmcsc-10">A<span
class="small-caps">k</span><span
class="small-caps">e</span><span
class="small-caps">g</span><span
class="small-caps">c</span><span
class="small-caps">t</span><span
class="small-caps">o</span><span
class="small-caps">r</span>,</span>
   <span
class="cmcsc-10">P. D.</span><span
class="cmcsc-10"> H.  </span> testöng ... .  <span
class="cmti-10">Draftin:</span>
   <a
href="http://www.example.com/test.html" class="url" ><span
class="cmitt-10">http://www.example.com/test.html</span></a> (2001).
</p>
   </div>
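Clearly, there is no "AAAAAAA" here, as I would have expected.

What is the problem, and what should I do to get sub to replace the match that was evidently found?

5 answers:

Answer 0 (score: 2)

Why not use an HTML parser to parse and modify the HTML?

An example using BeautifulSoup and replace_with():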

from bs4 import BeautifulSoup

data = """Your html here"""
soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly

for link in soup('a', id=True):
    link.replace_with('AAAAAA')

print(soup.prettify())

This replaces every link that has an id attribute with AAAAAA:

<div class="thebibliography">
<p class="bibitem">
<span class="biblabel">
 [1]
 <span class="bibsp">
 </span>
</span>
AAAAAA
<span class="cmcsc-10">
...

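Answer 1 (score: 1)

Your replacement does not work because re.sub is being misused. If you look at the documentation:

re.sub(pattern, repl, string, count=0, flags=0)

But in your code you put the flag in the position of the count argument. That is why the re.DOTALL flag has no effect: it is passed as count, and flags stays at 0.

Since you don't need the count argument, you can drop the re.DOTALL flag and use an inline modifier instead: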

regstr = r'''(?s)(<a.*?)(class=['"].*?['"])([\s]*>)'''
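
The inline-modifier variant can be checked with a small sketch. The input string below is a hypothetical, shortened stand-in for the HTML in the question; only the pattern is taken from this answer.

```python
import re

# Hypothetical, shortened sample: an <a> tag whose attributes span two lines.
html = '<a\nhref="http://www.example.com/test.html" class="url" >link</a>'

# Pattern from the question, with and without the inline (?s) modifier.
plain = r'''(<a.*?)(class=['"].*?['"])([\s]*>)'''
inline = r'''(?s)(<a.*?)(class=['"].*?['"])([\s]*>)'''

# Without DOTALL, "." cannot cross the newline after "<a", so nothing matches.
without_modifier = re.sub(plain, "AAAAAAA", html)

# With (?s), "." matches newlines too, and no flags argument is needed.
with_modifier = re.sub(inline, "AAAAAAA", html)

print(repr(without_modifier))
print(repr(with_modifier))
```

Because the modifier lives inside the pattern, it cannot end up in the wrong argument slot.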

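However, using something like bs4 is probably more convenient (as you can see in @alecxe's answer).

Answer 2 (score: 1)

It's quite simple: the Python Standard Library reference says the syntax of re.sub is: re.sub(pattern, repl, string, count=0, flags=0). So your last sub call is actually (re.DOTALL == 16):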

re.sub(regstr, "AAAAAAA", tout2, count = 16, flags = 0 )

when what you need is:

re.sub(regstr, "AAAAAAA", tout2, flags = re.DOTALL )

and with that, the last sub works perfectly...
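
A minimal sketch of the difference between the two calls above, using a hypothetical one-line sample instead of the full HTML from the question:

```python
import re

# Hypothetical sample with a newline inside the tag.
text = 'start<a\nclass="url" >end'
pattern = r'''(<a.*?)(class=['"].*?['"])([\s]*>)'''

# re.DOTALL has an integer value, so it is silently accepted as a count.
dotall_value = int(re.DOTALL)

# What the original call amounts to: count=16, flags=0.
as_count = re.sub(pattern, "AAAAAAA", text, count=dotall_value)

# Passing the flag by keyword applies it as a flag.
as_flag = re.sub(pattern, "AAAAAAA", text, flags=re.DOTALL)

print(dotall_value)
print(repr(as_count))
print(repr(as_flag))
```

With count=16 and no flags, "." stops at the newline and the string comes back unchanged; with flags=re.DOTALL the tag is replaced.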

Answer 3 (score: 1)

The problem is that your arguments are wrong.

Python 2.7 source:

# in module re
def sub(pattern, repl, string, count=0, flags=0):
    pass  # implementation omitted

Here, your re.DOTALL argument is treated as the count argument.

FIX: use re.sub(regstr, "AAAAAAA", tout2, flags=re.DOTALL ) instead.

Note: if you use a compiled regex (re.compile), sub works fine.
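
A short sketch of the compiled-pattern route mentioned in the note. The sample string is hypothetical; the point is that the flag is given once to re.compile, and the compiled pattern's sub() takes only repl, string and an optional count.

```python
import re

regstr = r'''(<a.*?)(class=['"].*?['"])([\s]*>)'''
pat = re.compile(regstr, re.DOTALL)  # flags are fixed at compile time

# Hypothetical input with a newline inside the tag.
sample = '<a\nhref="x.html" class="url" >text</a>'

# Keep groups 1 and 3, drop the class attribute (group 2).
cleaned = pat.sub(r'\1\3', sample)
print(repr(cleaned))
```

Note that passing re.DOTALL to pat.sub() as well would again land in the count slot, so it is left out here.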

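Answer 4 (score: 0)

Well, apparently in this case I should use a compiled regex object (rather than calling through the re. module directly), and then everything seems to work (even backreferences), but I still don't understand why the problem occurs. It would be nice to finally know why... Anyway, here is the corrected code snippet: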

# remove <a id>
tout2 = re.sub(r'''<a[\s]*?id=['"].*?['"][\s]*?></a>''', " ", tstr, flags=re.DOTALL)
# remove class= in <a
regstr = r'''(<a.*?)(class=['"].*?['"])([\s]*>)'''
pat = re.compile(regstr, re.DOTALL)
#~ print(  re.findall(regstr, tout2, re.DOTALL))             # finds
print(  pat.findall(tout2))             # finds
print("------") #
# re.purge() # no need
print(      pat.sub(r'\1AAAAAAA\3', tout2)) # replaces; re.DOTALL was already given to re.compile

...and this is the output:

[('<a\nhref="http://www.example.com/test.html" ', 'class="url"', ' >')]
------
    <div class="thebibliography">
   <p class="bibitem" ><span class="biblabel">
 [1]<span class="bibsp">   </span></span> <span
class="cmcsc-10">A<span
class="small-caps">k</span><span
class="small-caps">e</span><span
class="small-caps">g</span><span
class="small-caps">c</span><span
class="small-caps">t</span><span
class="small-caps">o</span><span
class="small-caps">r</span>,</span>
   <span
class="cmcsc-10">P. D.</span><span
class="cmcsc-10"> H.  </span> testöng ... .  <span
class="cmti-10">Draftin:</span>
   <a
href="http://www.example.com/test.html" AAAAAAA ><span
class="cmitt-10">http://www.example.com/test.html</span></a> (2001).
</p>
   </div>