Question

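I'm trying to add a line to a set of HTML files.

I want to put it between the </h1> and <p> tags, so I'm looking for a regex that captures everything between those tags (there may be newlines, spaces, or nothing at all) and then replaces it with the html_line I prepared earlier.

So far I have this: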

import codecs

# filesToBeChanged: list of HTML file paths (defined elsewhere)
for i in filesToBeChanged:
    lines = codecs.open(i, 'r', 'utf-8').readlines()
    for line in lines:
        if line.find('</h1>') != -1:  # here I probably need some .replace() :)
            print(line)

Answer 1

You can use the following regex with re.sub:

(?s)<\/h1>(.*?)<p>

(?s) enables single-line (DOTALL) mode, so that . also matches newline characters.
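As a quick check of what the flag changes, here is a small sketch that runs the same pattern with and without (?s). The sample string is made up for illustration only.

```python
import re

# Hypothetical sample: a newline sits between </h1> and <p>.
text = "<h1>Title</h1>\n<p>Body</p>"

# Without (?s), "." stops at the newline, so nothing matches.
without_flag = re.search(r'</h1>(.*?)<p>', text)

# With (?s) (the inline form of re.DOTALL), "." also matches "\n".
with_flag = re.search(r'(?s)</h1>(.*?)<p>', text)

print(without_flag)               # None
print(repr(with_flag.group(1)))   # '\n'
```

Passing flags=re.DOTALL to re.search or re.compile should behave the same as the inline (?s).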

Sample code:

import re

p = re.compile(r'(?s)<\/h1>(.*?)<p>')
test_str = u"I want to put it between the </h1> and\nand <p> tags,"
# "\\1" re-inserts the captured text; a bare "\1" in a non-raw string would be the character \x01
subst = u"</h1>\\1\n<tag att=\"va\">NEW TEXT</tag>\n<p>"
result = re.sub(p, subst, test_str)
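To tie this back to the loop in the question, here is a minimal sketch that reads each file in one piece (rather than line by line, since the gap between </h1> and <p> may span several lines) and writes the result back. The helper names insert_between and process_files and the sample html_line value are hypothetical; the list of paths would be filesToBeChanged from the question.

```python
import io
import re

# Same idea as the pattern above; (?s) lets "." cross line breaks.
PATTERN = re.compile(r'(?s)</h1>(.*?)<p>')


def insert_between(text, html_line):
    """Return text with html_line placed between each </h1> and the next <p>."""
    # A function replacement keeps any backslashes in html_line literal.
    def repl(match):
        return u'</h1>' + match.group(1) + u'\n' + html_line + u'\n<p>'
    return PATTERN.sub(repl, text)


def process_files(paths, html_line):
    """Rewrite each file in place (hypothetical helper; back up the files first)."""
    for path in paths:
        with io.open(path, 'r', encoding='utf-8') as f:
            text = f.read()
        with io.open(path, 'w', encoding='utf-8') as f:
            f.write(insert_between(text, html_line))


print(insert_between(u'<h1>A</h1>\n<p>b</p>', u'<div>NEW</div>'))
```

Note that the pattern pairs every </h1> with the next <p> that appears anywhere after it, even if other elements sit in between, so it is worth checking the result on one file before running it over the whole set.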

Answer 2

It's better to use BeautifulSoup or lxml for HTML processing.

Something like this:

from bs4 import BeautifulSoup

html_doc = """
<h1>First header</h1>
<p>first paragraph</p>
<h1>Second header</h1>
<p>second paragraph</p>
<h3>Third header</h3>
"""

soup = BeautifulSoup(html_doc, 'html.parser')  # name the parser explicitly
for h1 in soup.find_all('h1'):
    # only insert when a <p> sibling follows this <h1>
    if h1.find_next_sibling('p'):
        h1.insert_after('\nSome text')
print(soup)

Output:

<h1>First header</h1>
Some text
<p>first paragraph</p>
<h1>Second header</h1>
Some text
<p>second paragraph</p>
<h3>Third header</h3>

Answer 3

If you can use lookahead and lookbehind, this should work:

(?<=\<\/h1\>)[\S\s]*(?=\<p\>)
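A minimal sketch of how this lookaround pattern could be used with Python's re module. One change here is my own assumption: the quantifier is made lazy (*?), because the greedy [\S\s]* would run on to the last <p> when the text contains more than one. The html_line value and the sample text are hypothetical.

```python
import re

# Lookbehind and lookahead keep </h1> and <p> out of the match,
# so only the gap between them gets replaced.
# Assumption: lazy *? instead of the greedy * in the pattern above.
gap = re.compile(r'(?<=</h1>)[\S\s]*?(?=<p>)')

html_line = u'<div>NEW</div>'  # hypothetical line to insert
text = u'<h1>A</h1>\n<p>a</p>\n<h1>B</h1><p>b</p>'

# A function replacement keeps any backslashes in html_line literal.
result = gap.sub(lambda m: u'\n' + html_line + u'\n', text)
print(result)
```

Unlike the capture-group approach, this replaces whatever was between the two tags instead of keeping it, which matches what the question asks for.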

Find & replace HTML tags using Python
