Question

Given a string like

"<p> >this line starts with an arrow <br /> this line does not </p>"

or

"<p> >this line starts with an arrow </p> <p> this line does not </p>"

how can I find the lines that start with an arrow and wrap them in a div, so that it becomes:

"<p> <div> >this line starts with an arrow </div> <br /> this line does not </p>"

Answer 1

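Since it is HTML you are parsing, use the right tool for the job: an HTML parser such as BeautifulSoup.

Use find_all() to find all text nodes that start with >, then wrap() each one in a new div tag: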

from bs4 import BeautifulSoup

data = "<p> >this line starts with an arrow <br /> this line does not </p>"

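soup = BeautifulSoup(data, "html.parser")  # name the parser explicitly
# Find every text node whose stripped content starts with '>' and wrap it in a <div>
for item in soup.find_all(string=lambda s: s and s.strip().startswith('>')):
    item.wrap(soup.new_tag('div'))

print(soup.prettify())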

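This prints roughly the following (prettify() may indent differently and may render the arrow as &gt;):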

<p>
    <div>
    >this line starts with an arrow
    </div>
    <br/>
    this line does not
</p>

Answer 2

You can try the regular expression pattern >\s+(>.*?)<.

import re

# Raw string keeps the backslashes readable; same pattern as above
regex = re.compile(r">\s+(>.*?)<")
test_string = ""  # fill this in
matches = regex.findall(test_string)
# matches holds the captured group (the text starting with '>') for each match

Then replace the matched group with <div> matched_group </div>. The pattern looks for anything enclosed between "> >" and "<".

Here is a demo on Debuggex.
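The replacement step described above could be sketched as follows. This is only one way to do it, not necessarily what the answer had in mind: the extra capture groups and the names ARROW_TEXT and wrap_arrow_text are illustrative assumptions, added so the surrounding ">" and "<" can be written back unchanged.

```python
import re

# Same idea as the pattern above, but the leading "> " and the trailing "<"
# are captured as well so they survive the substitution.
# ARROW_TEXT and wrap_arrow_text are hypothetical names for this sketch.
ARROW_TEXT = re.compile(r"(>\s+)(>.*?)(<)")


def wrap_arrow_text(html):
    """Wrap text that starts with '>' (right after a tag) in <div> ... </div>."""
    return ARROW_TEXT.sub(r"\1<div> \2</div> \3", html)


sample = "<p> >this line starts with an arrow <br /> this line does not </p>"
result = wrap_arrow_text(sample)
print(result)
```

Note that "." does not match newlines by default, so text spanning several lines would need re.DOTALL, and a regex like this is easy to break with real-world HTML.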

Answer 3

You can try this regular expression:

>(\w[^<]*)

The Python code would be:

>>> import re
>>> text = '"<p> >this line starts with an arrow <br /> this line does not </p>"'  # avoid shadowing the built-in str
>>> m = re.sub(r'>(\w[^<]*)', r"<div> >\1</div> ", text)
>>> m
'"<p> <div> >this line starts with an arrow </div> <br /> this line does not </p>"'
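
One caveat worth knowing: the pattern fires on any > that is immediately followed by a word character, and that includes the closing > of a tag in markup like <p>text</p>. A possible variation, which is my own assumption and not part of the answer above, is to require whitespace before the arrow with a lookbehind. STRICT_ARROW and wrap_strict are hypothetical names:

```python
import re

# Hypothetical stricter variant: the arrow must be preceded by whitespace,
# so the ">" that closes a tag such as "<p>" is not mistaken for an arrow.
STRICT_ARROW = re.compile(r"(?<=\s)>(\w[^<]*)")


def wrap_strict(html):
    """Wrap '>text' runs in <div> ... </div> only when whitespace precedes '>'."""
    return STRICT_ARROW.sub(r"<div> >\1</div> ", html)


print(wrap_strict("<p>plain text</p> <p> >quoted text </p>"))
```

This still treats any whitespace-preceded > followed by a word character as an arrow, so it is a heuristic for simple input like the samples in the question rather than a general solution.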

How to get the text between certain tags and replace it