Question

I was wondering what the easiest way is to wrap an element with another element using lxml and Python. For example, if I have an HTML snippet:

<h1>The cool title</h1>
<p>Something Neat</p>
<table>
<tr>
<td>aaa</td>
<td>bbb</td>
</tr>
</table>
<p>The end of the snippet</p>

and I want to wrap the table element with a section element like this:

<h1>The cool title</h1>
<p>Something Neat</p>
<section>
<table>
<tr>
<td>aaa</td>
<td>bbb</td>
</tr>
</table>
</section>
<p>The end of the snippet</p>

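Another thing I would like to do is search the XML document for h1 elements with a certain attribute, then wrap each one, together with all the elements up to the next h1 tag, in a section element. For example: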

<h1 class='neat'>Subject 1</h1>
<p>Here is a bunch of boring text</p>
<h2>Minor Heading</h2>
<p>Here is some more</p>
<h1 class='neat'>Subject 2</h1>
<p>And Even More</p>

converted to:

<section>
<h1 class='neat'>Subject 1</h1>
<p>Here is a bunch of boring text</p>
<h2>Minor Heading</h2>
<p>Here is some more</p>
</section>
<section>
<h1 class='neat'>Subject 2</h1>
<p>And Even More</p>
</section>

Thanks for all the help, Chris

Answer 1

lxml is awesome for parsing well-formed XML, but not so good if you have HTML that is not XHTML. If that is the case, go for BeautifulSoup, as the other answer suggests.

With lxml, this is a fairly straightforward way to insert a section around every table in the document:

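import lxml.etree as ET

# Placeholder: substitute your own well-formed markup here.
TEST = "<html><h1>...</html>"

def insert_section(root):
    # findall returns a list, so moving tables inside the loop is safe
    tables = root.findall(".//table")
    for table in tables:
        section = ET.Element("section")
        table.addprevious(section)   # put the section where the table is
        section.insert(0, table)     # this moves the table (and its tail text)

root = ET.fromstring(TEST)
insert_section(root)
print(ET.tostring(root, encoding="unicode"))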
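
One detail worth knowing when moving elements in lxml: any text that follows an element (its tail) travels with it. The sketch below applies the same wrapping idea to the fragment from the question and keeps the tail outside the new section. The `wrap` helper and the enclosing `<div>` are assumptions added for illustration; the `<div>` is only there so that the fragment has a single root and parses as XML.

```python
import lxml.etree as ET

# The question's fragment, wrapped in a <div> (an assumption) to give it one root.
FRAGMENT = """<div>
<h1>The cool title</h1>
<p>Something Neat</p>
<table><tr><td>aaa</td><td>bbb</td></tr></table>
<p>The end of the snippet</p>
</div>"""

def wrap(element, tag):
    """Hypothetical helper: wrap `element` in a new `tag` element,
    leaving the element's tail text outside the wrapper."""
    wrapper = ET.Element(tag)
    element.addprevious(wrapper)   # wrapper becomes the preceding sibling
    wrapper.tail = element.tail    # text after the element stays after the wrapper
    element.tail = None
    wrapper.append(element)        # append moves the element into the wrapper
    return wrapper

root = ET.fromstring(FRAGMENT)
for table in root.findall(".//table"):
    wrap(table, "section")

result = ET.tostring(root, encoding="unicode")
print(result)
```

Without the two `tail` lines the whitespace after `</table>` would end up inside `<section>`, which is harmless here but matters when the tail holds real text.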

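You can do something similar to wrap the headings, but you would need to iterate through all the elements you want to wrap and move them into the section. element.index(child) and list slices might help here.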
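
As a rough sketch of the heading case, the loop below walks the children of one parent in a single pass instead of using element.index(child) and slices. The function name `wrap_heading_groups`, the `<body>` root, and the rule that an h1 without the class ends the current group are all assumptions made for illustration; the question does not say how such headings should be handled.

```python
import lxml.etree as ET

# The question's second example inside a <body> root (an assumption).
DOC = """<body>
<h1 class="neat">Subject 1</h1>
<p>Here is a bunch of boring text</p>
<h2>Minor Heading</h2>
<p>Here is some more</p>
<h1 class="neat">Subject 2</h1>
<p>And Even More</p>
</body>"""

def wrap_heading_groups(parent, css_class="neat"):
    """Wrap each matching h1 and the siblings up to the next h1 in a <section>."""
    section = None
    for child in list(parent):          # copy: children are moved while looping
        is_h1 = child.tag == "h1"
        if is_h1 and child.get("class") == css_class:
            section = ET.Element("section")
            child.addprevious(section)  # the section takes the heading's place
        elif is_h1:
            section = None              # assumption: a plain h1 ends the group
        if section is not None:
            section.append(child)       # moves the element along with its tail

root = ET.fromstring(DOC)
wrap_heading_groups(root)
print(ET.tostring(root, encoding="unicode"))
```

This only groups direct children of `parent`; headings nested at different depths would need the function to be called on each parent separately.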

Answer 2

If you are parsing some XML files, you can use BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/

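Beautiful Soup is a great way to represent XML as Python objects. You can then write Python code to analyze the HTML and add or remove tags. So you could have an is_h1 function that finds all the h1 tags in your XML file, and then add a tag with Beautiful Soup.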
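
A minimal sketch of that idea, assuming the current `bs4` package and Python's built-in `html.parser` backend. The filter name `is_neat_h1` is hypothetical, modeled on the is_h1 function mentioned above, and the sample markup is borrowed from the question.

```python
from bs4 import BeautifulSoup

HTML = """<h1 class="neat">Subject 1</h1>
<p>Here is a bunch of boring text</p>
<table><tr><td>aaa</td><td>bbb</td></tr></table>"""

def is_neat_h1(tag):
    """Hypothetical filter: True for <h1> tags whose class list contains 'neat'."""
    return tag.name == "h1" and "neat" in tag.get("class", [])

soup = BeautifulSoup(HTML, "html.parser")

# find_all accepts a function and returns every tag for which it is true.
headings = soup.find_all(is_neat_h1)

# wrap() puts the tag inside the given new tag, in place.
for table in soup.find_all("table"):
    table.wrap(soup.new_tag("section"))

print(soup)
```

Because `html.parser` is lenient, this also works on fragments that are not well-formed XML, which is the situation the first answer warns about.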

If you want to return this to the browser, you could use an HttpResponse whose argument is a string representation of the finished XML.

Python lxml wrapping elements
