Question

Suppose I have an HTML fragment like this:

<div>
  Hello <strong>There</strong>
  <div>I think <em>I am</em> feeing better!</div>
  <div>Don't you?</div>
  Yup!
</div>

What is the best/most robust way to remove the surrounding root element, so that it looks like this:

Hello <strong>There</strong>
<div>I think <em>I am</em> feeing better!</div>
<div>Don't you?</div>
Yup!

I tried using lxml.html like this:

lxml.html.fromstring(fragment_string).drop_tag()

But that only gives me "Hello", which I guess makes sense. Any better ideas?

Answer 1

This is a bit odd in lxml (or ElementTree). You have to do this:

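from lxml.html import tostring

def inner_html(el):
    # el.text is the text before the first child; each child is serialized with its tail text
    return (el.text or '') + ''.join(tostring(child, encoding='unicode') for child in el)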

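Note that lxml (and ElementTree) have no special way to represent a document other than rooted at a single element, but .drop_tag() would work the way you want if that <div> were not the root element.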
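
As a rough sketch of that helper in action, here is the same idea using only the standard library's xml.etree.ElementTree, which the answer mentions alongside lxml. It assumes the fragment happens to be well-formed XML (true for the sample in the question, not for HTML in general, where lxml.html would be the parser to reach for). FRAGMENT and result are names made up for this illustration.

```python
import xml.etree.ElementTree as ET

# The sample fragment from the question (it is also well-formed XML).
FRAGMENT = """<div>
  Hello <strong>There</strong>
  <div>I think <em>I am</em> feeing better!</div>
  <div>Don't you?</div>
  Yup!
</div>"""


def inner_html(el):
    """Return the markup inside el, without el's own start and end tags."""
    # el.text holds the text before the first child element.
    # ET.tostring() serializes a child together with its tail text,
    # so the text that follows each child is kept as well.
    return (el.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in el
    )


root = ET.fromstring(FRAGMENT)
result = inner_html(root)
print(result)
```

Each child is serialized together with its tail text, which is why the trailing "Yup!" survives even though it sits after the last inner <div>.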

Answer 2

You can use the BeautifulSoup package. For this particular HTML I would go with something like this:

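from bs4 import BeautifulSoup

html = """<div>
  Hello <strong>There</strong>
  <div>I think <em>I am</em> feeing better!</div>
  <div>Don't you?</div>
  Yup!
</div>"""

# Parse with the standard-library parser so no extra parser package is needed
bs = BeautifulSoup(html, "html.parser")

# Serialize each child of the root <div>; the text nodes already carry the line breaks
no_root = ''.join(map(str, bs.div.contents))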

BeautifulSoup has a lot of nice features that let you adapt this example to many other situations. Full documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html.

Answer 3

For such a simple task you could use a regexp like r'<(.*?)>(.*)</\1>' and take match #2 from it (\2 in Perl terms).

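You should also add flags like ms so that it works correctly across multiple lines (in Python, re.DOTALL is the flag that lets . match newlines).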
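
A minimal sketch of how that might look in Python, using the pattern from the answer unchanged. The helper name strip_root and the fall-back of returning the input when nothing matches are assumptions made for this illustration. Only re.DOTALL is passed, since the pattern has no ^ or $ anchors for re.MULTILINE to affect.

```python
import re

# The sample fragment from the question.
FRAGMENT = """<div>
  Hello <strong>There</strong>
  <div>I think <em>I am</em> feeing better!</div>
  <div>Don't you?</div>
  Yup!
</div>"""

# Group 1 captures the root tag name, group 2 everything up to the LAST
# matching close tag (the second group is greedy). re.DOTALL lets "."
# match newlines, so the pattern can span a multi-line fragment.
ROOT_RE = re.compile(r"<(.*?)>(.*)</\1>", re.DOTALL)


def strip_root(fragment):
    """Return the content inside the outermost tag, or the input unchanged."""
    match = ROOT_RE.match(fragment.strip())
    if match is None:
        return fragment  # no single wrapping tag was recognized
    return match.group(2)


inner = strip_root(FRAGMENT)
print(inner)
```

As written, the pattern expects a root tag without attributes: for something like <div class="x"> the backreference \1 would have to reappear in full in the closing tag, so no match is found and the input comes back unchanged.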

In Python, how do I remove the "root" tag in an HTML fragment?
