ElementTree: text mixed with tags

Time: 2015-12-16 18:07:19

Tags: python html elementtree

Consider the following markup:

<description>
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>.
</description>

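How can I parse this with the etree interface? Once I have the description element, its .text attribute returns only the text before the first <b> element, and the getchildren() method returns the <b> elements but not the rest of the text.

Thanks a lot!

1 answer:

Answer 0: (score: 1)

Use .text_content(), which returns the text of the element together with the text of all its descendants. A working example with lxml.html: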

from lxml.html import fromstring   

data = """
<description>
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>.
</description>
"""

tree = fromstring(data)

print(tree.xpath("//description")[0].text_content().strip())

Prints:

the thing stuff is very important for various reasons, notably other things.
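
For comparison, here is a minimal sketch of the same idea using the standard-library xml.etree.ElementTree instead of lxml, assuming the input is well-formed XML. It also shows where the "missing" text from the question ends up: .text holds only the text before the first child, and the text following each child is stored on that child's .tail. The variable names (leading, tails, full_text) are illustrative only.

```python
import xml.etree.ElementTree as ET

data = """
<description>
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>.
</description>
"""

# ET.fromstring returns the root element, here <description>
description = ET.fromstring(data)

# .text covers only the text before the first child element
leading = description.text.strip()

# the text that follows each child is stored on that child's .tail
tails = [child.tail for child in description]

# itertext() yields every text and tail string in document order
full_text = "".join(description.itertext()).strip()
print(full_text)
```

Joining the pieces from itertext() gives the same flattened string as text_content() for this input.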
  

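I forgot to mention one thing, sorry. My ideal parsed version would contain a list of sub-sections: [normal("the thing"), bold("stuff"), normal("....")]. Is that possible with the lxml.html library?

Assuming the description contains only text nodes and b elements: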

for item in tree.xpath("//description/*|//description/text()"):
    # text() results are str subclasses in lxml on Python 3; on Python 2 check basestring instead
    print([item.strip(), 'normal'] if isinstance(item, str) else [item.text, 'bold'])

Prints:

['the thing', 'normal']
['stuff', 'bold']
['is very important for various reasons, notably', 'normal']
['other things', 'bold']
['.', 'normal']
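
The same list can also be built without XPath by walking .text and .tail directly. The following is a rough sketch using the standard-library xml.etree.ElementTree, assuming well-formed XML and a description that holds only text and b elements. The helper name split_segments is hypothetical, not part of any library.

```python
import xml.etree.ElementTree as ET

data = """
<description>
the thing <b>stuff</b> is very important for various reasons, notably <b>other things</b>.
</description>
"""


def split_segments(elem):
    """Hypothetical helper: return [text, style] pairs in document order."""
    segments = []
    # text before the first child element
    if elem.text and elem.text.strip():
        segments.append([elem.text.strip(), "normal"])
    for child in elem:
        # a <b> child is reported as bold, anything else as normal
        style = "bold" if child.tag == "b" else "normal"
        segments.append(["".join(child.itertext()).strip(), style])
        # text between this child and the next element (or the closing tag)
        if child.tail and child.tail.strip():
            segments.append([child.tail.strip(), "normal"])
    return segments


segments = split_segments(ET.fromstring(data))
for segment in segments:
    print(segment)
```

Whitespace-only text and tail strings are skipped, so an input with two adjacent b elements would not produce empty "normal" entries.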