Question

import lxml.html as PARSER
from lxml.html import fromstring

data = """<TextFormat>06</TextFormat>
<Text><![CDATA[<html><body><p>Ducdame was John Cowper Powys<p>other text</p></p></body></html>]]></Text>"""
root = PARSER.fromstring(data)

for ele in root.iter():
    # lxml.html lowercases tag names, so <Text> is matched as 'text'
    if ele.tag == 'text':
        print(ele.text_content())

This is what I get now -> Ducdame was John Cowper Powysother text

But I need the entire content inside the "Text" tag. This is the result I expect:

<![CDATA[<html><body><p>Ducdame was John Cowper Powys<p>other text</p></p></body></html>]]>

I have tried lxml and BeautifulSoup, but neither gave me the result I expect. I really need help.

Thanks

Answer 1

The following example uses the minidom module.

import xml.dom.minidom

data = """<Text><![CDATA[<html><body><p>Ducdame was John Cowper Powys<p>other text</p></p></body></html>]]></Text>"""

p = xml.dom.minidom.parseString(data)
p = p.childNodes[0]
p = p.childNodes[0]
print(p.toxml())
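
The snippet above parses only the <Text> element. The data in the question has two top-level elements, which is not well-formed XML on its own. A minimal sketch of one way to handle that, assuming it is acceptable to wrap the input in a hypothetical <root> element (the helper name inner_xml is also made up for illustration):

```python
import xml.dom.minidom


def inner_xml(data, tag_name):
    """Return the raw markup inside the first <tag_name> element, CDATA included."""
    # Assumption: wrapping the fragment in a made-up <root> element is fine;
    # it only serves to make the input well-formed.
    doc = xml.dom.minidom.parseString("<root>" + data + "</root>")
    element = doc.getElementsByTagName(tag_name)[0]
    # Serialize each child node; a CDATA section is written as <![CDATA[...]]>.
    return "".join(child.toxml() for child in element.childNodes)


data = """<TextFormat>06</TextFormat>
<Text><![CDATA[<html><body><p>Ducdame was John Cowper Powys<p>other text</p></p></body></html>]]></Text>"""

print(inner_xml(data, "Text"))
```

Note that getElementsByTagName is case-sensitive here, so the tag name has to be given as "Text", not "text".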

Answer 2

Here is an example with lxml. To find the right tag, use the XPath expression .//text:

from lxml import html
from lxml import etree

text = """<TextFormat>06</TextFormat>
<Text><![CDATA[<html><body><p>Ducdame was John Cowper Powys<p>other text</p></p></body>  </html>]]></Text>"""

tree = html.fromstring(text)
tags = tree.xpath('.//text')

text_tag = tags[-1]
print(etree.tostring(text_tag, encoding="unicode"))

Output

'<text><p>Ducdame was John Cowper Powys</p><p>other text</p></text>'

If you also need the CDATA section, you may find the following post useful: How to output CDATA using ElementTree

Get the entire content inside a tag, including HTML tags
