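I have an XML file (actually an XML stylesheet). Using Python, I want to remove all of the tags from it and keep only the text between the tags.
What is the simplest solution for this? I saw a similar question here: How to remove all html tags from downloaded page
but for some reason it does not seem to apply to my case. Note that I do not want to keep the quote-delimited text inside the tags; I really want to remove everything that starts with "<" and ends with ">".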
Answer 0 (score: 3)
You can use xml.parsers.expat:
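from xml.parsers.expat import ParserCreate
def char_data(data):
    # Called for each run of character data between tags
    if data.strip():  # skip whitespace-only text if you want
        print(data)
parser = ParserCreate()
parser.CharacterDataHandler = char_data
parser.Parse(doc, True)  # doc holds the XML source; True marks the final chunk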
Or xml.sax:
from xml.sax import make_parser, handler
class extract_text(handler.ContentHandler):
    def characters(self, data):
        # Called for each run of character data between tags
        if data.strip():
            print(data)
parser = make_parser()
parser.setContentHandler(extract_text())
parser.feed(doc)
parser.close()  # signal the end of the input
If it is not well-formed XML, you can also try HTMLParser:
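from html.parser import HTMLParser  # on Python 2: from HTMLParser import HTMLParser
class extract_text(HTMLParser):
    def handle_data(self, data):
        # Called for each run of text outside of tags
        if data.strip():
            print(data)
parser = extract_text()
parser.feed(doc)
parser.close()  # flush any buffered text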
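The handlers above print each piece of text as it arrives. If you would rather get the text back as one string, a minimal sketch (Python 3) could collect the pieces in a list and fall back to HTMLParser when expat rejects the input. The names `text_via_expat`, `text_via_htmlparser`, `strip_tags`, and `_TextCollector` are made up for this illustration and are not part of any library.

```python
from html.parser import HTMLParser
from xml.parsers.expat import ExpatError, ParserCreate


def text_via_expat(doc):
    """Return the character data of a well-formed XML document."""
    chunks = []
    parser = ParserCreate()
    # expat may deliver one text node in several pieces, so collect and join.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(doc, True)
    return "".join(chunks)


class _TextCollector(HTMLParser):
    """Hypothetical helper: gathers text found outside of tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def text_via_htmlparser(doc):
    """Return the text of markup that need not be well-formed."""
    collector = _TextCollector()
    collector.feed(doc)
    collector.close()  # flush any text still buffered
    return "".join(collector.chunks)


def strip_tags(doc):
    """Try the strict XML parser first, then the lenient HTML parser."""
    try:
        return text_via_expat(doc)
    except ExpatError:
        return text_via_htmlparser(doc)
```

For example, `strip_tags("<a>Hello <b>world</b>!</a>")` goes through expat, while `strip_tags("<p>Hello <b>world</p>")` has mismatched tags and so takes the HTMLParser path.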
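Answer 1 (score: 0)
Use the ElementTree API (or lxml, a faster implementation of the same API), then use the ET.tostring(tree, method='text') function to serialize the tree back to its text content: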
>>> from xml.etree import ElementTree as ET
>>> doc='''\
... <?xml-stylesheet href="common.css"?>
... <?xml-stylesheet href="modern.css"
... title="Modern" media="screen"
... type="text/css"?>
... <?xml-stylesheet href="classic.css"
... alternate="yes" title="Classic"
... media="screen, print" type="text/css"?>
... <ARTICLE>
... <HEADLINE>Fredrick the Great meets
... Bach</HEADLINE>
... <AUTHOR>Johann Nikolaus Forkel</AUTHOR>
... <PARA>
... One evening, just as he was
... getting his
... <INSTRUMENT>flute</INSTRUMENT>
... ready and his musicians were
... assembled, an officer brought him a
... list of the strangers who had arrived.
... </PARA>
... </ARTICLE>
... '''
>>> tree = ET.fromstring(doc)
>>> ET.tostring(tree, method='text')
'\n Fredrick the Great meets\n Bach\n Johann Nikolaus Forkel\n \n One evening, just as he was\n getting his\n flute\n ready and his musicians were\n assembled, an officer brought him a\n list of the strangers who had arrived.\n \n'
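The session above shows Python 2 output. On Python 3, `ET.tostring` returns bytes unless you pass `encoding="unicode"`, and the result still carries the document's line breaks and indentation. A small sketch, assuming a shortened version of the sample document and that collapsing whitespace runs into single spaces is acceptable for your use:

```python
from xml.etree import ElementTree as ET

# Shortened sample, not the full document from the answer.
doc = """<ARTICLE>
  <HEADLINE>Fredrick the Great meets
  Bach</HEADLINE>
  <AUTHOR>Johann Nikolaus Forkel</AUTHOR>
</ARTICLE>"""

root = ET.fromstring(doc)

# encoding="unicode" asks for a str instead of bytes (Python 3).
raw_text = ET.tostring(root, method="text", encoding="unicode")

# itertext() yields the same text nodes one by one; split() + join()
# collapses every run of whitespace into a single space.
compact = " ".join("".join(root.itertext()).split())
print(compact)
```

Here `raw_text` keeps the original whitespace, while `compact` is a single line of text.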
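Answer 2 (score: 0)
lxml can be problematic; you can do exactly what Martijn Pieters said using ElementTree, or its C version cElementTree, both of which are in the standard library.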
>>> from xml.etree import ElementTree
>>> doc='''
... <?xml-stylesheet href="common.css"?>
... <?xml-stylesheet href="modern.css"
... title="Modern" media="screen"
... type="text/css"?>
... <?xml-stylesheet href="classic.css"
... alternate="yes" title="Classic"
... media="screen, print" type="text/css"?>
... <ARTICLE>
... <HEADLINE>Fredrick the Great meets
... Bach</HEADLINE>
... <AUTHOR>Johann Nikolaus Forkel</AUTHOR>
... <PARA>
... One evening, just as he was
... getting his
... <INSTRUMENT>flute</INSTRUMENT>
... ready and his musicians were
... assembled, an officer brought him a
... list of the strangers who had arrived.
... </PARA>
... </ARTICLE>
... '''
>>> xml = ElementTree.fromstring(doc)
>>> xml
<Element 'ARTICLE' at 0x9295e6c>
>>> ElementTree.tostring(xml,method='text')
'\n Fredrick the Great meets\n Bach\n Johann Nikolaus Forkel\n \n One evening, just as he was\n getting his\n flute\n ready and his musicians were\n assembled, an officer brought him a\n list of the strangers who had arrived.\n \n '
Note that cElementTree is faster and is also in the standard library, but I think it has some problems with UTF-8, so if you need UTF-8, use "ElementTree".
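On current Python 3 the choice mostly disappears: since 3.3, `xml.etree.ElementTree` uses the C accelerator automatically when it is available, and the separate `cElementTree` module was removed in 3.9. As a rough check of the UTF-8 behaviour, here is a sketch with a made-up sample string:

```python
from xml.etree import ElementTree as ET

# Made-up sample with non-ASCII text, supplied as UTF-8 bytes.
xml_bytes = "<note>Caf\u00e9 <b>M\u00fcnchen</b></note>".encode("utf-8")

root = ET.fromstring(xml_bytes)  # bytes without a declaration are read as UTF-8

# Text content as a str ...
as_str = ET.tostring(root, method="text", encoding="unicode")
# ... or as UTF-8 encoded bytes.
as_utf8 = ET.tostring(root, method="text", encoding="utf-8")

print(as_str)
```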