Consider the following sample XML file:
<?xml version="1.0" encoding="ISO-8859-1"?>
<feats>
<feat>
<name>Blindsight, 5-Ft. Radius</name>
<type>General</type>
<multiple>No</multiple>
<stack>No</stack>
<prerequisite>Base attack bonus +4, Blind-Fight, Wisdom 19.</prerequisite>
<benefit><div topic="Benefit" level="8"><p><b>Benefit:</b> Using senses such as acute hearing and sensitivity to vibrations, you detect the location of opponents who are no more than 5 feet away from you. <i>Invisibility</i> and <i>darkness</i> are irrelevant, though it you discern incorporeal beings.</p><p/>
</div>
</benefit>
<full_text>
<div topic="Blindsight, 5-Ft. Radius" level="3">Lorem ipsum
</div>
</full_text>
<reference>SRD 3.5 DivineAbilitiesandFeats</reference>
</feat>
</feats>
I want to get the text inside the <benefit> tag as a string, but without the <div> tag (<p> and <b> should not be removed). So in this case, the result would be:
Using senses such as acute hearing and sensitivity to vibrations, you detect the location of opponents who are no more than 5 feet away from you. <i>Invisibility</i> and <i>darkness</i> are irrelevant, though it you discern incorporeal beings.</p><p/>
I managed to get the whole <div> element, but when I try to get the string from it using the .text attribute, it gives me None.
import xml.etree.ElementTree as ET

tree = ET.parse(filename)
root = tree.getroot()
data = {}
for item in root.findall('feat'):
    data["benefit"] = ""
    element = item.find('benefit').find("div")
    print(element.text)  # prints None
Is there a simple way to get this text, or do I have to write a special function for it?
Answer 0 (score: 0)
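I agree with mattR about BeautifulSoup, but I added a regular expression to your code snippet and it gave me a decent result. Note that it strips every tag, including <p>, <b>, and <i>: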
import xml.etree.ElementTree as ET
import re

tree = ET.parse('data.xml')
root = tree.getroot()
data = {}
result = []
for item in root.iter('benefit'):
    # Serialize the element, then strip every tag with a regex.
    # encoding="unicode" returns str on Python 3; on Python 2 use "utf-8".
    cleaned = re.sub(r'<[^>]*>', '', ET.tostring(item, encoding="unicode"))
    result.append(cleaned)
print(result)
# result: ['Benefit: Using senses such as acute hearing and sensitivity to vibrations, you detect the location of opponents who are no more than 5 feet away from you. Invisibility and darkness are irrelevant, though it you discern incorporeal beings.\n\n\n ']
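If dropping all tags is acceptable, the regex can probably be avoided altogether. The sketch below is not part of the original answer; it uses the standard library's Element.itertext(), and the SAMPLE string and the plain_text helper are hypothetical, shortened stand-ins for the question's data.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample modeled on the question's XML.
SAMPLE = (
    '<feats><feat><benefit><div topic="Benefit" level="8">'
    '<p><b>Benefit:</b> You detect opponents. '
    '<i>Invisibility</i> is irrelevant.</p><p/>'
    '</div></benefit></feat></feats>'
)


def plain_text(element):
    """Join all text inside element in document order, dropping the tags."""
    return ''.join(element.itertext())


root = ET.fromstring(SAMPLE)
texts = [plain_text(benefit) for benefit in root.iter('benefit')]
print(texts)
```

Unlike the regex, this never has to serialize the tree, so it cannot be confused by a ">" character inside an attribute value.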
Answer 1 (score: 0)
With lxml, you can first find the <b> element, take its tail, and combine it with the following-sibling elements to produce the desired result, for example:
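from lxml import etree as ET

raw = '''your XML string here'''
root = ET.fromstring(raw)
# b.tail is the text right after </b>; each following sibling is
# serialized together with its own tail text.
b = root.xpath("//benefit/div/p/b")[0]
result = b.tail + ''.join(ET.tostring(node, encoding="unicode") for node in b.xpath("following-sibling::*"))
print(result)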
Output:
Using senses such as acute hearing and sensitivity to vibrations, you detect the location of opponents who are no more than 5 feet away from you. <i>Invisibility</i> and <i>darkness</i> are irrelevant, though it you discern incorporeal beings.
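The reason .text returned None in the question comes down to how ElementTree splits text between an element's text and its children's tail. Here is a minimal illustration using only the standard library; the shortened <p> string is a made-up stand-in for the real data.

```python
import xml.etree.ElementTree as ET

# Hypothetical shortened version of the <p> element from the question.
p = ET.fromstring(
    '<p><b>Benefit:</b> You detect opponents. '
    '<i>Invisibility</i> is irrelevant.</p>'
)
b, i = p[0], p[1]

print(repr(p.text))  # None: nothing sits between <p> and its first child
print(repr(b.text))  # 'Benefit:'
print(repr(b.tail))  # ' You detect opponents. '
print(repr(i.tail))  # ' is irrelevant.'
```

The same applies to the <div> in the question: its first child follows immediately, so its own .text is None and the visible text is spread across the descendants.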
Alternatively, if you simply want the content of <p>, including the markup inside it, you can do it this way (this works with either lxml or xml.etree):
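p = root.find(".//benefit/div/p")
# p.text is None when <p> starts with a child element, hence the "or ''".
result = (p.text or '') + ''.join(ET.tostring(node, encoding="unicode") for node in p)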
Output:
<b>Benefit:</b> Using senses such as acute hearing and sensitivity to vibrations, you detect the location of opponents who are no more than 5 feet away from you. <i>Invisibility</i> and <i>darkness</i> are irrelevant, though it you discern incorporeal beings.
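The same idea can be wrapped in a small helper and applied to the <div> itself, which seems closest to what the question asks for (everything inside <benefit> minus the <div> wrapper). This is only a sketch using the standard library; inner_xml and SAMPLE are hypothetical names, not part of either library.

```python
import xml.etree.ElementTree as ET


def inner_xml(element):
    """Serialize an element's content (text plus child markup) without its own tag."""
    parts = [element.text or '']  # .text is None when a child comes first
    for child in element:
        # tostring() also emits the child's tail text after its closing tag.
        parts.append(ET.tostring(child, encoding='unicode'))
    return ''.join(parts)


# Hypothetical, trimmed-down sample modeled on the question's XML.
SAMPLE = (
    '<benefit><div topic="Benefit" level="8">'
    '<p><b>Benefit:</b> You detect opponents. '
    '<i>Invisibility</i> is irrelevant.</p><p/>'
    '</div></benefit>'
)

div = ET.fromstring(SAMPLE).find('div')
content = inner_xml(div)
print(content)
```

Calling inner_xml on the <p> element instead gives the second variant shown above.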