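For the following XHTML snippet, I need to use BS4 or XPath to extract attribute/value pairs from structured HTML. The attribute name is in an h5 tag, and its value can appear in either a span tag or a p tag.
For the code below I should get the following output as a dictionary:
Husbandary Management: 'Animal: Cow Farmer: Mr smith,'
Milch Category: 'Milk supply'
Services: 'Milk, ghee'
animal colors: 'Red, Green ...'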
<div id="animalcontainer" class="container last fixed-height">
<h5>
Husbandary Management
</h5>
<span>
Animal: Cow
</span>
<span>
Farmer: Mr smith
</span>
<h5>
Milch Category
</h5>
<p>
Milk supply
</p>
<h5>
Services
</h5>
<p>
cow milk, ghee
</p>
<h5>
animal colors
</h5>
<span>
green,red
</span>
</div>
htmlcode.findAll('h5') finds the h5 elements, but I want each h5 element together with the elements that follow it, up to the next h5.
Answer 0 (score: 2)
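An example solution using lxml.html and XPath:
- select all h5 elements
- for each h5 element, select the following sibling elements: following-sibling::*
- that are not h5 themselves: [not(self::h5)]
- and that have a given number of preceding h5 siblings: [count(preceding-sibling::h5) = 1], then 2, then 3, and so on (a for loop using enumerate() starting at 1)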
Sample code, with a simple print of each element's text content (using lxml.html's .text_content() on the elements):
import lxml.html
html = """<div id="animalcontainer" class="container last fixed-height">
<h5>
Husbandary Management
</h5>
<span>
Animal: Cow
</span>
<span>
Farmer: Mr smith
</span>
<h5>
Milch Category
</h5>
<p>
Milk supply
</p>
<h5>
Services
</h5>
<p>
cow milk, ghee
</p>
<h5>
animal colors
</h5>
<span>
green,red
</span>
</div>"""
doc = lxml.html.fromstring(html)
headers = doc.xpath('//div/h5')
for i, header in enumerate(headers, start=1):
    print("--------------------------------")
    print(header.text_content().strip())
    # siblings after this h5 that are not h5 and have exactly i preceding h5 siblings
    for following in header.xpath("""following-sibling::*
                                     [not(self::h5)]
                                     [count(preceding-sibling::h5) = %d]""" % i):
        print("\t", following.text_content().strip())
Output:
--------------------------------
Husbandary Management
	 Animal: Cow
	 Farmer: Mr smith
--------------------------------
Milch Category
	 Milk supply
--------------------------------
Services
	 cow milk, ghee
--------------------------------
animal colors
	 green,red
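The question asked for a dictionary rather than printed lines. As a rough sketch only, the same XPath idea can be wrapped in a function that collects the values per heading. The function name sections_to_dict, the shortened markup, and the choice to join multiple values with a space are assumptions made for illustration; they are not part of the answer above.

```python
import lxml.html

# Shortened copy of the question's markup, inlined so the sketch runs on its own.
HTML = """<div id="animalcontainer" class="container last fixed-height">
<h5>Husbandary Management</h5>
<span>Animal: Cow</span>
<span>Farmer: Mr smith</span>
<h5>Milch Category</h5>
<p>Milk supply</p>
<h5>Services</h5>
<p>cow milk, ghee</p>
<h5>animal colors</h5>
<span>green,red</span>
</div>"""


def sections_to_dict(markup):
    """Map each h5 heading to the joined text of the elements that follow it.

    Hypothetical helper: joining several values with a single space is an
    assumption about the desired output format.
    """
    doc = lxml.html.fromstring(markup)
    result = {}
    for i, header in enumerate(doc.xpath('//div/h5'), start=1):
        # Non-h5 siblings after this header that have exactly i preceding h5s.
        values = header.xpath(
            'following-sibling::*[not(self::h5)]'
            '[count(preceding-sibling::h5) = %d]' % i)
        result[header.text_content().strip()] = ' '.join(
            v.text_content().strip() for v in values)
    return result


data = sections_to_dict(HTML)
print(data)
```

Note that two headings with identical text would overwrite each other in the dictionary; a list of (heading, value) pairs would avoid that if it matters.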
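Answer 1 (score: 0)
I finally did it using BeautifulSoup. It can probably be done more efficiently, because the following solution regenerates the list of siblings for every h5: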
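# addinfo is not defined in this snippet; it is presumably the parsed container element
h5s = addinfo.find_all('h5')
datad = {}
for h5el in h5s:
    # materialize every sibling after this h5 (this is repeated for each h5)
    hcontents = list(h5el.next_siblings)
    txtcontents = []
    for con in hcontents:
        try:
            if con.name == 'h5':
                break  # reached the next section
            txtcontents.append(con.contents)
        except AttributeError:
            # text nodes (e.g. whitespace between tags) have no .contents
            continue
    datad["\n".join(h5el.contents)] = txtcontents
print(datad)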
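On the efficiency remark: one way to avoid rebuilding the sibling list for every h5 is to walk the container's children once and start a new group whenever an h5 appears. The sketch below uses only the standard library's xml.etree.ElementTree instead of BeautifulSoup, which assumes the snippet is well-formed XHTML (as the question states) and contains no HTML-only entities. The function name group_by_heading and the shortened markup are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's markup, inlined so the sketch runs on its own.
XHTML = """<div id="animalcontainer" class="container last fixed-height">
<h5>Husbandary Management</h5>
<span>Animal: Cow</span>
<span>Farmer: Mr smith</span>
<h5>Milch Category</h5>
<p>Milk supply</p>
<h5>Services</h5>
<p>cow milk, ghee</p>
<h5>animal colors</h5>
<span>green,red</span>
</div>"""


def group_by_heading(markup, heading='h5'):
    """Group the text of a container's children under the preceding heading.

    Hypothetical helper: each child is visited exactly once.
    """
    container = ET.fromstring(markup)  # requires well-formed XML
    grouped = {}
    current = None  # list collecting values for the most recent heading
    for child in container:
        text = ''.join(child.itertext()).strip()
        if child.tag == heading:
            current = grouped.setdefault(text, [])
        elif current is not None:
            # children that appear before the first heading are skipped
            current.append(text)
    return grouped


grouped = group_by_heading(XHTML)
print(grouped)
```

Each heading maps to a list of values here, so the two span elements under the first heading stay separate; joining them into one string is a one-line change if a flat string is preferred.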