Please help me write an XPath expression.
HTML:
<div class="TabItem">
<p><strong>Product Composition</strong></p>
<p>93% Polyamide 7% Elastane</p>
<p>Lining: 100% Polyester</p><p>Dress Length: 90 cm</p>
<p><strong>Product Attributes;</strong></p>
<p>: Boat Neck, Long Sleeve, Midi, Zip, Concealed, Laced, Side</p>
<p>Lining Type: Full Lining</p>
</div>
From this HTML I need to get the following dictionary:
data['Product Composition'] = '93% Polyamide 7% Elastane Lining: 100% Polyester Dress Length: 90 cm'
data['Product Attributes;'] = ': Boat Neck, Long Sleeve, Midi, Zip, Concealed, Laced, Side Lining Type: Full Lining'
Importantly, the number of elements can vary, so I need a general solution.
Answer 0 (score: 1)
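Get every strong tag inside a p tag, then take its parent and walk the parent's following siblings until you reach another p tag that contains a strong tag, or until there are no siblings left: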
from lxml.html import fromstring
html_data = """<div class="TabItem">
<p><strong>Product Composition</strong></p>
<p>93% Polyamide 7% Elastane</p>
<p>Lining: 100% Polyester</p><p>Dress Length: 90 cm</p>
<p><strong>Product Attributes;</strong></p>
<p>: Boat Neck, Long Sleeve, Midi, Zip, Concealed, Laced, Side</p>
<p>Lining Type: Full Lining</p>
</div>"""
tree = fromstring(html_data)
data = {}
for strong in tree.xpath('//p/strong'):
    parent = strong.getparent()
    description = []
    # Walk the following siblings until the next <p> that contains a <strong>.
    next_p = parent.getnext()
    while next_p is not None and not next_p.xpath('.//strong'):
        description.append(next_p.text)
        next_p = next_p.getnext()
    data[strong.text] = " ".join(description)
print(data)
Prints:
{'Product Composition': '93% Polyamide 7% Elastane Lining: 100% Polyester Dress Length: 90 cm',
'Product Attributes;': ': Boat Neck, Long Sleeve, Midi, Zip, Concealed, Laced, Side Lining Type: Full Lining'}
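
Since the question asked for an XPath expression, here is a rough sketch of the same grouping done mostly in XPath rather than with a Python sibling loop. It assumes each heading paragraph and its body paragraphs are siblings under the same parent. The function name parse_sections and the XPath variable $n are made up for illustration. The idea is that a body paragraph belongs to heading number n if exactly n heading paragraphs precede it among its siblings.

```python
from lxml.html import fromstring

html_data = """<div class="TabItem">
<p><strong>Product Composition</strong></p>
<p>93% Polyamide 7% Elastane</p>
<p>Lining: 100% Polyester</p><p>Dress Length: 90 cm</p>
<p><strong>Product Attributes;</strong></p>
<p>: Boat Neck, Long Sleeve, Midi, Zip, Concealed, Laced, Side</p>
<p>Lining Type: Full Lining</p>
</div>"""


def parse_sections(html):
    """Map each <p><strong>heading</strong></p> to the text of the <p> tags after it."""
    tree = fromstring(html)
    data = {}
    for header in tree.xpath('//p[strong]'):
        # Position of this heading among the heading paragraphs of its parent (1-based).
        n = header.xpath('count(preceding-sibling::p[strong])') + 1
        # Body paragraphs: following siblings without <strong> that have
        # exactly n heading paragraphs before them.
        body = header.xpath(
            'following-sibling::p[not(strong)]'
            '[count(preceding-sibling::p[strong]) = $n]',
            n=n,
        )
        key = header.xpath('string(strong)').strip()
        data[key] = " ".join(p.text_content().strip() for p in body)
    return data


data = parse_sections(html_data)
print(data)
```

Using text_content() instead of .text also picks up text inside nested tags such as a <b> or <a> within a body paragraph, which may or may not be what you want.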