Trying to select a preceding node using lxml XPath

Asked: 2016-05-04 10:26:07

Tags: python xslt xpath lxml

I am trying to get the preceding sibling of the currently selected node, but I am not sure what I am doing wrong.

Here is the HTML snippet:

source = """
    <div class="zg_itemImmersion">
    <div class="zg_rankDiv"><span class="zg_rankNumber">10.</span></div>
    <div class="zg_itemWrapper" style="height:285px">
       <div class="zg_image">
          <div class="zg_itemImageImmersion"><a  href="
             http://www.amazon.com/Oral-B-Action-Replacement-Electric-Toothbrush/dp/B000AUIFCA/ref=zg_mw_8517148011_10"><img src="http://ecx.images-amazon.com/images/I/41RHKIQXnhL._SL160_SL150_.jpg" alt="Oral-B Floss Action Replacement Elect..." title="Oral-B Floss Action Replacement Elect..."/></a></div>
       </div>
    </div>
"""

What I want is the rank number (zg_rankNumber) of the item whose href contains the ASIN B000AUIFCA.

from lxml import html 
source1 = html.fromstring(source)
links = source1.xpath('//div[@class="zg_itemImmersion"]//div[@class="zg_itemImageImmersion"]/a[contains(@href,"B000AUIFCA")]/@href')

The above returns the correct link, which contains the ASIN I need (B000AUIFCA):

['\n\n\n\n\n\n\nhttp://www.amazon.com/Oral-B-Action-Replacement-Electric-Toothbrush/dp/B000AUIFCA/ref=zg_mw_8517148011_10/191-4138574-0525467']

Now, if the ASIN == B000AUIFCA, I want the preceding rank "10" from [span class="zg_rankNumber"] - ('//span[@class="zg_rankNumber"]//a//@href')

I am using:

link2 = source1.xpath('//div[@class="zg_itemImmersion"]//div[@class="zg_itemImageImmersion"]/a[contains(@href,"B000AUIFCA")]/preceding-sibling::*/text()')

But it returns Null (an empty result).

1 Answer:

Answer 0: (score: 2)

You can use the following XPath:

//div[@class="zg_itemImmersion"]
     [.//div[@class="zg_itemImageImmersion"]/a[contains(@href,"B000AUIFCA")]]
//span[@class="zg_rankNumber"]

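The XPath first finds the 'zg_itemImmersion' div that contains a link whose href includes the target ASIN, B000AUIFCA. It then returns the 'zg_rankNumber' span from within that div. This works where preceding-sibling does not because the span is not a sibling of the <a> element: it sits in a different branch under the same 'zg_itemImmersion' ancestor.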
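
For reference, here is a minimal sketch of how the XPath above might be run with lxml against the snippet from the question. The variable names (tree, asin, rank_xpath, ranks, rank, siblings) are only illustrative, the closing </div> for the outer element is added so the markup is balanced, and passing the ASIN as an XPath variable ($asin) is one possible choice rather than something the question requires.

```python
from lxml import html

# HTML from the question, with the outer div closed.
source = """
<div class="zg_itemImmersion">
  <div class="zg_rankDiv"><span class="zg_rankNumber">10.</span></div>
  <div class="zg_itemWrapper" style="height:285px">
    <div class="zg_image">
      <div class="zg_itemImageImmersion"><a href="
        http://www.amazon.com/Oral-B-Action-Replacement-Electric-Toothbrush/dp/B000AUIFCA/ref=zg_mw_8517148011_10"><img src="http://ecx.images-amazon.com/images/I/41RHKIQXnhL._SL160_SL150_.jpg" alt="Oral-B Floss Action Replacement Elect..." title="Oral-B Floss Action Replacement Elect..."/></a></div>
    </div>
  </div>
</div>
"""

tree = html.fromstring(source)
asin = "B000AUIFCA"  # illustrative variable; passed to XPath as $asin

# Select the item div that contains a matching link, then descend to its rank span.
rank_xpath = (
    '//div[@class="zg_itemImmersion"]'
    '[.//div[@class="zg_itemImageImmersion"]/a[contains(@href, $asin)]]'
    '//span[@class="zg_rankNumber"]/text()'
)
ranks = tree.xpath(rank_xpath, asin=asin)
print(ranks)  # ['10.']

# Strip the trailing dot if only the number is wanted.
rank = ranks[0].rstrip('.') if ranks else None
print(rank)  # 10

# The <a> is the only child of its parent div, so it has no preceding siblings.
siblings = tree.xpath(
    '//div[@class="zg_itemImageImmersion"]'
    '/a[contains(@href, $asin)]/preceding-sibling::*',
    asin=asin,
)
print(siblings)  # []
```

If the page lists several items, the same expression should return only the rank of the item whose link matches, since the predicate is applied to each 'zg_itemImmersion' div separately.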