Extracting links (hrefs) with XPath

Date: 2016-05-15 20:16:56

Tags: html xpath

I am trying to use XPath to extract the links to similar apps from this Google Play Store page:

https://play.google.com/store/apps/details?id=com.mojang.minecraftpe

Below is a screenshot of the links I want to extract (marked in green).

HTML sample

<div class="details"> 
  <a href="/store/apps/details?id=com.imangi.templerun" class="card-click-target"></a>  
  <a title="Temple Run" href="/store/apps/details?id=com.imangi.templerun" class="title">Temple Run 
    <span class="paragraph-end"/> 
  </a>  
  <div>....</div>  
  <div>....</div> 
</div>

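I used the XPath below in the Chrome console to locate a single link, but it does not return the tag's href attribute. It does work for other attributes (for example, "title").

The XPath below does not work (extracting "href"):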

//*[@id="body-content"]/div/div/div[2]/div[1]//*/a[2]/@href

The XPath below works (extracting "title"):

//*[@id="body-content"]/div/div/div[2]/div[1]//*/a[2]/@title


Python code

1 answer:

Answer 0 (score: 1)

The HTML for each tile on the right side of the linked page looks like this*:

<div class="details"> 
  <a href="/store/apps/details?id=com.imangi.templerun" class="card-click-target"></a>  
  <a title="Temple Run" href="/store/apps/details?id=com.imangi.templerun" class="title">Temple Run 
    <span class="paragraph-end"/> 
  </a>  
  <div>....</div>  
  <div>....</div> 
</div>

It turns out that class="title" uniquely identifies the target <a> elements on that page, so the XPath can be as simple as:

//a[@class="title"]/@href
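As a quick offline check, the expression can be run against the stripped-down sample above with lxml. This is only a minimal sketch: the SAMPLE string is the tile markup copied from this answer, not the live page, and the variable names are illustrative. It also shows an alternative that selects the <a> elements themselves and reads their attributes afterwards, which avoids relying on attribute-node results.

```python
from lxml import html

# Stripped-down tile markup copied from the sample above (not the live page).
SAMPLE = """
<div class="details">
  <a href="/store/apps/details?id=com.imangi.templerun" class="card-click-target"></a>
  <a title="Temple Run" href="/store/apps/details?id=com.imangi.templerun" class="title">Temple Run
    <span class="paragraph-end"/>
  </a>
  <div>....</div>
  <div>....</div>
</div>
"""

root = html.fromstring(SAMPLE.strip())

# Attribute-selecting XPath: lxml returns the attribute values as strings.
hrefs = root.xpath('//a[@class="title"]/@href')
titles = root.xpath('//a[@class="title"]/@title')

# Alternative: select the elements, then read the attributes in Python.
links = [(a.get("title"), a.get("href")) for a in root.xpath('//a[@class="title"]')]

print(hrefs)
print(titles)
print(links)
```

In lxml both @href and @title come back as ordinary strings, so the two queries from the question behave the same way here.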

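In any case, the problem you noticed appears to be specific to Chrome's XPath evaluator**. Since you mentioned Python, a simple Python session shows that the XPath itself works fine: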

>>> from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib2 import urlopen
>>> from lxml import html
>>> req = urlopen('https://play.google.com/store/apps/details?id=com.mojang.minecraftpe')
>>> raw = req.read()
>>> root = html.fromstring(raw)
>>> [h for h in root.xpath("//a[@class='title']/@href")]
['/store/apps/details?id=com.imangi.templerun', '/store/apps/details?id=com.lego.superheroes.dccomicsteamup', '/store/apps/details?id=com.turner.freefurall', '/store/apps/details?id=com.mtvn.Nickelodeon.GameOn', '/store/apps/details?id=com.disney.disneycrossyroad_goo', '/store/apps/details?id=com.rovio.angrybirdsstarwars.ads.iap', '/store/apps/details?id=com.rovio.angrybirdstransformers', '/store/apps/details?id=com.disney.dinostampede_goo', '/store/apps/details?id=com.turner.atskisafari', '/store/apps/details?id=com.moose.shopville', '/store/apps/details?id=com.DisneyDigitalBooks.SevenDMineTrain', '/store/apps/details?id=com.turner.copatoon', '/store/apps/details?id=com.turner.wbb2016', '/store/apps/details?id=com.tov.google.ben10Xenodrome', '/store/apps/details?id=com.turner.ggl.gumballrainbowruckus', '/store/apps/details?id=com.lego.starwars.theyodachronicles', '/store/apps/details?id=com.mojang.scrolls']
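The extracted hrefs are relative to the site root. If absolute links are needed, one possible way is to resolve them against the page URL with the standard library's urljoin. The sketch below assumes the page URL from the question as the base and uses two of the hrefs from the output above; PAGE_URL, relative_hrefs, and absolute_urls are illustrative names.

```python
from urllib.parse import urljoin

# Base URL of the page the hrefs were extracted from (taken from the question).
PAGE_URL = "https://play.google.com/store/apps/details?id=com.mojang.minecraftpe"

# Two of the relative hrefs from the output above.
relative_hrefs = [
    "/store/apps/details?id=com.imangi.templerun",
    "/store/apps/details?id=com.mojang.scrolls",
]

# urljoin keeps the scheme and host of the base and replaces the path and query.
absolute_urls = [urljoin(PAGE_URL, href) for href in relative_hrefs]

for url in absolute_urls:
    print(url)
```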

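*) Stripped-down version. You can take this as an example of how to provide a minimal HTML sample.

**) I can reproduce the problem: @href is printed as an empty string in my Chrome console. Others have run into the same issue: Chrome element inspector Xpath with @href won't show link text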