XPath only matches the first image

Time: 2014-03-01 07:50:49

Tags: python python-2.7 xpath scrapy

I am scraping this page: http://www.propertyfinder.ae/en/buy/villa-for-sale-dubai-jumeirah-park-1849328.html?img/0

I want to get the src of every image inside this element: div[@id='propertyPhoto']

I tried this XPath:

.//div[@id='propertyPhoto']//img/@src

Then I looped over the results to extract each src, but I only get the first image's src.

Please help.

1 answer:

Answer 0 (score: 1)

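Only the main image is inside div#propertyPhoto. The others are in li#propertyPhotoMini0, li#propertyPhotoMini1, ...

So the XPath needs a small change to match both. All of these id attributes start with propertyPhoto, so you can use the following XPath: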

.//*[starts-with(@id, 'propertyPhoto')]//img/@src

Example:

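import urllib  # Python 2; on Python 3, urlopen lives in urllib.request
from scrapy.selector import Selector

url = 'http://www.propertyfinder.ae/en/buy/villa-for-sale-dubai-jumeirah-park-1849328.html?img/0'
# Download the page and parse the response body as HTML
html = urllib.urlopen(url).read()
root = Selector(text=html, type='html')
# Select the img src under every element whose id starts with 'propertyPhoto'
for src in root.xpath(".//*[starts-with(@id, 'propertyPhoto')]//img/@src").extract():
    print(src)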

Output:

http://c1369023.r23.cf3.rackcdn.com/1849328-1-wide.jpg
http://c1369023.r23.cf3.rackcdn.com/1849328-1-mini.jpg
http://c1369023.r23.cf3.rackcdn.com/1849328-2-mini.jpg
http://c1369023.r23.cf3.rackcdn.com/1849328-3-mini.jpg
http://c1369023.r23.cf3.rackcdn.com/1849328-4-mini.jpg
http://c1369023.r23.cf3.rackcdn.com/1849328-5-mini.jpg
http://c1369023.r23.cf3.rackcdn.com/1849328-6-mini.jpg
http://c1369023.r23.cf3.rackcdn.com/1849328-7-mini.jpg
http://c1369023.r23.cf3.rackcdn.com/1849328-8-mini.jpg
http://c1369023.r23.cf3.rackcdn.com/1849328-9-mini.jpg
http://c1369023.r23.cf3.rackcdn.com/1849328-10-mini.jpg
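
To see the difference between the two selectors without fetching the page, here is a minimal offline sketch. The SAMPLE markup is a hypothetical, simplified stand-in for the real gallery (the actual page structure may differ), and the helper name srcs_with_id_prefix is made up for illustration. It uses the standard library's ElementTree instead of Scrapy; ElementTree's XPath subset has no starts-with(), so the prefix test is done in Python rather than in the expression itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page's gallery markup.
SAMPLE = """
<html><body>
  <div id="propertyPhoto"><img src="1849328-1-wide.jpg"/></div>
  <ul>
    <li id="propertyPhotoMini0"><img src="1849328-1-mini.jpg"/></li>
    <li id="propertyPhotoMini1"><img src="1849328-2-mini.jpg"/></li>
  </ul>
  <div id="footer"><img src="logo.png"/></div>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# The question's expression: only images under the element whose id is
# exactly 'propertyPhoto', which is just the main image.
exact = [img.get("src")
         for img in root.findall(".//div[@id='propertyPhoto']//img")]


def srcs_with_id_prefix(tree, prefix):
    """Collect img src values under any element whose id starts with prefix.

    Emulates starts-with(@id, prefix) in plain Python. Unlike a real XPath
    node-set, nested matching elements would yield duplicates here.
    """
    found = []
    for element in tree.iter():
        if element.get("id", "").startswith(prefix):
            found.extend(img.get("src") for img in element.iter("img"))
    return found


prefixed = srcs_with_id_prefix(root, "propertyPhoto")

print(exact)
print(prefixed)
```

On this sample, the exact-id selector yields only the wide image, while the prefix match also picks up the thumbnails and still skips the unrelated footer image.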