I am unable to scrape data from the following URLs. When I try, it returns some unrelated data on my machine.
Code:
hxs.select('//h3[@class="newaps"]/a/@href').extract()
Expected output:
For URL 1 in my code below, my expected output is:
1. http://www.amazon.com/Samsung-RF4287-Refrigerator-Integrated-Stainless/dp/B003M5L284/ref=sr_1_2/189-7301776-6362144?ie=UTF8&qid=1391314328&sr=8-2&keywords=samsung+appliances
2. http://www.amazon.com/Samsung-RF4289HARS/dp/B004XQHBHC/ref=sr_1_3/189-7301776-6362144?ie=UTF8&qid=1391314328&sr=8-3&keywords=samsung+appliances
3. http://www.amazon.com/Samsung-DC47-00019A-Heating-Element/dp/B001ICYB2M/ref=sr_1_4/189-7301776-6362144?ie=UTF8&qid=1391314328&sr=8-4&keywords=samsung+appliances
etc.
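The hrefs that the selector above returns can be relative paths, so they have to be joined against the page URL to get absolute product URLs like the ones listed. A minimal sketch of that step (the page URL and hrefs here are hypothetical stand-ins for real selector results):

```python
# Joining extracted hrefs against the page URL, so relative
# links become absolute product URLs. Works on Python 2 and 3.
try:
    from urlparse import urljoin          # Python 2
except ImportError:
    from urllib.parse import urljoin      # Python 3

page_url = "http://www.amazon.com/s/ref=sr_pg_2"   # hypothetical page URL
hrefs = [
    "/Samsung-RF4287/dp/B003M5L284",               # hypothetical relative href
    "http://www.amazon.com/dp/B004XQHBHC",         # already absolute: kept as-is
]
absolute = [urljoin(page_url, h) for h in hrefs]
```

`urljoin` leaves already-absolute URLs untouched and resolves relative paths against the base, so the same call handles both cases.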
I need the above for URL 1, and then I also need the same for URL 2.
I need the results for both URL 1 and URL 2.
Here is my code:
from scrapy.spider import BaseSpider
from scrapy.http import Request
from urlparse import urljoin
from scrapy.selector import HtmlXPathSelector
import inspect
from amazon.items import AmazonItem

class amzspider(BaseSpider):
    name = "amz"
    start_urls = ["http://www.amazon.com/s/ref=sr_pg_2?rh=i%3Aaps%2Ck%3Asamsung+appliances&page=2&keywords=samsung+appliances&ie=UTF8&qid=1386153209"]
    print start_urls

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        ul = hxs.select('//div/ul[@class="rsltGridList grey"]').extract()
        l = len(hxs.select('//h3[@class="newaps"]/a/@href').extract())
        x = []
        x1 = []
        url1 = []
        for i in range(l):
            x1.append(hxs.select('//h3[@class="newaps"]/a/@href').extract()[i].encode('utf-8').strip())
        print "URL parsed"
        for i in range(l):
            url1.append(urljoin(response.url, x1[i]))
        for i in range(l):
            if url1[i]:
                yield Request(url1[i], callback=self.parse_sub)
        # the last results page has no "next" link, so guard against an empty list
        next_links = hxs.select('//a[@id="pagnNextLink"]/@href').extract()
        if next_links:
            r = next_links[0].encode('utf-8')
            yield Request(urljoin(response.url, r), callback=self.parse)

    def parse_sub(self, response):
        print "sub called"
        itm = []
        # item = response.meta.get('item')
        item = AmazonItem()
        hxs = HtmlXPathSelector(response)
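One spot in the spider worth guarding separately: `.extract()` returns a list, so indexing `[0]` on the `pagnNextLink` result raises IndexError on the last results page, where there is no next link. A minimal sketch of the guard pattern, using plain lists in place of real selector results:

```python
def first_or_none(extracted):
    # extracted mimics the list returned by hxs.select(...).extract();
    # take the first element only when the list is non-empty.
    return extracted[0] if extracted else None

# hypothetical selector results
next_page = first_or_none(["/s/page=3"])   # a next link exists
last_page = first_or_none([])              # last page: no next link, no crash
```

With this pattern the `if r:` check in `parse` actually sees `None` on the last page instead of the spider dying with an IndexError before the check is reached.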