I am working on a Scrapy project to collect all the URLs from this page and the pages that follow, but when I run the spider I only get one URL per page! I wrote a for loop to collect them, but nothing changed. I also need to put each advert's data on its own row in a CSV file; how can I do that?
Spider code:
import datetime
import urlparse
import socket
import re

from scrapy.loader.processors import MapCompose, Join
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader

from cars2buy.items import Cars2BuyItem


class Cars2buyCarleasingSpider(CrawlSpider):
    name = "cars2buy-carleasing"
    start_urls = ['http://www.cars2buy.co.uk/business-car-leasing/']

    rules = (
        Rule(LinkExtractor(allow=("Abarth"), restrict_xpaths='//*[@id="content"]/div[7]/div[2]/div/a')),
        Rule(LinkExtractor(allow=("695C"), restrict_xpaths='//*[@id="content"]/div/div/p/a'), callback='parse_item', follow=True),
        Rule(LinkExtractor(restrict_xpaths='//*[@class="next"]'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for l in response.xpath('//*[@class="viewthiscar"]/@href'):
            item = Cars2BuyItem()
            item['Company'] = l.extract()
            item['url'] = response.url
            return item
The output is:
> 2017-04-27 20:22:39 [scrapy.core.scraper] DEBUG: Scraped from <200
> http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/>
> {'Company':
> u'/clicks_cache_car_lease.php?url=http%3A%2F%2Fwww.fleetprices.co.uk%2Fbusiness-lease-cars%2Fabarth%2F695-cabriolet%2F14-t-jet-165-xsr-2dr-204097572&broker=178&veh_id=901651523&type=business&make=Abarth&model=695C&der=1.4
> T-Jet 165 XSR 2dr', 'url':
> 'http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/'}
> 2017-04-27 20:22:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET
> http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/?leaf=2>
> (referer: http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/)
> 2017-04-27 20:22:40 [scrapy.core.scraper] DEBUG: Scraped from <200
> http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/?leaf=2>
> {'Company':
> u'/clicks_cache_car_lease.php?url=http%3A%2F%2Fwww.jgleasing.co.uk%2Fbusiness-lease-cars%2Fabarth%2F695-cabriolet%2F14-t-jet-165-xsr-2dr-207378762&broker=248&veh_id=902250527&type=business&make=Abarth&model=695C&der=1.4
> T-Jet 165 XSR 2dr', 'url':
> 'http://www.cars2buy.co.uk/business-car-leasing/Abarth/695C/?leaf=2'}
> 2017-04-27 20:22:40 [scrapy.core.engine] INFO: Closing spider
> (finished)
Answer 0 (score: 1)
The problem is that once your for loop has processed the first item it reaches the return, which exits the parse_item method, so no further items are ever processed.
I suggest you replace the return with yield:
def parse_item(self, response):
    for l in response.xpath('//*[@class="viewthiscar"]/@href'):
        item = Cars2BuyItem()
        item['Company'] = l.extract()
        item['url'] = response.url
        yield item