Below is my code. I need to scrape a site that has more than one page, and I'm having a hard time getting the Rule class to work. What am I doing wrong?
#import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import SkodaItem

class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]/a',)), follow=True),
    ]

    # def parse_item(self, response):
    def parse(self, response):
        #self.logger.info('Hi, this is an item page! %s', response.url)
        x = 0
        items = []
        for sel in response.xpath('//*[@id="search-results"]/section[2]/article'):
            x = x + 1
            item = SkodaItem()
            item["title"] = sel.xpath('//*[@id="search-results"]/section[2]/article[' + str(x) + ']/div/div[1]/div[1]/h2/a/span').re('.+>(.+)</span>')
            #print sel.xpath('//*[@id="search-results"]/section[2]/article[' + str(x) + ']/div/div[1]/div[1]/h2/a/span').extract()
            item["leeftijd"] = sel.xpath('//*[@id="search-results"]/section[2]/article[' + str(x) + ']/div/div[1]/div[2]/span[1]').re('.+">(.+)</span>')
            item["prijs"] = sel.xpath('//*[@id="search-results"]/section[2]/article[' + str(x) + ']/div/div[2]/div[1]/div/div').re('.+\n +(.+)\n.+')
            item["km"] = sel.xpath('//*[@id="search-results"]/section[2]/article[' + str(x) + ']/div/div[1]/div[2]/span[3]').re('.+">(.+)</span>')

            # handle output (print or save to database)
            items.append(item)
            print item["title"], item["leeftijd"], item["prijs"], item["km"]
Answer:
A few things to change:

- When using CrawlSpider, you should not redefine the parse method; all the "magic" of this special spider type depends on it. As the docs say: "When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work."
- Fix the restrict_xpaths expression by dropping the trailing /a (an <a> inside an <a> will not match any element).
- CrawlSpider rules need a callback method to actually process the pages they follow.
- To also process the URLs in start_urls, you need to define a parse_start_url method.

Here is a minimal CrawlSpider that follows the 3 pages from your example input and prints out how many "articles" each page has:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]',)),
             follow=True,
             callback='parse_page'),
    ]

    def parse_page(self, response):
        articles = response.css('#search-results > section + section > article')
        self.logger.info('%d articles' % len(articles))

    # define this, otherwise "parse_page" will not be called for the URLs in start_urls
    parse_start_url = parse_page
Output:
$ scrapy runspider 001.py
2016-02-09 11:07:16 [scrapy] INFO: Scrapy 1.0.4 started (bot: scrapybot)
2016-02-09 11:07:16 [scrapy] INFO: Optional features available: ssl, http11
2016-02-09 11:07:16 [scrapy] INFO: Overridden settings: {}
2016-02-09 11:07:16 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-02-09 11:07:16 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-02-09 11:07:16 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-02-09 11:07:16 [scrapy] INFO: Enabled item pipelines:
2016-02-09 11:07:16 [scrapy] INFO: Spider opened
2016-02-09 11:07:16 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-02-09 11:07:16 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-02-09 11:07:16 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always> (referer: None)
2016-02-09 11:07:16 [skodas] INFO: 32 articles
2016-02-09 11:07:17 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=S%2C1185+S%2C484+M%2C11564&categoryId=151&currentPage=2&mileageTo=150.000&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010> (referer: http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always)
2016-02-09 11:07:17 [skodas] INFO: 30 articles
2016-02-09 11:07:17 [scrapy] DEBUG: Crawled (200) <GET http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=S%2C1185+S%2C484+M%2C11564&categoryId=151&currentPage=3&mileageTo=150.000&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010> (referer: http://www.marktplaats.nl/z/auto-s/skoda.html?attributes=S%2C1185+S%2C484+M%2C11564&categoryId=151&currentPage=2&mileageTo=150.000&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010)
2016-02-09 11:07:17 [skodas] INFO: 7 articles
2016-02-09 11:07:17 [scrapy] INFO: Closing spider (finished)
2016-02-09 11:07:17 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1919,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 96682,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 2, 9, 10, 7, 17, 638179),
'log_count/DEBUG': 4,
'log_count/INFO': 10,
'request_depth_max': 2,
'response_received_count': 3,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2016, 2, 9, 10, 7, 16, 452272)}
2016-02-09 11:07:17 [scrapy] INFO: Spider closed (finished)
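
To tie this back to the original goal of collecting SkodaItem fields from every page, the same callback can build and yield the items. The sketch below assumes the fields from the question (title, leeftijd, prijs, km); the relative selectors inside the loop are illustrative placeholders rather than the real marktplaats.nl markup, so they would need adjusting. The point is to select each article element once and query with relative paths (starting with .//) instead of rebuilding absolute XPaths with a counter:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from tutorial.items import SkodaItem


class SkodaSpider(CrawlSpider):
    name = "skodas"
    allowed_domains = ["marktplaats.nl"]
    start_urls = [
        # same start URL as above
        "http://www.marktplaats.nl/z/auto-s/skoda/octavia-trekhaak-stationwagon.html?categoryId=151&priceFrom=1.000%2C00&priceTo=15.000%2C00&yearFrom=2010&mileageTo=150.000&attributes=S%2C1185&attributes=S%2C484&attributes=M%2C11564&startDateFrom=always"
    ]

    rules = [
        Rule(LinkExtractor(restrict_xpaths=('//a[@class="button secondary medium pagination-next"]',)),
             follow=True,
             callback='parse_page'),
    ]

    def parse_page(self, response):
        # iterate over one article at a time and query relative to it
        for article in response.css('#search-results > section + section > article'):
            item = SkodaItem()
            # illustrative selectors -- adjust them to the actual page structure
            item["title"] = article.xpath('.//h2/a/span/text()').extract()
            item["leeftijd"] = article.xpath('.//div[2]/span[1]/text()').extract()
            item["prijs"] = article.xpath('.//div[2]/div[1]/div/div/text()').extract()
            item["km"] = article.xpath('.//div[2]/span[3]/text()').extract()
            yield item

    # also run the extraction on the first page (start_urls)
    parse_start_url = parse_page

Because Scrapy collects whatever the callback yields, running this with something like scrapy runspider spider.py -o items.json would save the scraped items directly, without the manual items list and print statements from the question.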