I'm having trouble using a Scrapy CrawlSpider to crawl a JavaScript-heavy website. Scrapy seems to ignore the rules and just carries on with a normal crawl.
Is there a way to tell the spider to crawl through Splash?
Thanks.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest


class MySpider(CrawlSpider):
    name = 'booki'
    start_urls = [
        'https://worldmap.com/listings/in/united-states/',
    ]

    rules = (
        # Extract links matching 'catalogue/category' (but not 'subsection.php')
        # and follow links from them (no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'catalogue/category',), deny=(r'subsection\.php',))),
        # Extract the remaining 'catalogue' links and parse them with first_tier.
        Rule(LinkExtractor(allow=(r'catalogue',), deny=(r'catalogue/category',)), callback='first_tier'),
    )

    custom_settings = {
        # 'DOWNLOAD_DELAY': 2,
        'SPLASH_URL': 'http://localhost:8050',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
        'DOWNLOAD_DELAY': 8,
        'ITEM_PIPELINES': {
            'bookstoscrap.pipelines.BookstoscrapPipeline': 300,
        },
    }

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.first_tier,
                                endpoint='render.html',
                                args={'wait': 3.5},
                                )
Answer (score: 1)
The rules only fire once you actually reach a matching page after start_requests. You also need to define a callback function for your rules, otherwise they will try to use the default parse (which may be why it looks like the rules are doing nothing).
To turn a rule's requests into SplashRequest, you have to return one from a process_request callback. For example:
class MySpider(CrawlSpider):
    # ...
    rules = (
        Rule(
            LinkExtractor(allow=(r'catalogue/category',), deny=(r'subsection\.php',)),
            process_request='splash_request'
        ),
        Rule(
            LinkExtractor(allow=(r'catalogue',), deny=(r'catalogue/category',)),
            callback='first_tier',
            process_request='splash_request'
        ),
    )
    # ...

    def splash_request(self, request):
        return SplashRequest(
            request.url,
            callback=request.callback,
            endpoint='render.html',
            args={'wait': 3.5},
        )
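
For reference, here is a minimal self-contained sketch of how these pieces could fit together. It assumes scrapy-splash is installed, Splash is reachable at the SPLASH_URL from the question's custom_settings, and first_tier is only a placeholder callback; the optional response argument on splash_request is there because Scrapy 2.0+ also passes the originating response to process_request:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest


class MySpider(CrawlSpider):
    name = 'booki'
    start_urls = ['https://worldmap.com/listings/in/united-states/']

    rules = (
        # Category pages: just follow their links (no callback => follow=True).
        Rule(
            LinkExtractor(allow=(r'catalogue/category',), deny=(r'subsection\.php',)),
            process_request='splash_request',
        ),
        # Remaining catalogue pages: render through Splash, parse with first_tier.
        Rule(
            LinkExtractor(allow=(r'catalogue',), deny=(r'catalogue/category',)),
            callback='first_tier',
            process_request='splash_request',
        ),
    )

    def start_requests(self):
        # Hand the rendered start page to CrawlSpider's built-in parse so the
        # rules get a chance to extract links, instead of jumping to first_tier.
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, endpoint='render.html', args={'wait': 3.5})

    def splash_request(self, request, response=None):
        # Re-issue the rule's request through Splash, keeping whatever callback
        # the Rule attached to it.
        return SplashRequest(
            request.url,
            callback=request.callback,
            endpoint='render.html',
            args={'wait': 3.5},
        )

    def first_tier(self, response):
        # Placeholder: extract your items from the JavaScript-rendered page here.
        self.logger.info('Rendered page: %s', response.url)

The custom_settings from the question (Splash middlewares, dupefilter, pipelines) are still needed for this to run; they are omitted here only for brevity.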