I can't crawl the whole website; Scrapy only scratches the surface, and I want to crawl deeper. I've been googling for the past 5-6 hours and nothing helps. My code is below:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log

class ExampleSpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    rules = [
        Rule(SgmlLinkExtractor(allow=()), follow=True),
        Rule(SgmlLinkExtractor(allow=()), callback='parse_item'),
    ]

    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)
Answer 0 (score: 6)
Rules short-circuit: the first rule that a link satisfies is the one that gets applied, so your second rule (the one with the callback) is never invoked.

Change your rules to:
rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True)]
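Note that SgmlLinkExtractor and the whole scrapy.contrib package were later deprecated and removed. A minimal sketch of the same spider for modern Scrapy (1.0+), using LinkExtractor instead, might look like this:

# Sketch for modern Scrapy (1.0+); scrapy.contrib and SgmlLinkExtractor no longer exist.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    # A single rule that both follows links and invokes the callback,
    # so every followed page is also scraped.
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]

    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)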
Answer 1 (score: 2)
When parsing the start_urls, deeper URLs can be extracted from the href attributes of the page's <a> tags, and deeper requests can then be yielded from within the parse() function. Here is a simple example. The most important source code is shown below:
from scrapy.spiders import Spider
from tutsplus.items import TutsplusItem
from scrapy.http import Request
import re

class MySpider(Spider):
    name = "tutsplus"
    allowed_domains = ["code.tutsplus.com"]
    start_urls = ["http://code.tutsplus.com/"]

    def parse(self, response):
        links = response.xpath('//a/@href').extract()

        # We store already crawled links in this list
        crawledLinks = []

        # Pattern to check for a proper link
        # I only want to get tutorial posts
        linkPattern = re.compile(r"^\/tutorials\?page=\d+")

        for link in links:
            # If it is a proper link and not checked yet, yield it to the Spider
            if linkPattern.match(link) and link not in crawledLinks:
                link = "http://code.tutsplus.com" + link
                crawledLinks.append(link)
                yield Request(link, self.parse)

        titles = response.xpath('//a[contains(@class, "posts__post-title")]/h1/text()').extract()
        for title in titles:
            item = TutsplusItem()
            item["title"] = title
            yield item
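The TutsplusItem imported above is not shown in the answer; it is assumed to be defined in the project's items.py. A minimal sketch of what it might contain, given that the spider only populates a "title" field:

# Hypothetical items.py for the example above (not part of the original answer).
import scrapy

class TutsplusItem(scrapy.Item):
    # The spider only sets item["title"], so one field is enough here.
    title = scrapy.Field()

The spider can then be run from the project root with something like `scrapy crawl tutsplus -o titles.json`, which writes each yielded item to titles.json.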