I have a question about Scrapy spiders. Suppose I have this code:
class MySpider(CrawlSpider):
    name = 'myspider'
    allowed_domains = ['domain.com']
    start_urls = ['http://www.domain.com/foo/']

    rules = (
        Rule(SgmlLinkExtractor(allow=r'-\w+.html$'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
I would like to know: does the spider first go to the start URL, parse that page, and then use the rules to extract links? Or does it skip parsing the first page and start with the rules instead? I have seen that if my rules don't match I get no results at all, but shouldn't the start page at least be parsed?
Answer 0 (score: 3)
I was working through the sample tutorial written by Michael Herman (https://github.com/mjhea0/Scrapy-Samples), which starts with a BaseSpider example and progresses to a CrawlSpider example. The first example was no big deal, but the second one did not scrape the first page — only the second — and I couldn't figure out what I was doing wrong. However, when I ran the code from GitHub, I realized his code didn't scrape the first page either! I figured this had to do with the different intents of CrawlSpider and BaseSpider, and after some research I came up with this:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from craigslist.items import CraigslistItem
from scrapy.http import Request


class MySpider(CrawlSpider):
    name = "CraigslistSpider"
    allowed_domains = ["craigslist.org"]
    start_urls = ["http://annapolis.craigslist.org/sof/"]

    rules = (
        Rule(SgmlLinkExtractor(allow=(r"index\d00\.html",),
                               restrict_xpaths=('//p[@id="nextpage"]',)),
             callback="parse_items", follow=True),
    )

    #
    # Need to scrape the first page... so we hack it by creating a request
    # and sending the request to the parse_items callback
    #
    def parse_start_url(self, response):
        print('**********************')
        request = Request("http://annapolis.craigslist.org/sof/", callback=self.parse_items)
        return request

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        items = []
        for title in titles:
            item = CraigslistItem()
            item["title"] = title.select("a/text()").extract()
            item["link"] = title.select("a/@href").extract()
            items.append(item)
        return items
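As a quick sanity check on the rule above, the `allow` pattern only matches the pagination pages, not the listing root. A small standalone sketch using Python's `re` module (the URLs are illustrative):

```python
import re

# The answer's allow pattern: matches pagination pages like index100.html,
# index200.html, ... but not the listing root itself.
pattern = re.compile(r"index\d00\.html")

print(bool(pattern.search("http://annapolis.craigslist.org/sof/index100.html")))  # True
print(bool(pattern.search("http://annapolis.craigslist.org/sof/")))               # False
```

Note that the `\.` matters: an unescaped dot would also match strings like `index100xhtml`.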
In my case, since I'm using CrawlSpider, I had to implement parse_start_url to create a request object for the same URL that is in start_urls (i.e., the first page). Crawling then starts with the first page. By the way, I've only been at Scrapy and Python for three days!
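The underlying behavior: CrawlSpider hands the responses for start_urls to parse_start_url, whose default implementation returns nothing, and only links extracted by the rules reach the rule callbacks — which is why the first page appeared to be skipped. Rather than issuing a second Request for a URL that was just downloaded (an extra fetch, which may even be dropped by Scrapy's duplicate filter), parse_start_url can simply delegate to the same callback. A minimal plain-Python sketch of that idea, using stand-in classes (FakeResponse and the extraction logic are illustrative, not Scrapy's API):

```python
class FakeResponse:
    """Stand-in for a Scrapy response, just enough for this sketch."""
    def __init__(self, url, titles):
        self.url = url
        self.titles = titles


class SpiderSketch:
    """Models delegating the start page to the same callback the rules use."""

    def parse_items(self, response):
        # The extraction logic that runs on every rule-matched page.
        return [{"title": t, "link": response.url} for t in response.titles]

    def parse_start_url(self, response):
        # Delegate: the start page is parsed exactly like rule-matched pages,
        # with no second network request.
        return self.parse_items(response)


items = SpiderSketch().parse_start_url(FakeResponse("http://example.org/", ["a", "b"]))
print(len(items))  # 2
```

In a real spider the delegation is just `return self.parse_items(response)` inside `parse_start_url`.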