I am very new to Scrapy, and I have not used regular expressions before. The following is my spider.py code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class ExampleSpider(BaseSpider):
    name = "test_code"
    allowed_domains = ["www.example.com"]
    start_urls = [
        "http://www.example.com/bookstore/new/1?filter=bookstore",
        "http://www.example.com/bookstore/new/2?filter=bookstore",
        "http://www.example.com/bookstore/new/3?filter=bookstore",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
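Now, if we look at start_urls, all three URLs are identical except for the integer value (2?, 3? and so on). The number of pages is unlimited; it depends on the URLs present on the site. I know that we can use CrawlSpider and build a regular expression for the URLs, like below: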
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
import re

class ExampleSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        "http://www.example.com/bookstore/new/1?filter=bookstore",
        "http://www.example.com/bookstore/new/2?filter=bookstore",
        "http://www.example.com/bookstore/new/3?filter=bookstore",
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=(........),)),
    )

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
Could you guide me on how to build a CrawlSpider rule for the start_urls list above?
Answer 0: (score: 4)
If I understand correctly, you want a large number of start URLs that follow a certain pattern.
If so, you can override the BaseSpider.start_requests method:
from scrapy.spider import BaseSpider

class ExampleSpider(BaseSpider):
    name = "test_code"
    allowed_domains = ["www.example.com"]

    def start_requests(self):
        # xrange(1000) yields 0..999; adjust the range to the site's actual page numbers.
        for i in xrange(1000):
            yield self.make_requests_from_url("http://www.example.com/bookstore/new/%d?filter=bookstore" % i)

    ...
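The URL pattern used in start_requests can be tried out on its own before it goes into a spider. The following is a minimal sketch in plain Python with no Scrapy dependency; URL_TEMPLATE and build_page_urls are hypothetical names introduced here for illustration, and the page range is an assumption about the site.

```python
# Hypothetical helper, not part of Scrapy: builds the paginated URLs as plain strings.
URL_TEMPLATE = "http://www.example.com/bookstore/new/%d?filter=bookstore"


def build_page_urls(first_page, last_page):
    """Return the listing URLs for pages first_page..last_page (inclusive)."""
    return [URL_TEMPLATE % page for page in range(first_page, last_page + 1)]


# Assumed range: pages 1 to 3, matching the three URLs in the question.
urls = build_page_urls(1, 3)
for url in urls:
    print(url)
```

Inside a spider, the same list could be yielded one request at a time from start_requests, as in the answer above.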
Answer 1: (score: 0)
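If you are using CrawlSpider, overriding the parse method is usually not a good idea, because CrawlSpider uses parse itself to apply its rules.
Rule objects can separate the URLs you care about from the ones you don't.
See CrawlSpider in the documentation for reference.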
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
import re

class ExampleSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/bookstore']

    rules = (
        # Extract links such as /new/1?, /new/2?, ... and pass each page to parse_bookstore.
        # [0-9]+ is used so that multi-digit page numbers match as well.
        Rule(SgmlLinkExtractor(allow=(r'/new/[0-9]+\?',)), callback='parse_bookstore'),
    )

    def parse_bookstore(self, response):
        hxs = HtmlXPathSelector(response)
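To get a feel for how an allow pattern of this kind behaves, it can be tried outside Scrapy first. The sketch below uses only the standard re module and assumes the pattern is applied as a search against the absolute URL; the allowed helper and the sample URLs are made up for illustration. It compares a single-digit class, [0-9], with the repeated form [0-9]+.

```python
import re

# Two candidate allow patterns: one digit only vs. one or more digits.
SINGLE_DIGIT = re.compile(r"/new/[0-9]\?")
ANY_DIGITS = re.compile(r"/new/[0-9]+\?")


def allowed(pattern, url):
    """Hypothetical helper: True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


# Sample URLs (assumed for illustration).
page_3 = "http://www.example.com/bookstore/new/3?filter=bookstore"
page_12 = "http://www.example.com/bookstore/new/12?filter=bookstore"
other = "http://www.example.com/bookstore/about"

for url in (page_3, page_12, other):
    print(url, allowed(SINGLE_DIGIT, url), allowed(ANY_DIGITS, url))
```

With a single [0-9], a page number of two or more digits is not followed directly by the question mark, so such URLs would be skipped; [0-9]+ covers both cases.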