I'm using this CrawlSpider example as the "backbone" of my crawler.
I want to implement this idea:
The first rule follows links. Then the matched links are passed on to the second rule, where the second rule matches new links against its own pattern and calls its callback on them.
For example, I have these rules:
...
start_urls = ['http://play.google.com/store']

rules = (
    Rule(SgmlLinkExtractor(allow=('/store/apps',))),
    Rule(SgmlLinkExtractor(allow=('/details\?id=',)), callback='parse_app'),
)
...
How I expect the parser to work:
1. Open http://play.google.com/store and match the first URL, https://play.google.com/store/apps/category/SHOPPING/collection/topselling_free
2. Pass the found URL (https://play.google.com/store/apps/category/SHOPPING/collection/topselling_free) to the second rule
3. The second rule tries to match its pattern (allow=('.*/details\?id=',)) and, if it matches, calls the callback 'parse_app' for that URL.
At the moment, the crawler just traverses all the links without parsing anything.
Answer 0 (score: 1):
As 徐家万 hinted, URLs matching /details\?id= also match /store/apps (from what I can see). Since CrawlSpider hands each link to the first rule whose extractor matches it, try changing the order of the rules so that the parse_app rule matches first:
rules = (
    Rule(SgmlLinkExtractor(allow=('/details\?id=',)), callback='parse_app'),
    Rule(SgmlLinkExtractor(allow=('/store/apps',))),
)
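To see why the order matters: an app details URL contains both substrings, so with the original order the first (callback-less) rule claims the link before the parse_app rule is ever consulted. A quick check, using a hypothetical details URL:

import re

# hypothetical app details URL, for illustration only
url = 'https://play.google.com/store/apps/details?id=com.example.app'

print(bool(re.search(r'/store/apps', url)))    # True -- the follow-only rule matches
print(bool(re.search(r'/details\?id=', url)))  # True -- but this rule never gets the link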
Or use deny on the first rule, so that details URLs are excluded from it and fall through to the second rule regardless of order:
rules = (
    Rule(SgmlLinkExtractor(allow=('/store/apps',), deny=('/details\?id=',))),
    Rule(SgmlLinkExtractor(allow=('/details\?id=',)), callback='parse_app'),
)
If you want the first Rule() to apply only to http://play.google.com/store, and then have the second Rule() call parse_app, you may need to implement a parse_start_url method that uses SgmlLinkExtractor(allow=('/store/apps',)). Something like this:

from scrapy.http import Request
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item


class PlaystoreSpider(CrawlSpider):
    name = 'playstore'
    #allowed_domains = ['example.com']
    start_urls = ['https://play.google.com/store']

    rules = (
        #Rule(SgmlLinkExtractor(allow=('/store/apps',), deny=('/details\?id=',))),
        Rule(SgmlLinkExtractor(allow=('/details\?id=',)), callback='parse_app'),
    )

    def parse_app(self, response):
        self.log('Hi, this is an app page! %s' % response.url)
        # do something

    def parse_start_url(self, response):
        # extract only the /store/apps category links from the start page
        return [Request(url=link.url)
                for link in SgmlLinkExtractor(
                    allow=('/store/apps',), deny=('/details\?id=',)
                ).extract_links(response)]
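The requests returned by parse_start_url carry no explicit callback, so they go through CrawlSpider's default parse(), which applies the rules to the fetched category pages; the /details\?id= rule then picks up the app links from there. parse_app itself is left as a stub above; as a rough sketch of what it might do (the AppItem fields and the XPath are my own assumptions, not part of the original answer):

from scrapy.item import Item, Field

class AppItem(Item):
    # hypothetical fields, for illustration
    name = Field()
    url = Field()

# inside PlaystoreSpider:
    def parse_app(self, response):
        hxs = HtmlXPathSelector(response)
        item = AppItem()
        # hypothetical XPath; the real Play Store markup may differ
        item['name'] = hxs.select('//h1/text()').extract()
        item['url'] = response.url
        return item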