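I don't understand how Scrapy rules work. Say I want to crawl a website and have the spider follow links that contain "category". I then want to open the URLs that contain "product" and pass them to a callback. How do I write that?
What is wrong with this?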
rules = (
    Rule(SgmlLinkExtractor(allow=r'.*?categoryId.*'), follow=True),
    Rule(SgmlLinkExtractor(allow=r'.*?productId.*'), callback='parse_item'),
)
I get the following error:
Traceback (most recent call last):
  File "/home/scraper/.fakeroot/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "/home/scraper/.fakeroot/lib/python2.7/site-packages/twisted/internet/task.py", line 638, in _tick
    taskObj._oneWorkUnit()
  File "/home/scraper/.fakeroot/lib/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
    result = next(self._iterator)
  File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
    work = (callable(elem, *args, **named) for elem in iterable)
--- <exception caught here> ---
  File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
    yield next(it)
  File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 23, in process_spider_output
    for x in result:
  File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spiders/crawl.py", line 73, in _parse_response
    for request_or_item in self._requests_to_follow(response):
  File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spiders/crawl.py", line 52, in _requests_to_follow
    links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
  File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/linkextractors/sgml.py", line 128, in extract_links
    links = self._extract_links(body, response.url, response.encoding, base_url)
  File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/linkextractors/sgml.py", line 29, in _extract_links
    self.feed(response_text)
  File "/home/scraper/.fakeroot/lib/python2.7/sgmllib.py", line 104, in feed
    self.goahead(0)
  File "/home/scraper/.fakeroot/lib/python2.7/sgmllib.py", line 174, in goahead
    k = self.parse_declaration(i)
  File "/home/scraper/.fakeroot/lib/python2.7/markupbase.py", line 140, in parse_declaration
    "unexpected %r char in declaration" % rawdata[j])
  File "/home/scraper/.fakeroot/lib/python2.7/sgmllib.py", line 111, in error
    raise SGMLParseError(message)
sgmllib.SGMLParseError: unexpected '=' char in declaration
Answer 0 (score: 0)
Try this:
rules = (
    Rule(SgmlLinkExtractor(allow=(r'.*?categoryId.*',)), follow=True),
    Rule(SgmlLinkExtractor(allow=(r'.*?productId.*',)), callback='parse_item'),
)
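
To see which URLs each rule would pick up, here is a minimal stdlib-only sketch. It assumes the link extractor applies every allow pattern with re.search and that a link already claimed by an earlier rule is skipped by later ones (the "if l not in seen" filter visible in the traceback). The classify helper and the example.com URLs are made up for illustration and are not part of Scrapy.

```python
import re

# The two patterns from the rules above. Because matching is assumed to use
# re.search, the leading '.*?' and trailing '.*' are redundant: plain
# r'categoryId' and r'productId' would select the same URLs.
CATEGORY = re.compile(r'.*?categoryId.*')
PRODUCT = re.compile(r'.*?productId.*')


def classify(url):
    """Return what the rules would do with a URL (illustrative helper, not a Scrapy API)."""
    if CATEGORY.search(url):
        return 'follow'        # first rule: follow=True, no callback
    if PRODUCT.search(url):
        return 'parse_item'    # second rule: callback='parse_item'
    return None                # no rule matches, so the link is ignored


if __name__ == '__main__':
    # Hypothetical URLs, made up for this example.
    for url in (
        'http://example.com/list?categoryId=12',
        'http://example.com/item?productId=345',
        'http://example.com/about',
    ):
        print(url, '->', classify(url))
```

Under these assumptions the string form and the one-element tuple form of allow select the same links. Note also that the traceback in the question ends inside sgmllib's parse_declaration, which suggests the exception is raised while the page's markup is being parsed rather than while the patterns are matched against URLs.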