Writing scrapy rules

Posted: 2014-01-13 00:25:26

Tags: python scrapy

I don't understand how scrapy rules work. Suppose I want to crawl a site and have the spider follow links containing "category". Then I want it to open URLs containing "product" and pass them to a callback. How do I write this?

What's wrong with this?

rules = (
    Rule(SgmlLinkExtractor(allow=r'.*?categoryId.*'), follow=True),
    Rule(SgmlLinkExtractor(allow=r'.*?productId.*'), callback='parse_item'),
)
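The `allow` patterns are ordinary regular expressions that the link extractor matches against each extracted URL. A quick way to sanity-check them is with the stdlib `re` module (the URLs below are hypothetical examples, not from the site in question):

```python
import re

# The same patterns used in the rules above.
category = re.compile(r'.*?categoryId.*')
product = re.compile(r'.*?productId.*')

# Hypothetical URLs illustrating which rule each one would trigger.
urls = [
    'http://example.com/list?categoryId=12',   # followed, not parsed
    'http://example.com/item?productId=34',    # passed to parse_item
    'http://example.com/about',                # matched by neither rule
]

for url in urls:
    print(url, bool(category.search(url)), bool(product.search(url)))
```

Since both patterns match, the rules themselves look plausible, which suggests the traceback below comes from parsing the page rather than from the regexes.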

I get the following error:

    Traceback (most recent call last):
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/twisted/internet/base.py", line 824, in runUntilCurrent
        call.func(*call.args, **call.kw)
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/twisted/internet/task.py", line 638, in _tick
        taskObj._oneWorkUnit()
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/twisted/internet/task.py", line 484, in _oneWorkUnit
        result = next(self._iterator)
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/utils/defer.py", line 57, in <genexpr>
        work = (callable(elem, *args, **named) for elem in iterable)
    --- <exception caught here> ---
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/utils/defer.py", line 96, in iter_errback
        yield next(it)
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/offsite.py", line 23, in process_spider_output
        for x in result:
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/referer.py", line 22, in <genexpr>
        return (_set_referer(r) for r in result or ())
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/urllength.py", line 33, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spidermiddleware/depth.py", line 50, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spiders/crawl.py", line 73, in _parse_response
        for request_or_item in self._requests_to_follow(response):
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/spiders/crawl.py", line 52, in _requests_to_follow
        links = [l for l in rule.link_extractor.extract_links(response) if l not in seen]
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/linkextractors/sgml.py", line 128, in extract_links
        links = self._extract_links(body, response.url, response.encoding, base_url)
      File "/home/scraper/.fakeroot/lib/python2.7/site-packages/scrapy/contrib/linkextractors/sgml.py", line 29, in _extract_links
        self.feed(response_text)
      File "/home/scraper/.fakeroot/lib/python2.7/sgmllib.py", line 104, in feed
        self.goahead(0)
      File "/home/scraper/.fakeroot/lib/python2.7/sgmllib.py", line 174, in goahead
        k = self.parse_declaration(i)
      File "/home/scraper/.fakeroot/lib/python2.7/markupbase.py", line 140, in parse_declaration
        "unexpected %r char in declaration" % rawdata[j])
      File "/home/scraper/.fakeroot/lib/python2.7/sgmllib.py", line 111, in error
        raise SGMLParseError(message)
    sgmllib.SGMLParseError: unexpected '=' char in declaration

1 answer:

Answer 0 (score: 0)

Try this, passing the allow patterns as tuples:

rules = (
    Rule(SgmlLinkExtractor(allow=(r'.*?categoryId.*',)), follow=True),
    Rule(SgmlLinkExtractor(allow=(r'.*?productId.*',)), callback='parse_item'),
)