
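I am trying to use scrapy to crawl a phpBB-based forum. My knowledge of scrapy is quite basic (but improving).

Extracting the contents of the first page of a forum thread was more or less easy. My successful scraper was this: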

import scrapy

from ptmya1.items import Ptmya1Item

class bastospider3(scrapy.Spider):
    name = "basto3"
    allowed_domains = ["portierramaryaire.com"]
    start_urls = [
        "http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a"
    ]

    def parse(self, response):
        for sel in response.xpath('//div[2]/div'):
            item = Ptmya1Item()
            item['author'] = sel.xpath('div/div[1]/p/strong/a/text()').extract()
            item['date'] = sel.xpath('div/div[1]/p/text()').extract()
            item['body'] = sel.xpath('div/div[1]/div/text()').extract()
            yield item

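However, when I tried to follow the "next page" link, I failed after many frustrating hours. I would like to show you my attempts in order to ask for advice. Note: I would prefer a solution based on the SgmlLinkExtractor variants, since they are more flexible and powerful, but after so many attempts my priority is simply to get something working.

First, SgmlLinkExtractor with a restricted path (restrict_xpaths). The 'next page' XPath is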

/html/body/div[1]/div[2]/form[1]/fieldset/a

Indeed, I tested in the shell that

response.xpath('//div[2]/form[1]/fieldset/a/@href')[1].extract()

returns the correct value for the "next page" link. However, I want to point out that the XPath above yields TWO links:

 >>> response.xpath('//div[2]/form[1]/fieldset/a/@href').extract()
[u'./search.php?sid=5aa2b92bec28a93c85956e83f2f62c08', u'./viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a&sid=5aa2b92bec28a93c85956e83f2f62c08&start=15']

Thus, my failed scraper was

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from ptmya1.items import Ptmya1Item

class bastospider3(scrapy.Spider):
    name = "basto7"
    allowed_domains = ["portierramaryaire.com"]
    start_urls = [
        "http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a"
    ]

    rules = (
            Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//div[2]/form[1]/fieldset/a/@href')[1],), callback="parse_items", follow= True)
            )

    def parse_item(self, response):
        for sel in response.xpath('//div[2]/div'):
            item = Ptmya1Item()
            item['author'] = sel.xpath('div/div[1]/p/strong/a/text()').extract()
            item['date'] = sel.xpath('div/div[1]/p/text()').extract()
            item['body'] = sel.xpath('div/div[1]/div/text()').extract()
            yield item

Second, SgmlLinkExtractor with allow. More primitive, and also unsuccessful:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from ptmya1.items import Ptmya1Item

class bastospider3(scrapy.Spider):
    name = "basto7"
    allowed_domains = ["portierramaryaire.com"]
    start_urls = [
        "http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a"
    ]

    rules = (
            Rule(SgmlLinkExtractor(allow=(r'viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a&start.',),), callback="parse_items", follow= True)
            )

    def parse_item(self, response):
        for sel in response.xpath('//div[2]/div'):
            item = Ptmya1Item()
            item['author'] = sel.xpath('div/div[1]/p/strong/a/text()').extract()
            item['date'] = sel.xpath('div/div[1]/p/text()').extract()
            item['body'] = sel.xpath('div/div[1]/div/text()').extract()
            yield item

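Finally, I went back to the damned Paleolithic, or its first-tutorial equivalent. I tried to use the loop shown at the end of the beginner tutorial. Another failure: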

import scrapy
import urlparse

from ptmya1.items import Ptmya1Item

class bastospider5(scrapy.Spider):
    name = "basto5"
    allowed_domains = ["portierramaryaire.com"]
    start_urls = [
        "http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a"
    ]

    def parse_articles_follow_next_page(self, response):
        item = Ptmya1Item()
        item['cacho'] = response.xpath('//div[2]/form[1]/fieldset/a/@href').extract()[1][1:] + "http://portierramaryaire.com/foro"
        for sel in response.xpath('//div[2]/div'):
            item['author'] = sel.xpath('div/div[1]/p/strong/a/text()').extract()
            item['date'] = sel.xpath('div/div[1]/p/text()').extract()
            item['body'] = sel.xpath('div/div[1]/div/text()').extract()
            yield item

        next_page = response.xpath('//fieldset/a[@class="right-box right"]')
        if next_page:
           cadenanext = response.xpath('//div[2]/form[1]/fieldset/a/@href').extract()[1][1:]
           url = urlparse.urljoin("http://portierramaryaire.com/foro",cadenanext)
           yield scrapy.Request(url, self.parse_articles_follow_next_page)

In all cases, what I got was a cryptic error message from which I could not get any hint for solving the problem.

2015-10-08 21:24:46 [scrapy] DEBUG: Crawled (200) <GET http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a> (referer: None)
2015-10-08 21:24:46 [scrapy] ERROR: Spider error processing <GET http://portierramaryaire.com/foro/viewtopic.php?f=3&t=3821&st=0&sk=t&sd=a> (referer: None)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 577, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/spiders/__init__.py", line 76, in parse
    raise NotImplementedError
NotImplementedError
2015-10-08 21:24:46 [scrapy] INFO: Closing spider (finished)

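I would really appreciate any advice (or better, a working solution) for this problem. I am completely stuck on this, and no matter how much I read, I cannot find a solution :(

The cryptic error message appears because you have not defined a parse method. That is scrapy's default entry point when it wants to parse a response.

However, you only defined the parse_articles_follow_next_page or parse_item functions, and these are definitely not the parse function.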

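The failure is not caused by the next page but by the first one: scrapy cannot parse the start URL, so your pagination code is never reached in any case. For the "Paleolithic" solution, try renaming your callback (parse_articles_follow_next_page in that version) to parse and run it again.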

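If you use Rule, you need a different spider class. For rules, use CrawlSpider, as you can see in the tutorials. In that case, do not override the parse method; use a separate callback such as parse_items, as you already do. This is because CrawlSpider uses parse internally to forward responses to the callback methods.

Recursively scraping a phpBB forum with scrapy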
