How do I get recursive crawling to work?

Time: 2014-01-09 20:25:51

Tags: python scrapy

My goal is to scrape a list of URLs and titles from a site as part of a larger project, which is what drove me to learn Scrapy. Right now, scraping the first page for a given date (URLs in the format /archive/date/) with a BaseSpider works fine. However, trying to use a CrawlSpider (working from a few tutorials) to scrape every sequential page for a given date isn't working, and I can't figure out why. I've tried a number of solutions.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from physurlfetch.items import PhysurlfetchItem
from scrapy.http import Request

class PhysURLSpider(CrawlSpider):
    date = raw_input("Please input a date in the format M-DD-YYYY: ")
    name = "PhysURLCrawlSpider"
    allowed_domains = "phys.org"
    start_url_str = ("http://phys.org/archive/%s/") % (date)
    start_urls = [start_url_str]

    rules = (
        Rule (SgmlLinkExtractor(allow=("\d\.html",)), 
        callback="parse_items", follow = True),
    )

    #def parse_start_url(self, response):
        #request = Request(start_urls, callback = self.parse_items)


    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//article[@class='news-box news-detail-box     clearfix']/h4")
        items = []
        for titles in titles:
            item = PhysurlfetchItem()
            item ["title"] = titles.select("a/text()").extract()
            item ["link"] = titles.select("a/@href").extract()
            items.append(item)
        return items

I currently have parse_start_url commented out because the approach I was attempting with start_urls (using the changing date string) kept failing. Running this as-is jumps straight to page 2 for the given date without scraping any data from page 1, and then stops (no page-2 data, no page 3).

1 Answer:

Answer 0 (score: 1):

When I run your spider locally (with scrapy runspider yourspider.py) I get this console output:

2014-01-10 13:30:19+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/> (referer: None)
2014-01-10 13:30:19+0100 [PhysURLCrawlSpider] DEBUG: Filtered offsite request to 'phys.org': <GET http://phys.org/archive/5-12-2013/page2.html>
2014-01-10 13:30:19+0100 [PhysURLCrawlSpider] INFO: Closing spider (finished)

You can see that Scrapy filtered the request as offsite. In fact, allowed_domains should be a list of domains, so if you change it to allowed_domains = ["phys.org"] you get further:

2014-01-10 13:32:00+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/> (referer: None)
2014-01-10 13:32:00+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page2.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:00+0100 [PhysURLCrawlSpider] DEBUG: Filtered duplicate request: <GET http://phys.org/archive/5-12-2013/page3.html> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page8.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page6.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Redirecting (301) to <GET http://phys.org/archive/5-12-2013/> from <GET http://phys.org/archive/5-12-2013/page1.html>
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page4.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page7.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page5.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page3.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/> (referer: http://phys.org/archive/5-12-2013/page2.html)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] INFO: Closing spider (finished)

But the spider is not picking up any items. It may or may not be a copy/paste error, but the XPath expression for titles should be //article[@class='news-box news-detail-box clearfix']/h4, i.e. without the extra whitespace before clearfix.
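
As an aside, if a single class is enough to identify these articles (an assumption about the page markup, not something stated in the question), a less brittle XPath is to use contains() on the class attribute instead of matching the full class string exactly:

# Hypothetical variant of the question's selector: contains() tolerates extra
# classes and whitespace in the class attribute, at the cost of also matching
# class values that merely contain "news-detail-box".
titles = hxs.select("//article[contains(@class, 'news-detail-box')]/h4")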

Finally, note that if you use the latest Scrapy version (0.20.0 onwards), you can use CSS selectors, which may be more readable than XPath when selecting elements with multiple classes:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from physurlfetch.items import PhysurlfetchItem
from scrapy.http import Request

class PhysURLSpider(CrawlSpider):
    date = raw_input("Please input a date in the format M-DD-YYYY: ")
    name = "PhysURLCrawlSpider"
    allowed_domains = ["phys.org"]
    start_url_str = ("http://phys.org/archive/%s/") % (date)
    start_urls = [start_url_str]

    rules = (
        Rule (SgmlLinkExtractor(allow=("\d\.html",)),
        callback="parse_items", follow = True),
    )

    #def parse_start_url(self, response):
        #request = Request(start_urls, callback = self.parse_items)


    def parse_items(self, response):
        selector = Selector(response)

        # selecting only using "news-detail-box" class
        # you could use "article.news-box.news-detail-box.clearfix > h4"
        titles = selector.css("article.news-detail-box > h4")

        items = []
        for title in titles:
            item = PhysurlfetchItem()
            item["title"] = title.xpath("a/text()").extract()
            item["link"] = title.xpath("a/@href").extract()
            items.append(item)
        self.log("%d items in %s" % (len(items), response.url))
        return items
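
As a closing note on the page-1 issue mentioned in the question: a CrawlSpider's rules only apply to links extracted from downloaded responses, so the start URL itself is never passed to parse_items. A minimal sketch, assuming the commented-out parse_start_url was meant to cover the first archive page, is to override it and delegate to parse_items:

    def parse_start_url(self, response):
        # CrawlSpider calls parse_start_url() for each response in start_urls;
        # delegating to parse_items() lets the first archive page yield items
        # too, while the Rule keeps following page2.html, page3.html, ...
        return self.parse_items(response)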