How do I get recursive crawling to work?

Time: 2014-01-09 20:25:51

Tags: python scrapy

My goal is to scrape a list of URLs and titles from a site as part of a larger project, which is what drove me to learn Scrapy. Right now, scraping the first page for a given date (URLs in the format /archive/date/) with a BaseSpider works fine. However, trying to use a CrawlSpider (working from a few tutorials) to scrape every sequential page for a given date isn't working, and I can't figure out why. I've tried a number of solutions.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from physurlfetch.items import PhysurlfetchItem
from scrapy.http import Request

class PhysURLSpider(CrawlSpider):
    date = raw_input("Please input a date in the format M-DD-YYYY: ")
    name = "PhysURLCrawlSpider"
    allowed_domains = "phys.org"
    start_url_str = ("http://phys.org/archive/%s/") % (date)
    start_urls = [start_url_str]

    rules = (
        Rule (SgmlLinkExtractor(allow=("\d\.html",)), 
        callback="parse_items", follow = True),
    )

    #def parse_start_url(self, response):
        #request = Request(start_urls, callback = self.parse_items)


    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//article[@class='news-box news-detail-box     clearfix']/h4")
        items = []
        for titles in titles:
            item = PhysurlfetchItem()
            item ["title"] = titles.select("a/text()").extract()
            item ["link"] = titles.select("a/@href").extract()
            items.append(item)
        return items

I currently have parse_start_url commented out because the approach I was attempting with start_urls (using the changing date string) kept failing. Running this as-is jumps straight to page 2 for the given date without scraping any data from page 1, and then stops (no page-2 data, no page 3).

1 Answer:

Answer 0 (score: 1):

When I run your spider locally (with scrapy runspider yourspider.py) I get this console output:

2014-01-10 13:30:19+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/> (referer: None)
2014-01-10 13:30:19+0100 [PhysURLCrawlSpider] DEBUG: Filtered offsite request to 'phys.org': <GET http://phys.org/archive/5-12-2013/page2.html>
2014-01-10 13:30:19+0100 [PhysURLCrawlSpider] INFO: Closing spider (finished)

You can see that Scrapy filtered the request as offsite. In fact, allowed_domains should be a list of domains, so if you change it to allowed_domains = ["phys.org"] you get further:

2014-01-10 13:32:00+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/> (referer: None)
2014-01-10 13:32:00+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page2.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:00+0100 [PhysURLCrawlSpider] DEBUG: Filtered duplicate request: <GET http://phys.org/archive/5-12-2013/page3.html> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page8.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page6.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Redirecting (301) to <GET http://phys.org/archive/5-12-2013/> from <GET http://phys.org/archive/5-12-2013/page1.html>
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page4.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page7.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page5.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/page3.html> (referer: http://phys.org/archive/5-12-2013/)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] DEBUG: Crawled (200) <GET http://phys.org/archive/5-12-2013/> (referer: http://phys.org/archive/5-12-2013/page2.html)
2014-01-10 13:32:01+0100 [PhysURLCrawlSpider] INFO: Closing spider (finished)

But the spider is not picking up any items. It may or may not be a copy/paste error, but the XPath expression for titles should be //article[@class='news-box news-detail-box clearfix']/h4, i.e. without the extra whitespace before clearfix.
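
As an aside, if a single class is enough to identify these articles (an assumption about the page markup, not something stated in the question), a less brittle XPath is to use contains() on the class attribute instead of matching the full class string exactly:

# Hypothetical variant of the question's selector: contains() tolerates extra
# classes and whitespace in the class attribute, at the cost of also matching
# class values that merely contain "news-detail-box".
titles = hxs.select("//article[contains(@class, 'news-detail-box')]/h4")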

Finally, note that if you use the latest Scrapy version (0.20.0 onwards), you can use CSS selectors, which may be more readable than XPath when selecting elements with multiple classes:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from physurlfetch.items import PhysurlfetchItem
from scrapy.http import Request

class PhysURLSpider(CrawlSpider):
    date = raw_input("Please input a date in the format M-DD-YYYY: ")
    name = "PhysURLCrawlSpider"
    allowed_domains = ["phys.org"]
    start_url_str = ("http://phys.org/archive/%s/") % (date)
    start_urls = [start_url_str]

    rules = (
        Rule (SgmlLinkExtractor(allow=("\d\.html",)),
        callback="parse_items", follow = True),
    )

    #def parse_start_url(self, response):
        #request = Request(start_urls, callback = self.parse_items)


    def parse_items(self, response):
        selector = Selector(response)

        # selecting only using "news-detail-box" class
        # you could use "article.news-box.news-detail-box.clearfix > h4"
        titles = selector.css("article.news-detail-box > h4")

        items = []
        for title in titles:
            item = PhysurlfetchItem()
            item["title"] = title.xpath("a/text()").extract()
            item["link"] = title.xpath("a/@href").extract()
            items.append(item)
        self.log("%d items in %s" % (len(items), response.url))
        return items
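
As a closing note on the page-1 issue mentioned in the question: a CrawlSpider's rules only apply to links extracted from downloaded responses, so the start URL itself is never passed to parse_items. A minimal sketch, assuming the commented-out parse_start_url was meant to cover the first archive page, is to override it and delegate to parse_items:

    def parse_start_url(self, response):
        # CrawlSpider calls parse_start_url() for each response in start_urls;
        # delegating to parse_items() lets the first archive page yield items
        # too, while the Rule keeps following page2.html, page3.html, ...
        return self.parse_items(response)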