Scrapy - how to crawl new pages based on links in scraped items

Date: 2014-05-27 05:46:33

Tags: python scrapy

I'm new to Scrapy, and I'm trying to crawl new pages from the links contained in scraped items. Specifically, I want to scrape some Dropbox file-sharing links from Google search results and store those links in a JSON file. After getting the links, I want to open a new page for each one to verify whether the link is valid. If it is valid, I want to store the filename in the JSON file as well.

I use a DropboxItem with the attributes 'link', 'filename', 'status', and 'err_msg' to store each scraped item, and in the parse function I try to initiate an asynchronous request for each scraped link. But it seems parse_file_page is never called. Does anyone know how to implement such a two-step crawl?
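
For reference, DropboxItem is defined roughly like this (a sketch of the standard Scrapy Item definition; the four fields are the ones listed above):

    from scrapy.item import Item, Field

    class DropboxItem(Item):
        link = Field()
        filename = Field()
        status = Field()
        err_msg = Field()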

    from scrapy.spider import Spider
    from scrapy.selector import Selector
    from scrapy.http import Request
    from tutorial.items import DropboxItem  # project is named 'tutorial' per the logs below

    class DropboxSpider(Spider):
        name = "dropbox"
        allowed_domains = ["google.com"]
        start_urls = [
            "https://www.google.com/#filter=0&q=site:www.dropbox.com/s/&start=0"
        ]

        def parse(self, response):
            sel = Selector(response)
            sites = sel.xpath("//h3[@class='r']")
            items = []
            for site in sites:
                item = DropboxItem()
                link = site.xpath('a/@href').extract()
                item['link'] = link
                link = ''.join(link)
                #I want to parse a new page with url=link here
                new_request = Request(link, callback=self.parse_file_page)
                new_request.meta['item'] = item
                items.append(item)
            return items

        def parse_file_page(self, response):
            #item passed from request
            item = response.meta['item']
            #selector
            sel = Selector(response)
            content_area = sel.xpath("//div[@id='shmodel-content-area']")
            filename_area = content_area.xpath("div[@class='filename shmodel-filename']")
            if filename_area:
                filename = filename_area.xpath("span[@id]/text()").extract()
                if filename:
                    item['filename'] = filename             
                    item['status'] = "normal"
            else:
                err_area = content_area.xpath("div[@class='err']")
                if err_area:
                    err_msg = err_area.xpath("h3/text()").extract()
                    item['err_msg'] = err_msg
                    item['status'] = "error"
            return item

Thanks to @ScrapyNovice for the answer. I modified the code, which now looks like this:

def parse(self, response):
    sel = Selector(response)
    sites = sel.xpath("//h3[@class='r']")
    #items = []
    for site in sites:
        item = DropboxItem()
        link = site.xpath('a/@href').extract()
        item['link'] = link
        link = ''.join(link)
        print 'link!!!!!!=', link
        new_request = Request(link, callback=self.parse_file_page)
        new_request.meta['item'] = item
        yield new_request
        #items.append(item)
    yield item
    return
    #return item   #Note, when I simply return item here, got an error msg "SyntaxError: 'return' with argument inside generator"

def parse_file_page(self, response):
    #item passed from request
    print 'parse_file_page!!!'
    item = response.meta['item']
    #selector
    sel = Selector(response)
    content_area = sel.xpath("//div[@id='shmodel-content-area']")
    filename_area = content_area.xpath("div[@class='filename shmodel-filename']")
    if filename_area:
        filename = filename_area.xpath("span[@id]/text()").extract()
        if filename:
            item['filename'] = filename
            item['status'] = "normal"
            item['err_msg'] = "none"
            print 'filename=', filename
    else:
        err_area = content_area.xpath("div[@class='err']")
        if err_area:
            err_msg = err_area.xpath("h3/text()").extract()
            item['filename'] = "null"
            item['err_msg'] = err_msg
            item['status'] = "error"
            print 'err_msg', err_msg
        else:
            item['filename'] = "null"
            item['err_msg'] = "unknown_err"
            item['status'] = "error"
            print 'unknown err'
    return item

The control flow actually becomes very strange. When I use "scrapy crawl dropbox -o items_dropbox.json -t json" to crawl a local file (a downloaded page of Google search results), I see output like:

2014-05-31 08:40:35-0400 [scrapy] INFO: Scrapy 0.22.2 started (bot: tutorial)
2014-05-31 08:40:35-0400 [scrapy] INFO: Optional features available: ssl, http11
2014-05-31 08:40:35-0400 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['tutorial.spiders'], 'FEED_URI': 'items_dropbox.json', 'BOT_NAME': 'tutorial'}
2014-05-31 08:40:35-0400 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-05-31 08:40:35-0400 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-05-31 08:40:35-0400 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-05-31 08:40:35-0400 [scrapy] INFO: Enabled item pipelines: 
2014-05-31 08:40:35-0400 [dropbox] INFO: Spider opened
2014-05-31 08:40:35-0400 [dropbox] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-05-31 08:40:35-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-05-31 08:40:35-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-05-31 08:40:35-0400 [dropbox] DEBUG: Crawled (200) <GET file:///home/xin/Downloads/dropbox_s/dropbox_s_1-Google.html> (referer: None)
link!!!!!!= http://www.google.com/intl/en/webmasters/#utm_source=en-wmxmsg&utm_medium=wmxmsg&utm_campaign=bm&authuser=0
link!!!!!!= https://www.dropbox.com/s/
2014-05-31 08:40:35-0400 [dropbox] DEBUG: Filtered offsite request to 'www.dropbox.com': <GET https://www.dropbox.com/s/>
link!!!!!!= https://www.dropbox.com/s/awg9oeyychug66w
link!!!!!!= http://www.dropbox.com/s/kfmoyq9y4vrz8fm
link!!!!!!= https://www.dropbox.com/s/pvsp4uz6gejjhel
....  many links here
link!!!!!!= https://www.dropbox.com/s/gavgg48733m3918/MailCheck.xlsx
link!!!!!!= http://www.dropbox.com/s/9x8924gtb52ksn6/Phonesky.apk
2014-05-31 08:40:35-0400 [dropbox] DEBUG: Scraped from <200 file:///home/xin/Downloads/dropbox_s/dropbox_s_1-Google.html>
    {'link': [u'http://www.dropbox.com/s/9x8924gtb52ksn6/Phonesky.apk']}
2014-05-31 08:40:35-0400 [dropbox] DEBUG: Crawled (200) <GET http://www.google.com/intl/en/webmasters/#utm_source=en-wmxmsg&utm_medium=wmxmsg&utm_campaign=bm&authuser=0> (referer: file:///home/xin/Downloads/dropbox_s/dropbox_s_1-Google.html)
parse_file_page!!!
unknown err
2014-05-31 08:40:35-0400 [dropbox] DEBUG: Scraped from <200 http://www.google.com/intl/en/webmasters/>
    {'err_msg': 'unknown_err',
     'filename': 'null',
     'link': [u'http://www.google.com/intl/en/webmasters/#utm_source=en-wmxmsg&utm_medium=wmxmsg&utm_campaign=bm&authuser=0'],
     'status': 'error'}
2014-05-31 08:40:35-0400 [dropbox] INFO: Closing spider (finished)
2014-05-31 08:40:35-0400 [dropbox] INFO: Stored json feed (2 items) in: items_dropbox.json
2014-05-31 08:40:35-0400 [dropbox] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 558,
     'downloader/request_count': 2,
     'downloader/request_method_count/GET': 2,
     'downloader/response_bytes': 449979,
     'downloader/response_count': 2,
     'downloader/response_status_count/200': 2,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 5, 31, 12, 40, 35, 348058),
     'item_scraped_count': 2,
     'log_count/DEBUG': 7,
     'log_count/INFO': 8,
     'request_depth_max': 1,
     'response_received_count': 2,
     'scheduler/dequeued': 2,
     'scheduler/dequeued/memory': 2,
     'scheduler/enqueued': 2,
     'scheduler/enqueued/memory': 2,
     'start_time': datetime.datetime(2014, 5, 31, 12, 40, 35, 249309)}
2014-05-31 08:40:35-0400 [dropbox] INFO: Spider closed (finished)

Now the JSON file contains only:

[{"link": ["http://www.dropbox.com/s/9x8924gtb52ksn6/Phonesky.apk"]},
{"status": "error", "err_msg": "unknown_err", "link": ["http://www.google.com/intl/en/webmasters/#utm_source=en-wmxmsg&utm_medium=wmxmsg&utm_campaign=bm&authuser=0"], "filename": "null"}]

1 Answer:

Answer 0 (score: 4):

You're creating a Request and setting its callback nicely, but you never yield it, so the request is never scheduled.

        for site in sites:
            item = DropboxItem()
            link = site.xpath('a/@href').extract()
            item['link'] = link
            link = ''.join(link)
            #I want to parse a new page with url=link here
            new_request = Request(link, callback=self.parse_file_page)
            new_request.meta['item'] = item
            yield new_request
            # Don't do this here because you're adding your Item twice.
            #items.append(item)

On more of a design level: you're storing all of the scraped items in the items array at the end of parse(), but pipelines generally expect to receive individual Items, not arrays of them. Get rid of the items array and you'll be able to use Scrapy's built-in JSON Feed Export to store the results in JSON format.
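
For illustration, a pipeline's process_item() receives a single Item at a time (a hypothetical pipeline, named here just to show the interface):

    class DropboxValidationPipeline(object):
        def process_item(self, item, spider):
            # `item` is one DropboxItem here, never a list of them
            return item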

Update:

The reason you're getting an error message when you try to return an item is that using yield inside a function turns it into a generator. This lets you call the function repeatedly: each time it reaches a yield, it hands back the value being yielded, but remembers its state and which line it was executing, and the next time the generator is called it resumes from where it left off. When it has nothing left to yield, it raises a StopIteration exception. In Python 2, you can't mix yield and return (with a value) in the same function.
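
As a toy example (Python 2 syntax, to match your setup):

    def count_to_two():
        yield 1   # first call: return 1 and pause here
        yield 2   # second call: resume here and return 2
        # falling off the end raises StopIteration

    gen = count_to_two()
    print gen.next()   # 1
    print gen.next()   # 2
    print gen.next()   # raises StopIteration

    # This, however, is a SyntaxError in Python 2:
    # def broken():
    #     yield 1
    #     return 1   # SyntaxError: 'return' with argument inside generator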

You don't want to yield any items from parse(), because at that point they're still missing fields like filename and status.

The requests that parse() makes are to dropbox.com, right? They're not going through because dropbox is not in the spider's allowed_domains. (Hence the log message: DEBUG: Filtered offsite request to 'www.dropbox.com': <GET https://www.dropbox.com/s/>)
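
Assuming you actually want those dropbox.com requests to go through, the simplest fix is to whitelist that domain as well:

    allowed_domains = ["google.com", "dropbox.com"]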

The one request that is actually valid and not filtered goes to http://www.google.com/intl/en/webmasters/#utm_source=en-wmxmsg&utm_medium=wmxmsg&utm_campaign=bm&authuser=0, which is a Google page, not a DropBox one. You may want to use urlparse to check the link's domain in your parse() method before making the request.
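
Here's a minimal sketch of that check (urlparse is in the Python 2 standard library; in Python 3 it's urllib.parse), assuming you only want to follow dropbox.com links:

    from urlparse import urlparse

    def parse(self, response):
        sel = Selector(response)
        for site in sel.xpath("//h3[@class='r']"):
            link = ''.join(site.xpath('a/@href').extract())
            # skip Google's own links and anything else that is off-site
            if urlparse(link).netloc not in ('www.dropbox.com', 'dropbox.com'):
                continue
            item = DropboxItem()
            item['link'] = link
            new_request = Request(link, callback=self.parse_file_page)
            new_request.meta['item'] = item
            yield new_request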

As for your results: the first JSON object

{"link": ["http://www.dropbox.com/s/9x8924gtb52ksn6/Phonesky.apk"]}

comes from the yield item call in your parse() method. There's only one of them because your yield isn't inside any kind of loop, so when the generator resumes execution it simply runs the next line, return, which exits the generator. You'll notice that this item is missing all the fields you fill in in the parse_file_page() method, which is why you don't want to yield any items in your parse() method.

Your second JSON object

{
 "status": "error", 
 "err_msg": "unknown_err", 
 "link": ["http://www.google.com/intl/en/webmasters/#utm_source=en-wmxmsg&utm_medium=wmxmsg&utm_campaign=bm&authuser=0"], 
 "filename": "null"
}

is the result of trying to parse one of Google's pages as if it were the DropBox page you were expecting. You're yielding multiple requests in your parse() method, and all but one of them point to dropbox.com. All of the DropBox links get dropped because they're not in allowed_domains, so the only response you get is for the one other link on the page that matches your xpath selector and whose domain is in allowed_domains: the Google webmasters link. That's why you only see parse_file_page!!! once in your output.

I suggest you learn more about generators, since they're a fundamental part of using Scrapy. The second Google result for "python generator tutorial" looks like a very good place to start.