Why does Scrapy return duplicate results?

Asked: 2014-11-15 00:07:54

Tags: python scrapy web-crawler

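I'm trying out Scrapy and have run into a problem: my script returns duplicate results. I'm trying to scrape the URLs from a parent page and then follow each URL to get its associated date. After each nested URL is scraped, the script seems to output the full list of URLs from the parent page again.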

Here is the script:


    import scrapy
    from aeon.items import AeonItem
    from scrapy.http.request import Request

    class AeonSpider(scrapy.Spider):
        name = "aeon"
        allowed_domains = ["aeon.co"]
        start_urls = [
            "http://aeon.co/magazine/technology"
        ]

        def parse(self, response):
            items = []
            for sel in response.xpath('//*[@id="latestPosts"]'):
                item = AeonItem()
                item['primary_url'] = sel.xpath('div/div/div/a/@href').extract()    

                for each in item['primary_url']:
                    yield Request(each, callback=self.parse_next_page,meta={'item':item})

        def parse_next_page(self, response):
            for sel in response.xpath('//*[@id="top"]'):
                item = response.meta['item']
                item['comments'] =  sel.xpath('div[5]/div[3]/div[2]/div/p/em/span[@class="instapaper_datepublished"]/text()').extract()
                return item

Here is the JSON output:


    {"comments": ["13 February 2014"], "primary_url": ["http://aeon.co/magazine/science/the-search-for-quantum-gravity/", "http://aeon.co/magazine/philosophy/should-generation-ted-take-a-more-sceptical-view/", "http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/", "http://aeon.co/video/technology/analogue-people-in-a-digital-age-a-short-film-about-technology/", "http://aeon.co/video/technology/boxa-short-film-about-projection-mapping/", "http://aeon.co/video/technology/how-to-sharpen-pencils-a-short-film-about-a-master-artisan/", "http://aeon.co/magazine/technology/do-we-want-minority-report-policing/", "http://aeon.co/magazine/health/can-you-have-self-worth-without-self-love/", "http://aeon.co/magazine/technology/i-learnt-to-survive-like-an-11th-century-farmer/", "http://aeon.co/magazine/technology/can-tiny-plankton-help-reverse-climate-change/", "http://aeon.co/magazine/technology/are-halophytes-the-crop-of-the-future/", "http://aeon.co/magazine/technology/how-will-sexbots-change-human-relationships/", "http://aeon.co/video/technology/robotic-cheetah-a-short-film-about-biomimetic-robotics/", "http://aeon.co/video/technology/internet-archive-a-short-film-about-accessing-knowledge/", "http://aeon.co/video/technology/a-tiny-planet-a-short-film-about-wondrous-video-technology/", "http://aeon.co/magazine/culture/there-is-fortuitous-beauty-in-a-brute-force-attack/", "http://aeon.co/magazine/technology/can-we-design-systems-to-automate-ethics/", "http://aeon.co/magazine/technology/before-minecraft-or-snapchat-there-was-micromuse/", "http://aeon.co/magazine/technology/meet-darpas-new-generation-of-humanoid-robots/", "http://aeon.co/magazine/technology/the-problem-with-too-much-information/", "http://aeon.co/magazine/technology/can-nyc-be-completely-self-reliant/", "http://aeon.co/video/technology/theo-a-short-film-about-the-wind-eating-strandbeest/", "http://aeon.co/video/technology/terminal-a-short-film-about-the-mechanical-ballet-of-cargo/", 
"http://aeon.co/video/technology/metropolis-ii-a-short-film-about-the-city-of-tomorrow/", "http://aeon.co/magazine/culture/digital-art-should-be-about-possibilities-not-technicalities/", "http://aeon.co/magazine/society/can-sustainability-really-hope-to-beat-consumerism/", "http://aeon.co/magazine/technology/is-technology-making-the-world-too-complex/", "http://aeon.co/magazine/culture/creepypasta-is-how-the-internet-learns-our-fears/", "http://aeon.co/magazine/technology/virtual-afterlives-will-transform-humanity/", "http://aeon.co/magazine/technology/what-will-happen-to-my-online-identity-when-i-die/", "http://aeon.co/magazine/society/what-does-silicon-valley-tell-us-about-innovation/", "http://aeon.co/magazine/technology/the-rise-of-biotechnology-and-the-loss-of-scientific-neutrality/", "http://aeon.co/magazine/culture/why-i-gave-up-living-in-an-off-grid-commune/"]}
    {"comments": ["31 January 2014"], "primary_url": ["http://aeon.co/magazine/science/the-search-for-quantum-gravity/", "http://aeon.co/magazine/philosophy/should-generation-ted-take-a-more-sceptical-view/", "http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/", "http://aeon.co/video/technology/analogue-people-in-a-digital-age-a-short-film-about-technology/", "http://aeon.co/video/technology/boxa-short-film-about-projection-mapping/", "http://aeon.co/video/technology/how-to-sharpen-pencils-a-short-film-about-a-master-artisan/", "http://aeon.co/magazine/technology/do-we-want-minority-report-policing/", "http://aeon.co/magazine/health/can-you-have-self-worth-without-self-love/", "http://aeon.co/magazine/technology/i-learnt-to-survive-like-an-11th-century-farmer/", "http://aeon.co/magazine/technology/can-tiny-plankton-help-reverse-climate-change/", "http://aeon.co/magazine/technology/are-halophytes-the-crop-of-the-future/", "http://aeon.co/magazine/technology/how-will-sexbots-change-human-relationships/", "http://aeon.co/video/technology/robotic-cheetah-a-short-film-about-biomimetic-robotics/", "http://aeon.co/video/technology/internet-archive-a-short-film-about-accessing-knowledge/", "http://aeon.co/video/technology/a-tiny-planet-a-short-film-about-wondrous-video-technology/", "http://aeon.co/magazine/culture/there-is-fortuitous-beauty-in-a-brute-force-attack/", "http://aeon.co/magazine/technology/can-we-design-systems-to-automate-ethics/", "http://aeon.co/magazine/technology/before-minecraft-or-snapchat-there-was-micromuse/", "http://aeon.co/magazine/technology/meet-darpas-new-generation-of-humanoid-robots/", "http://aeon.co/magazine/technology/the-problem-with-too-much-information/", "http://aeon.co/magazine/technology/can-nyc-be-completely-self-reliant/", "http://aeon.co/video/technology/theo-a-short-film-about-the-wind-eating-strandbeest/", "http://aeon.co/video/technology/terminal-a-short-film-about-the-mechanical-ballet-of-cargo/", 
"http://aeon.co/video/technology/metropolis-ii-a-short-film-about-the-city-of-tomorrow/", "http://aeon.co/magazine/culture/digital-art-should-be-about-possibilities-not-technicalities/", "http://aeon.co/magazine/society/can-sustainability-really-hope-to-beat-consumerism/", "http://aeon.co/magazine/technology/is-technology-making-the-world-too-complex/", "http://aeon.co/magazine/culture/creepypasta-is-how-the-internet-learns-our-fears/", "http://aeon.co/magazine/technology/virtual-afterlives-will-transform-humanity/", "http://aeon.co/magazine/technology/what-will-happen-to-my-online-identity-when-i-die/", "http://aeon.co/magazine/society/what-does-silicon-valley-tell-us-about-innovation/", "http://aeon.co/magazine/technology/the-rise-of-biotechnology-and-the-loss-of-scientific-neutrality/", "http://aeon.co/magazine/culture/why-i-gave-up-living-in-an-off-grid-commune/"]}
    {"comments": ["12 March 2014"], "primary_url": ["http://aeon.co/magazine/science/the-search-for-quantum-gravity/", "http://aeon.co/magazine/philosophy/should-generation-ted-take-a-more-sceptical-view/", "http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/", "http://aeon.co/video/technology/analogue-people-in-a-digital-age-a-short-film-about-technology/", "http://aeon.co/video/technology/boxa-short-film-about-projection-mapping/", "http://aeon.co/video/technology/how-to-sharpen-pencils-a-short-film-about-a-master-artisan/", "http://aeon.co/magazine/technology/do-we-want-minority-report-policing/", "http://aeon.co/magazine/health/can-you-have-self-worth-without-self-love/", "http://aeon.co/magazine/technology/i-learnt-to-survive-like-an-11th-century-farmer/", "http://aeon.co/magazine/technology/can-tiny-plankton-help-reverse-climate-change/", "http://aeon.co/magazine/technology/are-halophytes-the-crop-of-the-future/", "http://aeon.co/magazine/technology/how-will-sexbots-change-human-relationships/", "http://aeon.co/video/technology/robotic-cheetah-a-short-film-about-biomimetic-robotics/", "http://aeon.co/video/technology/internet-archive-a-short-film-about-accessing-knowledge/", "http://aeon.co/video/technology/a-tiny-planet-a-short-film-about-wondrous-video-technology/", "http://aeon.co/magazine/culture/there-is-fortuitous-beauty-in-a-brute-force-attack/", "http://aeon.co/magazine/technology/can-we-design-systems-to-automate-ethics/", "http://aeon.co/magazine/technology/before-minecraft-or-snapchat-there-was-micromuse/", "http://aeon.co/magazine/technology/meet-darpas-new-generation-of-humanoid-robots/", "http://aeon.co/magazine/technology/the-problem-with-too-much-information/", "http://aeon.co/magazine/technology/can-nyc-be-completely-self-reliant/", "http://aeon.co/video/technology/theo-a-short-film-about-the-wind-eating-strandbeest/", "http://aeon.co/video/technology/terminal-a-short-film-about-the-mechanical-ballet-of-cargo/", 
"http://aeon.co/video/technology/metropolis-ii-a-short-film-about-the-city-of-tomorrow/", "http://aeon.co/magazine/culture/digital-art-should-be-about-possibilities-not-technicalities/", "http://aeon.co/magazine/society/can-sustainability-really-hope-to-beat-consumerism/", "http://aeon.co/magazine/technology/is-technology-making-the-world-too-complex/", "http://aeon.co/magazine/culture/creepypasta-is-how-the-internet-learns-our-fears/", "http://aeon.co/magazine/technology/virtual-afterlives-will-transform-humanity/", "http://aeon.co/magazine/technology/what-will-happen-to-my-online-identity-when-i-die/", "http://aeon.co/magazine/society/what-does-silicon-valley-tell-us-about-innovation/", "http://aeon.co/magazine/technology/the-rise-of-biotechnology-and-the-loss-of-scientific-neutrality/", "http://aeon.co/magazine/culture/why-i-gave-up-living-in-an-off-grid-commune/"]}
    {"comments": ["31 March 2014"], "primary_url": ["http://aeon.co/magazine/science/the-search-for-quantum-gravity/", "http://aeon.co/magazine/philosophy/should-generation-ted-take-a-more-sceptical-view/", "http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/", "http://aeon.co/video/technology/analogue-people-in-a-digital-age-a-short-film-about-technology/", "http://aeon.co/video/technology/boxa-short-film-about-projection-mapping/", "http://aeon.co/video/technology/how-to-sharpen-pencils-a-short-film-about-a-master-artisan/", "http://aeon.co/magazine/technology/do-we-want-minority-report-policing/", "http://aeon.co/magazine/health/can-you-have-self-worth-without-self-love/", "http://aeon.co/magazine/technology/i-learnt-to-survive-like-an-11th-century-farmer/", "http://aeon.co/magazine/technology/can-tiny-plankton-help-reverse-climate-change/", "http://aeon.co/magazine/technology/are-halophytes-the-crop-of-the-future/", "http://aeon.co/magazine/technology/how-will-sexbots-change-human-relationships/", "http://aeon.co/video/technology/robotic-cheetah-a-short-film-about-biomimetic-robotics/", "http://aeon.co/video/technology/internet-archive-a-short-film-about-accessing-knowledge/", "http://aeon.co/video/technology/a-tiny-planet-a-short-film-about-wondrous-video-technology/", "http://aeon.co/magazine/culture/there-is-fortuitous-beauty-in-a-brute-force-attack/", "http://aeon.co/magazine/technology/can-we-design-systems-to-automate-ethics/", "http://aeon.co/magazine/technology/before-minecraft-or-snapchat-there-was-micromuse/", "http://aeon.co/magazine/technology/meet-darpas-new-generation-of-humanoid-robots/", "http://aeon.co/magazine/technology/the-problem-with-too-much-information/", "http://aeon.co/magazine/technology/can-nyc-be-completely-self-reliant/", "http://aeon.co/video/technology/theo-a-short-film-about-the-wind-eating-strandbeest/", "http://aeon.co/video/technology/terminal-a-short-film-about-the-mechanical-ballet-of-cargo/", 
"http://aeon.co/video/technology/metropolis-ii-a-short-film-about-the-city-of-tomorrow/", "http://aeon.co/magazine/culture/digital-art-should-be-about-possibilities-not-technicalities/", "http://aeon.co/magazine/society/can-sustainability-really-hope-to-beat-consumerism/", "http://aeon.co/magazine/technology/is-technology-making-the-world-too-complex/", "http://aeon.co/magazine/culture/creepypasta-is-how-the-internet-learns-our-fears/", "http://aeon.co/magazine/technology/virtual-afterlives-will-transform-humanity/", "http://aeon.co/magazine/technology/what-will-happen-to-my-online-identity-when-i-die/", "http://aeon.co/magazine/society/what-does-silicon-valley-tell-us-about-innovation/", "http://aeon.co/magazine/technology/the-rise-of-biotechnology-and-the-loss-of-scientific-neutrality/", "http://aeon.co/magazine/culture/why-i-gave-up-living-in-an-off-grid-commune/"]}
    {"comments": ["30 May 2014"], "primary_url": ["http://aeon.co/magazine/science/the-search-for-quantum-gravity/", "http://aeon.co/magazine/philosophy/should-generation-ted-take-a-more-sceptical-view/", "http://aeon.co/magazine/technology/the-elon-musk-interview-on-mars/", "http://aeon.co/video/technology/analogue-people-in-a-digital-age-a-short-film-about-technology/", "http://aeon.co/video/technology/boxa-short-film-about-projection-mapping/", "http://aeon.co/video/technology/how-to-sharpen-pencils-a-short-film-about-a-master-artisan/", "http://aeon.co/magazine/technology/do-we-want-minority-report-policing/", "http://aeon.co/magazine/health/can-you-have-self-worth-without-self-love/", "http://aeon.co/magazine/technology/i-learnt-to-survive-like-an-11th-century-farmer/", "http://aeon.co/magazine/technology/can-tiny-plankton-help-reverse-climate-change/", "http://aeon.co/magazine/technology/are-halophytes-the-crop-of-the-future/", "http://aeon.co/magazine/technology/how-will-sexbots-change-human-relationships/", "http://aeon.co/video/technology/robotic-cheetah-a-short-film-about-biomimetic-robotics/", "http://aeon.co/video/technology/internet-archive-a-short-film-about-accessing-knowledge/", "http://aeon.co/video/technology/a-tiny-planet-a-short-film-about-wondrous-video-technology/", "http://aeon.co/magazine/culture/there-is-fortuitous-beauty-in-a-brute-force-attack/", "http://aeon.co/magazine/technology/can-we-design-systems-to-automate-ethics/", "http://aeon.co/magazine/technology/before-minecraft-or-snapchat-there-was-micromuse/", "http://aeon.co/magazine/technology/meet-darpas-new-generation-of-humanoid-robots/", "http://aeon.co/magazine/technology/the-problem-with-too-much-information/", "http://aeon.co/magazine/technology/can-nyc-be-completely-self-reliant/", "http://aeon.co/video/technology/theo-a-short-film-about-the-wind-eating-strandbeest/", "http://aeon.co/video/technology/terminal-a-short-film-about-the-mechanical-ballet-of-cargo/", 
"http://aeon.co/video/technology/metropolis-ii-a-short-film-about-the-city-of-tomorrow/", "http://aeon.co/magazine/culture/digital-art-should-be-about-possibilities-not-technicalities/", "http://aeon.co/magazine/society/can-sustainability-really-hope-to-beat-consumerism/", "http://aeon.co/magazine/technology/is-technology-making-the-world-too-complex/", "http://aeon.co/magazine/culture/creepypasta-is-how-the-internet-learns-our-fears/", "http://aeon.co/magazine/technology/virtual-afterlives-will-transform-humanity/", "http://aeon.co/magazine/technology/what-will-happen-to-my-online-identity-when-i-die/", "http://aeon.co/magazine/society/what-does-silicon-valley-tell-us-about-innovation/", "http://aeon.co/magazine/technology/the-rise-of-biotechnology-and-the-loss-of-scientific-neutrality/", "http://aeon.co/magazine/culture/why-i-gave-up-living-in-an-off-grid-commune/"]}

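To reiterate, I can't get the script to output a single list of URLs from the parent page along with a list of the corresponding dates from each nested URL. I'm new to Scrapy and Python, so I hope someone can point me in the right direction.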

1 answer:

Answer 0 (score: 0)

Your code is iterating over the wrong thing.

The response.xpath('//*[@id="latestPosts"]') call returns a list that contains just one selector, and that one selector holds all of the article links.
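To make the difference concrete, here is a minimal sketch that uses the standard library's xml.etree.ElementTree in place of Scrapy's selectors. The markup and the three URLs are made-up stand-ins that only mirror the div/div/div/a nesting from the question, so treat it as an illustration of the idea rather than the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the listing page markup.
HTML = """
<html><body>
  <div id="latestPosts">
    <div><div>
      <div><a href="http://aeon.co/magazine/technology/a/">A</a></div>
      <div><a href="http://aeon.co/magazine/technology/b/">B</a></div>
      <div><a href="http://aeon.co/magazine/technology/c/">C</a></div>
    </div></div>
  </div>
</body></html>
"""

root = ET.fromstring(HTML)

# Loop from the question: the id matches a single container element...
containers = root.findall(".//*[@id='latestPosts']")
# ...so the inner query gathers every link on the page into one list.
all_links = [a.get("href") for a in containers[0].findall("div/div/div/a")]

# Narrower path: one element per article entry, each holding one link.
entries = root.findall(".//*[@id='latestPosts']/div/div/div")
links_per_entry = [[a.get("href") for a in e.findall("a")] for e in entries]

print("containers:", len(containers), "links in first container:", len(all_links))
print("entries:", len(entries), "links in first entry:", links_per_entry[0])
```

With the container path the loop body runs once and sees every link; with the narrower path it runs once per article.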

Try changing the loop to:

    for sel in response.xpath('//*[@id="latestPosts"]/div/div/div'):
        item = AeonItem()
        item['primary_url'] = sel.xpath('./a/@href').extract()

        ...

You may want to apply the same change in the other callback - I'll leave the rest for you to enjoy. =)
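The effect on the output can be sketched without Scrapy at all. The simulation below is only an approximation of the request and callback flow: FAKE_DATES, the placeholder date strings, and the function names are all hypothetical, and plain dicts stand in for AeonItem. It compares one item shared by every request (the pattern in the question) with one item per link.

```python
# Scrapy-free simulation; every name and value here is hypothetical.
FAKE_DATES = {
    "http://aeon.co/magazine/technology/a/": "date-a",
    "http://aeon.co/magazine/technology/b/": "date-b",
}


def parse_next_page(url, item):
    """Stand-in for the second callback: fill in the date, emit a snapshot."""
    item["comments"] = [FAKE_DATES[url]]
    # Copy the item, roughly like an exporter serializing it when returned.
    return dict(item)


def crawl_shared(urls):
    """Pattern from the question: one item with every URL, reused per request."""
    item = {"primary_url": list(urls)}
    return [parse_next_page(url, item) for url in urls]


def crawl_per_link(urls):
    """Pattern after the change: a fresh item for each link."""
    return [parse_next_page(url, {"primary_url": [url]}) for url in urls]


urls = list(FAKE_DATES)
shared = crawl_shared(urls)
per_link = crawl_per_link(urls)

for record in shared:
    print("shared:  ", record)
for record in per_link:
    print("per link:", record)
```

In the shared version every record repeats the whole URL list and only the date changes, which matches the shape of the JSON in the question. In the per-link version each record pairs one URL with its own date.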
