Scrapy spider scrapes extra empty items

Asked: 2013-12-12 06:26:09

Tags: scrapy

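I'm new to Scrapy and am trying to scrape Hacker News. I can get all the links and titles from the site, but items with an empty title and link are also scraped alongside the real data. How can I avoid this, or did I make a mistake when declaring the XPaths?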

spider.py

from scrapy.spider import BaseSpider
from scrapy.selector import Selector

from hn.items import HnItem

class HNSpider(BaseSpider):
    name = "hn"
    allowed_domains = ["news.ycombinator.com"]  # domain names only, not URLs
    start_urls = [
        "https://news.ycombinator.com/"
    ]

    def parse(self, response):
        selector = Selector(response)
        sites = selector.xpath('//td[@class="title"]')
        items = []
        for site in sites:
            item = HnItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            items.append(item)
        for item in items:
            yield item

Output

2013-12-12 11:50:46+0530 [hn] DEBUG: Crawled (200) <GET https://news.ycombinator.com/> (referer: None)
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=11171475'],
         'title': [u'Backpacker stripped of tech gear at Auckland Airport']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://sivers.org/ws'], 'title': [u'Why was this secret?']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://www.theatlantic.com/politics/archive/2013/12/how-americans-were-deceived-about-cell-phone-location-data/282239/'],
         'title': [u'How Americans Were Deceived About Cell Phone Location Data']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://www.rockpapershotgun.com/2013/12/11/youtube-blocks-game-videos-industry-offers-help/'],
         'title': [u'YouTube Blocks Game Videos, Industry Offers Help']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://blog.fsck.com/2013/12/better-and-better-keyboards.html'],
         'title': [u'Prototype ergonomic mechanical keyboards']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://www.timmins.net/2013/12/11/how-att-verizon-and-comcast-are-working-together-to-screw-you-by-discontinuing-landline-service/'],
         'title': [u'How AT&T, Verizon, and Comcast are working together to screw you']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://blog.samaltman.com/h5n1'], 'title': [u'H5N1']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://www.digitaltrends.com/gadgets/parents-dislike-infant-seat-ipad-mount/'],
         'title': [u'Parents Revolt Over Fisher-Price Infant Seat With Face-Level iPad Mount ']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'https://www.fsf.org/news/reform-corporate-surveillance'],
         'title': [u'Reform corporate surveillance']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://googledrive.blogspot.com/2013/12/newsheets.html?m=1'],
         'title': [u'New Google Sheets: faster, more powerful, and works offline']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://blogs.marketwatch.com/thetell/2013/12/11/fidelity-now-allows-clients-to-put-bitcoins-in-iras/'],
         'title': [u'Fidelity now allows clients to put bitcoins in IRAs']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://bitmason.blogspot.ca/2013/09/what-are-containers-anyway.html'],
         'title': [u'What are Linux containers and how did they come about?']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://www.cbc.ca/news/canada/ottawa/canada-post-to-phase-out-urban-home-mail-delivery-1.2459618'],
         'title': [u'Canada Post to phase out urban home mail delivery']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://www.reuters.com/article/2013/12/11/fda-antibiotic-idUSL3N0JQ36T20131211'],
         'title': [u'U.S. FDA to phase out some antibiotic use in animal production']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'https://lists.gnu.org/archive/html/guix-devel/2013-12/msg00061.html'],
         'title': [u'GNU Guix 0.5 released']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'https://sites.google.com/site/ancientbharat/home'],
         'title': [u'Ancient Indian Texts']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://www.creativebloq.com/responsive-design-tools-8134180'],
         'title': [u'Responsive design tools']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://www.keacher.com/1216/how-i-introduced-a-27-year-old-computer-to-the-web/'],
         'title': [u'How I introduced a 27-year-old computer to the web']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://blog.sendtoinc.com/2013/12/11/silicon-valley-internship-j1-visa/'],
         'title': [u'How to intern in Silicon Valley with a J1 visa']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'https://www.crowdtilt.com/campaigns/project-marilyn-part-i?utm_source=HackerNews&utm_medium=HNPost&utm_campaign=ProjectMarilyn'],
         'title': [u'Project Marilyn Part I: Non-Patented Cancer Pharmaceutical']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://steamcommunity.com/groups/steamuniverse#announcements/detail/1930088300965516570'],
         'title': [u'Steam Machines and Steam Controller shipping to beta participants December 13th']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://blog.alexmaccaw.com/an-engineers-guide-to-stock-options'],
         'title': [u'An Engineer\u2019s guide to Stock Options']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://www.vim3d.com/'],
         'title': [u'Vim3D \u2013 A new 3D vi clone [video]']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://da-data.blogspot.com/2013/12/briefly-profitable-alt-coin-mining-on.html'],
         'title': [u'Briefly profitable alt-coin mining on Amazon through better code']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://blog.jetbrains.com/idea/2013/12/intellij-idea-13-brings-a-full-bag-of-goodies-to-android-developers/'],
         'title': [u'IntelliJ IDEA 13 Brings a Full Bag of Goodies to Android Developers']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://crowdmed.theresumator.com/apply/'],
         'title': [u'CrowdMed (YC W13) is hiring a VP of Marketing + Web Dev and Design Interns']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://jh3y.github.io/tyto/'], 'title': [u'Show HN: tyto']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://www.washingtonpost.com/blogs/the-switch/wp/2013/12/10/nsa-uses-google-cookies-to-pinpoint-targets-for-hacking/'],
         'title': [u'NSA uses Google cookies to pinpoint targets for hacking']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'https://access.redhat.com/site/products/Red_Hat_Enterprise_Linux/Get-Beta?intcmp=70160000000cINoAAM'],
         'title': [u'Red Hat Enterprise Linux 7 Beta']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'http://thenextweb.com/dd/2013/12/11/digia-releases-qt-5-2-android-ios-support-previews-windows-rt-launches-qt-mobile-edition/'],
         'title': [u'Digia releases Qt 5.2 with Android and iOS support']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': [u'news2'], 'title': [u'More']}
2013-12-12 11:50:46+0530 [hn] INFO: Closing spider (finished)

As you can see from the output, items with an empty 'title': [] and 'link': [] keep repeating.

How can I fix this? Please help.

1 answer:

Answer 0 (score: 1)

There are a few ways to do this:

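  1. With a Scrapy item pipeline (http://doc.scrapy.org/en/latest/topics/item-pipeline.html): you can add a simple pipeline that drops an item when its title or link is empty.
    from scrapy.exceptions import DropItem

    class DropEmptyPipeline(object):
        def process_item(self, item, spider):
            # The spider always assigns both fields (possibly to []), so test
            # for non-empty values rather than for the presence of the keys.
            if item.get("title") and item.get("link"):
                return item
            raise DropItem("Missing title or link in %s" % item)
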
  2. In the spider, by not appending an item to the items list when its title or link is empty:
    if item.get("title") and item.get("link"):
        items.append(item)
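
A third option, not part of the original answer, is to tighten the XPath so that cells without a link are never selected. The alternating empty items in the output suggest that every other td with class "title" has no direct a child (possibly the cell holding the rank number); that is an assumption about the page, not something confirmed here. The sketch below uses the standard library's xml.etree.ElementTree on a hypothetical, simplified fragment (SAMPLE and the extract helper are made up for illustration) to compare the two selections:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the listing markup (not the real page).
SAMPLE = """
<table>
  <tr>
    <td class="title">1.</td>
    <td class="title"><a href="http://sivers.org/ws">Why was this secret?</a></td>
  </tr>
  <tr>
    <td class="title">2.</td>
    <td class="title"><a href="http://blog.samaltman.com/h5n1">H5N1</a></td>
  </tr>
</table>
"""


def extract(root, cell_path):
    """Build one dict per matched cell, mirroring the spider's item fields."""
    rows = []
    for cell in root.findall(cell_path):
        links = cell.findall("a")  # direct <a> children only, like 'a/text()'
        rows.append({
            "title": [a.text for a in links],
            "link": [a.get("href") for a in links],
        })
    return rows


root = ET.fromstring(SAMPLE)

# Original selection: every td.title cell, including those without a link.
all_cells = extract(root, ".//td[@class='title']")

# Narrowed selection: only td.title cells that have an <a> child.
link_cells = extract(root, ".//td[@class='title'][a]")

print(len(all_cells), len(link_cells))
```

If the assumption about the markup holds, the equivalent change in the spider would be to select with '//td[@class="title"][a]' instead of '//td[@class="title"]', so no empty items are created and no later filtering is needed.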