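I am new to Scrapy and am just trying to scrape Hacker News. I can get all the links and titles from the site, but empty titles and links are also scraped along with the real data. How can I avoid this, or did I make a mistake when declaring the XPaths?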
spider.py
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from hn.items import HnItem


class HNSpider(BaseSpider):
    name = "hn"
    # allowed_domains takes bare domain names, not full URLs
    allowed_domains = ["news.ycombinator.com"]
    start_urls = [
        "https://news.ycombinator.com/"
    ]

    def parse(self, response):
        selector = Selector(response)
        sites = selector.xpath('//td[@class="title"]')
        items = []
        for site in sites:
            item = HnItem()
            item['title'] = site.xpath('a/text()').extract()
            item['link'] = site.xpath('a/@href').extract()
            items.append(item)
        for item in items:
            yield item
Output
2013-12-12 11:50:46+0530 [hn] DEBUG: Crawled (200) <GET https://news.ycombinator.com/> (referer: None)
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=11171475'],
'title': [u'Backpacker stripped of tech gear at Auckland Airport']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://sivers.org/ws'], 'title': [u'Why was this secret?']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://www.theatlantic.com/politics/archive/2013/12/how-americans-were-deceived-about-cell-phone-location-data/282239/'],
'title': [u'How Americans Were Deceived About Cell Phone Location Data']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://www.rockpapershotgun.com/2013/12/11/youtube-blocks-game-videos-industry-offers-help/'],
'title': [u'YouTube Blocks Game Videos, Industry Offers Help']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://blog.fsck.com/2013/12/better-and-better-keyboards.html'],
'title': [u'Prototype ergonomic mechanical keyboards']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://www.timmins.net/2013/12/11/how-att-verizon-and-comcast-are-working-together-to-screw-you-by-discontinuing-landline-service/'],
'title': [u'How AT&T, Verizon, and Comcast are working together to screw you']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://blog.samaltman.com/h5n1'], 'title': [u'H5N1']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://www.digitaltrends.com/gadgets/parents-dislike-infant-seat-ipad-mount/'],
'title': [u'Parents Revolt Over Fisher-Price Infant Seat With Face-Level iPad Mount ']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'https://www.fsf.org/news/reform-corporate-surveillance'],
'title': [u'Reform corporate surveillance']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://googledrive.blogspot.com/2013/12/newsheets.html?m=1'],
'title': [u'New Google Sheets: faster, more powerful, and works offline']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://blogs.marketwatch.com/thetell/2013/12/11/fidelity-now-allows-clients-to-put-bitcoins-in-iras/'],
'title': [u'Fidelity now allows clients to put bitcoins in IRAs']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://bitmason.blogspot.ca/2013/09/what-are-containers-anyway.html'],
'title': [u'What are Linux containers and how did they come about?']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://www.cbc.ca/news/canada/ottawa/canada-post-to-phase-out-urban-home-mail-delivery-1.2459618'],
'title': [u'Canada Post to phase out urban home mail delivery']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://www.reuters.com/article/2013/12/11/fda-antibiotic-idUSL3N0JQ36T20131211'],
'title': [u'U.S. FDA to phase out some antibiotic use in animal production']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'https://lists.gnu.org/archive/html/guix-devel/2013-12/msg00061.html'],
'title': [u'GNU Guix 0.5 released']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'https://sites.google.com/site/ancientbharat/home'],
'title': [u'Ancient Indian Texts']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://www.creativebloq.com/responsive-design-tools-8134180'],
'title': [u'Responsive design tools']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://www.keacher.com/1216/how-i-introduced-a-27-year-old-computer-to-the-web/'],
'title': [u'How I introduced a 27-year-old computer to the web']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://blog.sendtoinc.com/2013/12/11/silicon-valley-internship-j1-visa/'],
'title': [u'How to intern in Silicon Valley with a J1 visa']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'https://www.crowdtilt.com/campaigns/project-marilyn-part-i?utm_source=HackerNews&utm_medium=HNPost&utm_campaign=ProjectMarilyn'],
'title': [u'Project Marilyn Part I: Non-Patented Cancer Pharmaceutical']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://steamcommunity.com/groups/steamuniverse#announcements/detail/1930088300965516570'],
'title': [u'Steam Machines and Steam Controller shipping to beta participants December 13th']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://blog.alexmaccaw.com/an-engineers-guide-to-stock-options'],
'title': [u'An Engineer\u2019s guide to Stock Options']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://www.vim3d.com/'],
'title': [u'Vim3D \u2013 A new 3D vi clone [video]']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://da-data.blogspot.com/2013/12/briefly-profitable-alt-coin-mining-on.html'],
'title': [u'Briefly profitable alt-coin mining on Amazon through better code']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://blog.jetbrains.com/idea/2013/12/intellij-idea-13-brings-a-full-bag-of-goodies-to-android-developers/'],
'title': [u'IntelliJ IDEA 13 Brings a Full Bag of Goodies to Android Developers']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://crowdmed.theresumator.com/apply/'],
'title': [u'CrowdMed (YC W13) is hiring a VP of Marketing + Web Dev and Design Interns']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://jh3y.github.io/tyto/'], 'title': [u'Show HN: tyto']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://www.washingtonpost.com/blogs/the-switch/wp/2013/12/10/nsa-uses-google-cookies-to-pinpoint-targets-for-hacking/'],
'title': [u'NSA uses Google cookies to pinpoint targets for hacking']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'https://access.redhat.com/site/products/Red_Hat_Enterprise_Linux/Get-Beta?intcmp=70160000000cINoAAM'],
'title': [u'Red Hat Enterprise Linux 7 Beta']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [], 'title': []}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'http://thenextweb.com/dd/2013/12/11/digia-releases-qt-5-2-android-ios-support-previews-windows-rt-launches-qt-mobile-edition/'],
'title': [u'Digia releases Qt 5.2 with Android and iOS support']}
2013-12-12 11:50:46+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
{'link': [u'news2'], 'title': [u'More']}
2013-12-12 11:50:46+0530 [hn] INFO: Closing spider (finished)
As you may have noticed from the output, items with an empty title [] and link [] keep repeating.
How can I fix this? Please help.
Answer 0 (score: 1)
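There are a few ways to do this, namely:
Use an item pipeline that drops the empty items:
from scrapy.exceptions import DropItem


class DropEmptyPipeline(object):
    def process_item(self, item, spider):
        # extract() returns a list, so test for non-empty values;
        # the keys themselves are always present once the spider sets them.
        if item.get("title") and item.get("link"):
            return item
        raise DropItem("Missing title or link in %s" % item)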
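A pipeline only runs once it is enabled in the project settings. A minimal sketch of that setting, assuming the pipeline class is saved in hn/pipelines.py; the module path and the priority value 300 are assumptions made for illustration, and older Scrapy releases expected a list of class paths here rather than a dict:

```python
# settings.py (fragment)
# Assumed module path: hn/pipelines.py containing DropEmptyPipeline.
# The number is the pipeline's order; lower values run first.
ITEM_PIPELINES = {
    "hn.pipelines.DropEmptyPipeline": 300,
}
```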
Or check in the spider's parse loop before appending the item:
if item['title'] and item['link']:
    items.append(item)
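The empty items can also be avoided at the source by narrowing the XPath. The alternating pattern in the output suggests that each story row has two td cells with class "title": one holding the rank and one holding the link. That is an assumption about the page markup at the time, not something confirmed here. The sketch below uses the standard library's ElementTree on a hypothetical, simplified snippet (not the real page) to show the difference between selecting every cell and selecting only the anchors inside those cells:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for two rows of the front page.
SAMPLE = """
<table>
  <tr>
    <td class="title">1.</td>
    <td class="title"><a href="http://sivers.org/ws">Why was this secret?</a></td>
  </tr>
  <tr>
    <td class="title">2.</td>
    <td class="title"><a href="http://blog.samaltman.com/h5n1">H5N1</a></td>
  </tr>
</table>
"""


def scrape_cells(root):
    """One item per td.title cell; rank cells have no <a>, so they come out empty."""
    items = []
    for td in root.findall(".//td[@class='title']"):
        anchor = td.find("a")
        items.append({
            "title": [anchor.text] if anchor is not None else [],
            "link": [anchor.get("href")] if anchor is not None else [],
        })
    return items


def scrape_anchors(root):
    """One item per <a> directly inside a td.title cell; no empty items."""
    return [
        {"title": [a.text], "link": [a.get("href")]}
        for a in root.findall(".//td[@class='title']/a")
    ]


root = ET.fromstring(SAMPLE)
broad = scrape_cells(root)
narrow = scrape_anchors(root)
print(len(broad), len(narrow))
```

In the spider, the corresponding change would be to select '//td[@class="title"]/a' and then read 'text()' and '@href' relative to each anchor. The "More" link at the bottom of the page would still match and may need to be filtered separately.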