Scrapy RSS Scraper

Time: 2017-06-12 19:47:56

Tags: xml xpath scrapy rss

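I am trying to scrape an RSS feed from Yahoo (their open company RSS feed | https://developer.yahoo.com/finance/company.html).

I am trying to scrape the following URL: https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX

For some reason my spider does not run. I think it may be related to the XPath I generated; if not, there may be a problem with how parse_item is defined.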

import scrapy
from scrapy.spiders import CrawlSpider
from YahooScrape.items import YahooScrapeItem

class Spider(CrawlSpider):
    name= "YahooScrape"
    allowed_domains = ["yahoo.com"]
    start_urls = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX',)

   def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = EmperyscraperItem()
        item['title'] = response.xpath('//*[@id="collapsible"]/div[1]/div[2]/span',).extract()                #define XPath for title
        item['link'] = response.xpath('//*[@id="collapsible"]/div[1]/div[2]/span',).extract()                 #define XPath for link
        item['description'] = response.xpath('//*[@id="collapsible"]/div[1]/div[2]/span',).extract()          #define XPath for description
        return item

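What is wrong with the code? If nothing, what are the correct XPaths for extracting the title, description and link? I am new to Scrapy and just need some help figuring it out!

Edit: I have updated my spider and converted it to an XMLFeedSpider, as shown below: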

import scrapy

from scrapy.spiders import XMLFeedSpider
from YahooScrape.items import YahooScrapeItem

class Spider(XMLFeedSpider):
    name = "YahooScrape"
    allowed_domains = ["yahoo.com"]
    start_urls = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX')    #Crawl BPMX
    itertag = 'item'

    def parse_node(self, response, node):
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))

        item = YahooScrapeItem()
        item['title'] = node.xpath('item/title/text()',).extract()                #define XPath for title
        item['link'] = node.xpath('item/link/text()').extract()
        item['pubDate'] = node.xpath('item/link/pubDate/text()').extract()
        item['description'] = node.xpath('item/category/text()').extract()                #define XPath for description
        return item

#Yahoo RSS feeds http://finance.yahoo.com/rss/headline?s=BPMX,APPL

Now I get the following error:

2017-06-13 11:25:57 [scrapy.core.engine] ERROR: Error while obtaining start requests

Any idea why this error occurs? My HTML paths look correct.

1 Answer:

Answer 0: (score: 2)

From what I can see, CrawlSpider only works for HTML responses. I therefore suggest building on the simpler scrapy.Spider or the more specialized XMLFeedSpider.
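
On the "Error while obtaining start requests" message from the edited XMLFeedSpider: the traceback is not shown, so this is an assumption rather than a diagnosis, but one thing worth checking is that start_urls there is written with parentheses and no trailing comma. In Python that is a plain string, not a tuple, and a loop over it yields single characters instead of URLs. The sketch below uses only plain Python; first_start_url is a hypothetical helper added for illustration and is not part of Scrapy.

```python
# Parentheses alone do not create a tuple; the trailing comma does.
as_string = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX')
as_tuple = ('https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX',)


def first_start_url(start_urls):
    """Hypothetical helper: return the first element that a
    for-loop over start_urls would produce."""
    for url in start_urls:
        return url


print(type(as_string).__name__, repr(first_start_url(as_string)))
print(type(as_tuple).__name__, repr(first_start_url(as_tuple)))
```

If this is the cause, writing start_urls as a list, or adding the trailing comma as in the first version of the spider, should be enough to get past that stage.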

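In addition, the XPaths you use in parse_item appear to have been built from the HTML that your browser renders from the XML/RSS feed. The feed itself contains no *[@id="collapsible"] and no <div> elements.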
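
A quick way to see this is to query a trimmed copy of the feed directly. The sketch below is an illustration only: it uses the standard library's xml.etree.ElementTree rather than Scrapy selectors so that it runs on its own, the FEED string is a shortened copy of the real feed with the XML declaration omitted, and extract_items is a hypothetical helper name. It also shows why a path such as item/title finds nothing once the current node is already an <item>, which is the situation in parse_node when itertag is 'item'.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the feed structure; only one <item> is kept.
FEED = """<rss version="2.0">
    <channel>
        <description>Latest Financial News for BPMX</description>
        <item>
            <description>MENLO PARK, Calif., June 7, 2017 ...</description>
            <link>https://finance.yahoo.com/news/biopharmx-report-first-quarter-financial-101500259.html?.tsrc=rss</link>
            <pubDate>Wed, 07 Jun 2017 10:15:00 +0000</pubDate>
            <title>BioPharmX to Report First Quarter Financial Results</title>
        </item>
    </channel>
</rss>"""

root = ET.fromstring(FEED)

# The browser-derived XPath matches nothing: no element has id="collapsible".
browser_matches = root.findall(".//*[@id='collapsible']")


def extract_items(root):
    """Hypothetical helper: one dict per <item>, using paths
    relative to the item node itself."""
    rows = []
    for node in root.iter("item"):
        rows.append({
            "title": node.findtext("title"),
            "link": node.findtext("link"),
            "pubDate": node.findtext("pubDate"),
            "description": node.findtext("description"),
        })
    return rows


items = extract_items(root)

# From an <item> node, "item/title" looks for a nested <item> child.
first_item = next(root.iter("item"))
nested_lookup = first_item.findtext("item/title")

print(browser_matches, nested_lookup, items[0]["title"])
```

In an XMLFeedSpider with itertag = 'item', the analogous Scrapy expressions would be relative ones such as node.xpath('title/text()') and node.xpath('pubDate/text()'), since each node passed to parse_node is already an <item>.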

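Rather than the browser-rendered view, look at the raw feed source:

view-source:https://feeds.finance.yahoo.com/rss/2.0/headline?s=BPMX

This is the XML that a working spider has to match: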

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<rss version="2.0">
    <channel>
        <copyright>Copyright (c) 2017 Yahoo! Inc. All rights reserved.</copyright>
        <description>Latest Financial News for BPMX</description>
        <image>
            <height>45</height>
            <link>http://finance.yahoo.com/q/h?s=BPMX</link>
            <title>Yahoo! Finance: BPMX News</title>
            <url>http://l.yimg.com/a/i/brand/purplelogo/uh/us/fin.gif</url>
            <width>144</width>
        </image>
        <item>
            <description>MENLO PARK, Calif., June 7, 2017 /PRNewswire/ -- BioPharmX Corporation (NYSE MKT: BPMX), a specialty pharmaceutical company focusing on dermatology, today announced that it will release its financial results ...</description>
            <guid isPermaLink="false">f56d5bf8-f278-37fd-9aa5-fe04b2e1fa53</guid>
            <link>https://finance.yahoo.com/news/biopharmx-report-first-quarter-financial-101500259.html?.tsrc=rss</link>
            <pubDate>Wed, 07 Jun 2017 10:15:00 +0000</pubDate>
            <title>BioPharmX to Report First Quarter Financial Results</title>
        </item>