Scrapy XMLFeedSpider: parsing the XML media:group element

Time: 2017-09-21 08:39:20

Tags: xml scrapy

I'm having trouble with XPath selectors in Scrapy: I can't parse the media tags. Could you help with some ideas or sample code? Thanks. Here is my spider:

import scrapy
from scrapy.spiders import XMLFeedSpider
from crawler.items import News

class CNNSpider(XMLFeedSpider):
    name = "cnn"
    start_urls = [
        'http://rss.cnn.com/rss/edition.rss', # Top stories
        #'http://rss.cnn.com/rss/cnn_latest.rss', # most recent
    ]
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        item = News()
        item['title'] = node.xpath('./title/text()').extract()
        item['description'] = node.xpath('./description/text()').extract()
        item['link'] = node.xpath('./link/text()').extract()
        item['media'] = node.xpath("./media:group/media:content/@url").extract()
        item['pubDate'] = node.xpath('./pubDate/text()').extract()
        print(item['media'])
        return item  # hand the populated item back to Scrapy

My XML feed:

<item>
    <title><![CDATA[More than 200 dead in Mexico quake, buildings toppled]]></title>
    <link>http://www.cnn.com/collections/mexico-city-earthquake-intl/</link>
    <guid isPermaLink="true">http://www.cnn.com/collections/mexico-city-earthquake-intl/</guid>
    <pubDate>Wed, 20 Sep 2017 10:03:24 GMT</pubDate>
    <media:group>
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-super-169.jpg" height="619" width="1100" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-large-11.jpg" height="300" width="300" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-vertical-large-gallery.jpg" height="552" width="414" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-video-synd-2.jpg" height="480" width="640" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-live-video.jpg" height="324" width="576" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-t1-main.jpg" height="250" width="250" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-vertical-gallery.jpg" height="360" width="270" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-story-body.jpg" height="169" width="300" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-t1-main.jpg" height="250" width="250" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-assign.jpg" height="186" width="248" />
        <media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-hp-video.jpg" height="144" width="256" />
    </media:group>
</item>

1 answer:

Answer 0 (score: 0)

You need to use the XPath below:

item['media'] = node.xpath("./*[local-name()='group']/*[local-name()='content']/@url").extract()
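The local-name() test matches an element by its tag name regardless of which namespace it belongs to. As a rough illustration outside Scrapy, here is a sketch using Python's standard-library xml.etree.ElementTree (Python 3.8 or later), where the {*} wildcard plays a similar role. The trimmed feed and the example.com URLs are made up for the example; only the media namespace URI comes from the answer.

```python
import xml.etree.ElementTree as ET

# Trimmed, made-up feed for illustration; the URLs are placeholders.
FEED = """<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <item>
      <title>Example story</title>
      <media:group>
        <media:content medium="image" url="http://example.com/a.jpg" />
        <media:content medium="image" url="http://example.com/b.jpg" />
      </media:group>
    </item>
  </channel>
</rss>"""

item = ET.fromstring(FEED).find("./channel/item")

# The parser stores a namespaced tag as "{namespace-uri}local-name",
# which is why a bare "group" or an unbound "media:group" does not match.
group = item.find("./{*}group")
print(group.tag)  # {http://search.yahoo.com/mrss/}group

# "{*}name" matches "name" in any namespace, much like local-name()='name'.
urls = [content.get("url") for content in item.findall("./{*}group/{*}content")]
print(urls)  # ['http://example.com/a.jpg', 'http://example.com/b.jpg']
```

The trade-off is the same in both libraries: a namespace-agnostic match is short to write, but it would also match a group element from any other namespace that happened to appear in the feed.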

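The underlying problem is that these nodes are in a namespace, and the selector cannot resolve the media prefix until it is bound to the namespace URI. Alternatively, you can register the namespace inside the parse_node function and keep your original XPath: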

node.register_namespace("media", "http://search.yahoo.com/mrss/")
item['media'] = node.xpath("./media:group/media:content/@url").extract()