I'm having trouble with XPath selectors in Scrapy: I can't parse the media tags. Could you help with some ideas or sample code? Thanks. Here is my spider:
import scrapy
from scrapy.spiders import XMLFeedSpider
from crawler.items import News


class CNNSpider(XMLFeedSpider):
    name = "cnn"
    start_urls = [
        'http://rss.cnn.com/rss/edition.rss',  # Top stories
        # 'http://rss.cnn.com/rss/cnn_latest.rss',  # Most recent
    ]
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        item = News()
        item['title'] = node.xpath('./title/text()').extract()
        item['description'] = node.xpath('./description/text()').extract()
        item['link'] = node.xpath('./link/text()').extract()
        # This is the line that fails: the "media" prefix is not registered
        item['media'] = node.xpath("./media:group/media:content/@url").extract()
        item['pubDate'] = node.xpath('./pubDate/text()').extract()
        print(item['media'])
My XML feed:
<item>
<title><![CDATA[More than 200 dead in Mexico quake, buildings toppled]]></title>
<link>http://www.cnn.com/collections/mexico-city-earthquake-intl/</link>
<guid isPermaLink="true">http://www.cnn.com/collections/mexico-city-earthquake-intl/</guid>
<pubDate>Wed, 20 Sep 2017 10:03:24 GMT</pubDate>
<media:group>
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-super-169.jpg" height="619" width="1100" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-large-11.jpg" height="300" width="300" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-vertical-large-gallery.jpg" height="552" width="414" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-video-synd-2.jpg" height="480" width="640" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-live-video.jpg" height="324" width="576" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-t1-main.jpg" height="250" width="250" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-vertical-gallery.jpg" height="360" width="270" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-story-body.jpg" height="169" width="300" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-t1-main.jpg" height="250" width="250" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-assign.jpg" height="186" width="248" />
<media:content medium="image" url="http://i2.cdn.turner.com/cnnnext/dam/assets/170919190244-25-mexico-earthquake-0919-hp-video.jpg" height="144" width="256" />
</media:group>
</item>
Answer 0 (score: 0)
You need to use the XPath below:
item['media'] = node.xpath("./*[local-name()='group']/*[local-name()='content']/@url").extract()
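Basically, the problem is that these nodes use a namespace, and the selector does not know what the "media" prefix refers to. Alternatively, you can register the namespace inside the parse_node function so that the prefixed XPath works: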
node.register_namespace("media", "http://search.yahoo.com/mrss/")
item['media'] = node.xpath("./media:group/media:content/@url").extract()
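To try the two ideas outside of a crawl, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of Scrapy's selectors, so it is an analogy and not the Scrapy API itself. The feed string and its example.com URLs are made up for illustration; the namespace URI is the one from the answer above. Passing a prefix-to-URI mapping plays the role of register_namespace(), and the "{*}" wildcard (Python 3.8+) plays the role of local-name().

```python
import xml.etree.ElementTree as ET

# Namespace URI for the "media" prefix, taken from the answer above.
MEDIA_NS = "http://search.yahoo.com/mrss/"

# Trimmed, hypothetical feed: the URLs are placeholders, not real CNN assets.
FEED = """<rss xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
  <channel>
    <item>
      <title>Example story</title>
      <media:group>
        <media:content medium="image" url="http://example.com/a.jpg" />
        <media:content medium="image" url="http://example.com/b.jpg" />
      </media:group>
    </item>
  </channel>
</rss>"""

item = ET.fromstring(FEED).find("./channel/item")

# Option 1: bind the prefix to its URI (analogous to register_namespace()).
by_prefix = [
    el.get("url")
    for el in item.findall("./media:group/media:content", {"media": MEDIA_NS})
]

# Option 2: ignore the namespace and match on the local name only
# (analogous to local-name() in full XPath). Requires Python 3.8 or newer.
by_local_name = [el.get("url") for el in item.findall("./{*}group/{*}content")]

print(by_prefix)
print(by_local_name)
```

Both lists should contain the same two URLs. In the spider itself, the equivalent choices are the two XPath variants shown in the answer.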