Question

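I recently started using Scrapy, and I am trying to use "XMLFeedSpider" to extract and load the pages listed in an XML page. The problem is that it returns an error: "IndexError: list index out of range".

I am trying to collect and load all the product pages listed at this address:
"http://www.example.com/feed.xml"

My spider: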

from scrapy.spiders import XMLFeedSpider

class PartySpider(XMLFeedSpider):
    name = 'example'
    allowed_domains = ['http://www.example.com']

    start_urls = [      
        'http://www.example.com/feed.xml'
    ]   

    itertag = 'loc'

    def parse_node(self, response, node): 
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag,''.join(node.extract()))

Answer 1

This is how your XML input starts:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.example.htm</loc></url>
<url><loc>http://www.example.htm</loc></url>
(...)

There is a bug in XMLFeedSpider when it uses the (default) iternodes iterator and the XML document uses a namespace. See this archived discussion on the scrapy-users mailing list.
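The namespace effect behind this kind of failure can be reproduced without Scrapy. The sketch below is not Scrapy's iternodes code: it uses Python's standard xml.etree.ElementTree, and the two URLs in the sample are made up for illustration. It shows that a bare loc lookup matches nothing in a document with a default namespace, so taking the first match raises IndexError, which is one plausible reading of the error in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap; the two URLs are placeholders, not from the real feed.
SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.example.com/a.htm</loc></url>
<url><loc>http://www.example.com/b.htm</loc></url>
</urlset>"""

root = ET.fromstring(SITEMAP)

# A bare tag name is looked up in no namespace, so nothing matches here.
bare = root.findall(".//loc")

# Indexing the first element of an empty result raises IndexError.
try:
    bare[0]
    error = None
except IndexError as exc:
    error = exc

# Binding a prefix to the namespace URI makes the same lookup succeed.
NS = {"n": "http://www.sitemaps.org/schemas/sitemap/0.9"}
qualified = root.findall(".//n:loc", NS)

print(len(bare), type(error).__name__, len(qualified))
```

The prefixed lookup is the same idea the spider below relies on through its namespaces and itertag attributes.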

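This spider works. It changes the iterator to xml, which lets you reference the namespace, here http://www.sitemaps.org/schemas/sitemap/0.9, through a prefix n (it could be anything, really), and uses that namespace prefix on the tag to look for, here n:loc: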

from scrapy.spiders import XMLFeedSpider

class PartySpider(XMLFeedSpider):
    name = 'example'
    allowed_domains = ['example.com']

    start_urls = [      
        'http://www.example.com/example.xml'
    ]   

    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:loc'
    iterator = 'xml'

    def parse_node(self, response, node): 
        self.logger.info('Hi, this is a <%s> node!: %s', self.itertag,''.join(node.extract()))
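The remark that the prefix "could be anything" can be checked outside Scrapy. This is a minimal sketch using the standard library rather than XMLFeedSpider; the helper name loc_urls and the sample URLs are assumptions made for illustration. Only the namespace URI has to match the document, and the prefix is just a local alias for it.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical sitemap; the two URLs are placeholders.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url><loc>http://www.example.com/a.htm</loc></url>
<url><loc>http://www.example.com/b.htm</loc></url>
</urlset>"""


def loc_urls(xml_bytes, prefix):
    """Return the text of every <loc> element, addressed through `prefix`."""
    root = ET.fromstring(xml_bytes)
    namespaces = {prefix: SITEMAP_NS}  # prefix is a local alias for the URI
    return [
        (el.text or "").strip()
        for el in root.iterfind(f".//{prefix}:loc", namespaces)
    ]


# Two different prefixes bound to the same URI select the same elements.
urls_n = loc_urls(SAMPLE, "n")
urls_sm = loc_urls(SAMPLE, "sm")
print(urls_n == urls_sm, urls_n)
```

In the spider itself, the URL strings would come from the node selector inside parse_node instead of from a helper like this.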

How to extract URLs from XML with Scrapy - XMLFeedSpider?
