Question

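The XML feed I'm scraping has several thousand items. I'd like to know whether there is a way to split the load, or some other approach, to significantly reduce the run time. It currently takes two minutes to iterate over all the XML at the link below. Any suggestions would be appreciated.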

Example: https://www.cityblueshop.com/sitemap_products_1.xml

from scrapy.spiders import XMLFeedSpider
from learning.items import TestItem
class MySpider(XMLFeedSpider):
    name = 'testing'
    allowed_domains = ['www.cityblueshop.com']
    start_urls = ['https://www.cityblueshop.com/sitemap_products_1.xml'] 

    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'
    iterator = 'xml'


    def parse_node(self, response, node):

        item = TestItem()
        item['url'] = node.xpath('.//n:loc/text()').extract()


        return item

Processing all items takes 2 minutes. Is there any way to make this faster with Scrapy?

Answer 1

I tested the following spider locally:

from scrapy.spiders import XMLFeedSpider

class MySpider(XMLFeedSpider):
    name = 'testing'
    allowed_domains = ['www.cityblueshop.com']
    start_urls = ['https://www.cityblueshop.com/sitemap_products_1.xml']

    namespaces = [('n', 'http://www.sitemaps.org/schemas/sitemap/0.9')]
    itertag = 'n:url'
    iterator = 'xml'


    def parse_node(self, response, node):
        yield {'url': node.xpath('.//n:loc/text()').get()}

It runs in under 3 seconds, including Scrapy core startup and everything else.

Make sure the time isn't being lost somewhere else, for example in the learning module from which you import your item subclass.
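
One way to check this, as a minimal sketch, is to time the import on its own, outside of Scrapy. The helper name time_import is hypothetical, and colorsys is only a stand-in from the standard library; inside the project you would pass "learning.items" instead, whose contents are not shown in the question.

```python
import importlib
import sys
import time


def time_import(module_name):
    """Return the seconds spent importing module_name.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    # Drop the cached module object so its top-level code runs again.
    # Submodules that were already imported stay cached.
    sys.modules.pop(module_name, None)
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in module; replace with "learning.items" inside the project.
    elapsed = time_import("colorsys")
    print(f"import took {elapsed:.3f}s")
```

If the import alone accounts for a large share of the two minutes, the slowdown is in that module (or in item pipelines and middlewares it pulls in), not in the XML iteration.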

Answer 2

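Try increasing CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, and CONCURRENT_REQUESTS_PER_IP; see, for example, https://doc.scrapy.org/en/latest/topics/settings.html#concurrent-requests-per-domain. Keep in mind, though, that along with higher speed this can lower your success rate, e.g. many 429 responses, bans, and so on.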
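
As a rough sketch, these settings could go in the project's settings.py (or in a spider's custom_settings dict). The numbers are illustrative assumptions, not tuned recommendations. The AutoThrottle lines are an optional addition that the answer does not mention; they are included here only as one possible way to back off when the server starts to slow down.

```python
# settings.py (illustrative values, not recommendations)

# Total concurrent requests across the whole crawl (Scrapy's default is 16).
CONCURRENT_REQUESTS = 32

# Concurrent requests to any single domain (Scrapy's default is 8).
CONCURRENT_REQUESTS_PER_DOMAIN = 16

# 0 (the default) disables the per-IP limit. If set to a non-zero value,
# it is used instead of the per-domain limit.
CONCURRENT_REQUESTS_PER_IP = 0

# Optional assumption, not from the answer: let AutoThrottle adjust the
# delay based on observed latency to reduce the risk of 429s and bans.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_TARGET_CONCURRENCY = 8.0
```

These limits only matter when a spider issues many requests. A spider that fetches a single sitemap URL, like the one in the question, will likely see little benefit from them.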

How to make a Scrapy XMLFeedSpider faster
