Scrapy - Scraping multiple URLs using the results from the first URL

Asked: 2015-03-11 08:32:53

Tags: python scrapy scrapy-spider

  1. I use Scrapy to scrape data from a first URL.
  2. The response from that first URL contains a list of URLs.
  3. So far so good. My question is: how do I go on and scrape this list of URLs? After some searching, I learned that I can return a Request from parse, but that seems to handle only one URL.

    Here is my parse method:

    def parse(self, response):
        # Get the list of URLs, for example:
        list = ["http://a.com", "http://b.com", "http://c.com"]
        return scrapy.Request(list[0])
        # It works, but how do I continue with b.com and c.com?
    

    Can I do something like this?

    def parse(self, response):
        # Get the list of URLs, for example:
        list = ["http://a.com", "http://b.com", "http://c.com"]
    
        for link in list:
            scrapy.Request(link)
            # This is wrong, though I need something like this
    

    Full version:

    import scrapy
    
    class MySpider(scrapy.Spider):
        name = "mySpider"
        allowed_domains = ["x.com"]
        start_urls = ["http://x.com"]
    
        def parse(self, response):
            # Get the list of URLs, for example:
            list = ["http://a.com", "http://b.com", "http://c.com"]
    
            for link in list:
                scrapy.Request(link)
                # This is wrong, though I need something like this
    

3 Answers:

Answer 0 (score: 5)

I think what you are looking for is the yield statement:

def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]

    for link in list:
        # Yield each request so Scrapy schedules all of them,
        # rather than returning after the first one.
        request = scrapy.Request(link)
        yield request
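
Putting this together with the original spider, a minimal sketch might look like the following (the XPath and the parse_page callback name are assumptions for illustration, not from the original post; also note that allowed_domains must cover every domain you want to follow, or Scrapy's offsite middleware will drop those requests):

import scrapy

class MySpider(scrapy.Spider):
    name = "mySpider"
    # Must include every domain you intend to follow;
    # off-site requests are filtered by the offsite middleware.
    allowed_domains = ["x.com", "a.com", "b.com", "c.com"]
    start_urls = ["http://x.com"]

    def parse(self, response):
        # Placeholder XPath: extract whatever links the first page lists.
        for link in response.xpath('//a/@href').extract():
            # urljoin resolves relative hrefs against the current page URL.
            yield scrapy.Request(response.urljoin(link), callback=self.parse_page)

    def parse_page(self, response):
        # Scrape each followed page here.
        pass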

Answer 1 (score: 1)

To do this, create a subclass of scrapy.Spider and define the list of URLs to start from in start_urls. Scrapy will then fetch each of those URLs automatically and pass every response to parse.

Do something like this:

import scrapy

class YourSpider(scrapy.Spider):
    name = "your_spider"
    allowed_domains = ["a.com", "b.com", "c.com"]
    start_urls = [
        "http://a.com/",
        "http://b.com/",
        "http://c.com/",
    ]

    def parse(self, response):
        # do whatever you want
        pass
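
As a purely illustrative example of what the parse body could do (the title selector and the field names are assumptions), with a reasonably recent Scrapy you can yield plain dicts as items:

    def parse(self, response):
        # One item per page; the <title> text is just an example field.
        yield {
            "url": response.url,
            "title": response.xpath('//title/text()').extract_first(),
        }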

You can find more information in the Scrapy official documentation.

Answer 2 (score: 0)

# within your parse method:

urlList = response.xpath('//a/@href').extract()
print(urlList)  # to see the list of URLs
for url in urlList:
    yield scrapy.Request(url, callback=self.parse)

This should work.
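
One caveat: extract() returns the href values as-is, so relative links need to be joined against the page URL before requesting them, e.g. with response.urljoin (a small variation on the loop above):

for url in urlList:
    # urljoin turns relative hrefs into absolute URLs before requesting them.
    yield scrapy.Request(response.urljoin(url), callback=self.parse)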