Scrapy - Scraping multiple URLs using the results from the first URL

Asked: 2015-03-11 08:32:53

Tags: python scrapy scrapy-spider

  1. I use Scrapy to scrape data from a first URL.
  2. The response from that first URL contains a list of URLs.
  3. So far so good. My question is: how do I go on and scrape this list of URLs? After some searching, I learned that I can return a Request from parse, but that seems to handle only one URL.

    Here is my parse method:

    def parse(self, response):
        # Get the list of URLs, for example:
        list = ["http://a.com", "http://b.com", "http://c.com"]
        return scrapy.Request(list[0])
        # It works, but how do I continue with b.com and c.com?
    

    Can I do something like this?

    def parse(self, response):
        # Get the list of URLs, for example:
        list = ["http://a.com", "http://b.com", "http://c.com"]
    
        for link in list:
            scrapy.Request(link)
            # This is wrong, though I need something like this
    

    Full version:

    import scrapy
    
    class MySpider(scrapy.Spider):
        name = "mySpider"
        allowed_domains = ["x.com"]
        start_urls = ["http://x.com"]
    
        def parse(self, response):
            # Get the list of URLs, for example:
            list = ["http://a.com", "http://b.com", "http://c.com"]
    
            for link in list:
                scrapy.Request(link)
                # This is wrong, though I need something like this
    

3 Answers:

Answer 0 (score: 5)

I think what you are looking for is the yield statement:

def parse(self, response):
    # Get the list of URLs, for example:
    list = ["http://a.com", "http://b.com", "http://c.com"]

    for link in list:
        # Yield each request so Scrapy schedules all of them,
        # rather than returning after the first one.
        request = scrapy.Request(link)
        yield request
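
Putting this together with the original spider, a minimal sketch might look like the following (the XPath and the parse_page callback name are assumptions for illustration, not from the original post; also note that allowed_domains must cover every domain you want to follow, or Scrapy's offsite middleware will drop those requests):

import scrapy

class MySpider(scrapy.Spider):
    name = "mySpider"
    # Must include every domain you intend to follow;
    # off-site requests are filtered by the offsite middleware.
    allowed_domains = ["x.com", "a.com", "b.com", "c.com"]
    start_urls = ["http://x.com"]

    def parse(self, response):
        # Placeholder XPath: extract whatever links the first page lists.
        for link in response.xpath('//a/@href').extract():
            # urljoin resolves relative hrefs against the current page URL.
            yield scrapy.Request(response.urljoin(link), callback=self.parse_page)

    def parse_page(self, response):
        # Scrape each followed page here.
        pass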

Answer 1 (score: 1)

To do this, create a subclass of scrapy.Spider and define the list of URLs to start from in start_urls. Scrapy will then fetch each of those URLs automatically and pass every response to parse.

Do something like this:

import scrapy

class YourSpider(scrapy.Spider):
    name = "your_spider"
    allowed_domains = ["a.com", "b.com", "c.com"]
    start_urls = [
        "http://a.com/",
        "http://b.com/",
        "http://c.com/",
    ]

    def parse(self, response):
        # do whatever you want
        pass
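
As a purely illustrative example of what the parse body could do (the title selector and the field names are assumptions), with a reasonably recent Scrapy you can yield plain dicts as items:

    def parse(self, response):
        # One item per page; the <title> text is just an example field.
        yield {
            "url": response.url,
            "title": response.xpath('//title/text()').extract_first(),
        }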

You can find more information in the Scrapy official documentation.

Answer 2 (score: 0)

# within your parse method:

urlList = response.xpath('//a/@href').extract()
print(urlList)  # to see the list of URLs
for url in urlList:
    yield scrapy.Request(url, callback=self.parse)

This should work.
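
One caveat: extract() returns the href values as-is, so relative links need to be joined against the page URL before requesting them, e.g. with response.urljoin (a small variation on the loop above):

for url in urlList:
    # urljoin turns relative hrefs into absolute URLs before requesting them.
    yield scrapy.Request(response.urljoin(url), callback=self.parse)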