Question

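I am trying to scrape all hospital data from this site: https://www.german-hospital-directory.com/search/Bundesland/Baden-Wuerttemberg.html

Looking at the requests, the page sends a form request, and I cannot access the result through scrapy shell.

The response payload contains the entire HTML content. How can I extract each hospital's data (for example URL, NAME, IMAGE) and iterate over all hospitals? Any help would be appreciated, as I am new to this.

Do I need to use Selenium, or can this be done with Scrapy alone?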

Answer 1

You first need to GET this URL (to receive the cookie): https://www.german-hospital-directory.com/search/Bundesland/Baden-Wuerttemberg.html

Then you need to GET this URL, which returns the search results: https://www.german-hospital-directory.com/search/_files/main-search/Suchergebnis.jsf

Something like this:

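import scrapy

class HospitalSpider(scrapy.Spider):
    # Class name and spider name are placeholders; pick your own.
    name = "hospitals"
    # First request: fetched only so the site sets its session cookie.
    start_urls = ['https://www.german-hospital-directory.com/search/Bundesland/Baden-Wuerttemberg.html']

    def parse(self, response):
        # Second request: Scrapy's cookie middleware sends the cookie from the first response along.
        yield scrapy.Request(
            url="https://www.german-hospital-directory.com/search/_files/main-search/Suchergebnis.jsf",
            callback=self.parse_hospitals,
        )

    def parse_hospitals(self, response):
        # Here you have the hospitals data (the HTML of the result list).
        ...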
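
For the extraction step the question asks about (URL, name, image), here is a minimal sketch. The markup is an assumption: the sample HTML, the `hospital` class name, and the helper names `HospitalParser` and `extract_hospitals` are hypothetical and have to be adapted to whatever the Suchergebnis.jsf response really contains. It uses only the standard library so that it can be tried without a network connection.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical markup, NOT taken from the real site.
SAMPLE_HTML = """
<div class="hospital">
  <img src="/img/a.png">
  <a href="/hospital/a.html">Klinik A</a>
</div>
<div class="hospital">
  <a href="/hospital/b.html">Klinik B</a>
</div>
"""


class HospitalParser(HTMLParser):
    """Collects url, name and image for each <div class="hospital"> block.

    The class name "hospital" is an assumption for illustration only.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.hospitals = []
        self._current = None    # record being filled, or None outside a block
        self._depth = 0         # <div> nesting depth inside the current block
        self._in_link = False   # True while inside the block's first <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._current is None:
            classes = (attrs.get("class") or "").split()
            if tag == "div" and "hospital" in classes:
                self._current = {"url": None, "name": "", "image": None}
                self._depth = 1
            return
        if tag == "div":
            self._depth += 1
        elif tag == "a" and self._current["url"] is None and attrs.get("href"):
            # Resolve relative links against the page URL.
            self._current["url"] = urljoin(self.base_url, attrs["href"])
            self._in_link = True
        elif tag == "img" and self._current["image"] is None and attrs.get("src"):
            self._current["image"] = urljoin(self.base_url, attrs["src"])

    def handle_data(self, data):
        # The text of the first link is taken as the hospital name.
        if self._in_link and self._current is not None:
            self._current["name"] += data

    def handle_endtag(self, tag):
        if self._current is None:
            return
        if tag == "a":
            self._in_link = False
        elif tag == "div":
            self._depth -= 1
            if self._depth == 0:
                # The block is closed: store the finished record.
                self._current["name"] = self._current["name"].strip()
                self.hospitals.append(self._current)
                self._current = None
                self._in_link = False


def extract_hospitals(html, base_url):
    """Return a list of dicts with the keys url, name and image."""
    parser = HospitalParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.hospitals


if __name__ == "__main__":
    for hospital in extract_hospitals(SAMPLE_HTML, "https://example.com/search/"):
        print(hospital)
```

Inside the spider, `parse_hospitals` could then do `yield from extract_hospitals(response.text, response.url)`, since Scrapy accepts plain dicts as items. Scrapy's own `response.css(...)` or `response.xpath(...)` selectors are the more usual way to do the same thing once the real element names are known.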
