Question

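I'm a beginner trying to scrape data from a paginated website.

The code below is based on Selenium. It crashes after about 100 pages, at a rate of roughly 1 page per second. I suspect the website is imposing a limit. Would disabling cookies or sending fake headers avoid this? I haven't tried either so far. I'd really appreciate some good advice. Thanks a lot!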

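import scrapy
from scrapy.selector import Selector
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By


class test1113(scrapy.Spider):
    name = "chrome"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        chromeOptions = webdriver.ChromeOptions()
        # Do not load images, to speed up page rendering.
        prefs = {"profile.managed_default_content_settings.images": 2}
        chromeOptions.add_experimental_option("prefs", prefs)
        # Raw string keeps the backslashes in the Windows path literal (Selenium 4 style call).
        self.driver = webdriver.Chrome(service=Service(r'D:\chrome\chromedriver.exe'), options=chromeOptions)

    def start_requests(self):
        urls = [
            'https://www.emsc-csem.org/Earthquake/?view=1'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            try:
                # Parse what the browser currently shows; `response` only ever holds the first page.
                page = Selector(text=self.driver.page_source)
                table = page.xpath('//table')[3]
                # The leading dot restricts the search to rows inside this table.
                rows = table.xpath('.//tr[@id]')
                for row in rows:
                    temp = row.xpath('td//text()').extract()
                    if len(temp) == 12 and temp[0] == 'earthquake':
                        yield {
                            'time:': ' '.join(temp[1].split()),
                            'latitude:': ' '.join((temp[3] + temp[4]).split()),
                            'longitude:': ' '.join((temp[5] + temp[6]).split()),
                            'depth:': ' '.join(temp[7].split()),
                            'magnitude:': ' '.join(temp[9].split()),
                            'region:': ' '.join(temp[10].split()),
                            'update time:': ' '.join(temp[11].split()),
                        }
                        print('time:', ' '.join(temp[1].split()))
                self.driver.find_elements(By.CSS_SELECTOR, '*[alt="Next page"]')[0].click()
            except (IndexError, WebDriverException):
                # No "Next page" control, an unexpected layout, or a browser error: stop paging.
                break
        # quit() also shuts down the chromedriver process.
        self.driver.quit()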

Answer 1

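Well, this isn't an error; Scrapy is actually working fine.

The main problem I see here is that the content on that page takes some time to load, so your script needs to wait for it. You can do that with either Selenium or Splash.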
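To make the "wait for the content" idea concrete, here is a minimal sketch of a polling wait in plain Python. The helper name `wait_until` and its parameters are hypothetical, chosen only for illustration; it is not part of Scrapy or Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without the condition being met.
    (Hypothetical helper, for illustration only.)
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        # Sleep briefly so the loop does not hammer the browser.
        time.sleep(poll_interval)
```

In a spider like the one in the question, the condition would be a small function that checks whether the table rows for the new page are present in the browser. Selenium ships the same pattern as `WebDriverWait(driver, timeout).until(...)`, which is probably the better choice in real code.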

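My suggestion is to use Splash, since it was developed to be used together with Scrapy. Here are some pointers to get you started.

Official GitHub repository for Scrapy-Splash: https://github.com/scrapy-plugins/scrapy-splash

Documentation: https://splash.readthedocs.io/en/stable/faq.html

Here is an example of integrating Splash with Scrapy: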

https://blog.scrapinghub.com/2015/03/02/handling-javascript-in-scrapy-with-splash
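For orientation, the project settings described in the scrapy-splash README look roughly like the fragment below. Treat it as a sketch: it assumes a Splash instance is already running and reachable at `http://localhost:8050`, which is an assumption and not something stated in the question.

```python
# settings.py (fragment), following the scrapy-splash README.

# Assumption: Splash is running locally on its default port.
SPLASH_URL = 'http://localhost:8050'

# Route requests through Splash and keep cookies in sync with it.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Avoid storing duplicate Splash arguments for every request.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Make duplicate filtering and HTTP caching aware of Splash arguments.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

With these settings in place, the spider would yield `SplashRequest(url, self.parse, args={'wait': 0.5})` (imported from `scrapy_splash`) in place of `scrapy.Request`, so the page is rendered before `parse` sees it. The `wait` value here is only a starting point and would need tuning for this site.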

Hope this helps.

Debugged with Scrapy 1.6.0: Crawled (200)
