Question

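I want to scrape all company names from https://www.gpw.pl/spolki. In addition, I want to click "Pokaż więcej..." ("Show more" in English) so that every company name gets scraped.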

My initial code is:

import scrapy

from gpw_scraping.items import FinalItem


class ScrapeMovies(scrapy.Spider):
    name = 'GpwScraping'

    start_urls = [
        'https://www.gpw.pl/spolki'
    ]

    def parse(self, response):
        for row in response.xpath('//tbody[@id="search-result"]//tr'):
            item = FinalItem()
            # Relative XPath (leading dot) so each row yields its own name
            item['name'] = row.xpath('.//td/small/text()').extract_first()
            # Yield inside the loop so every row produces an item
            yield item

        # Extract the link's href attribute rather than the whole <a> element
        next_page_url = response.xpath('//html/body/section[2]/div[2]/div/div/div/div[3]/a/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(url=response.urljoin(next_page_url), callback=self.parse)

But in the end I still get the following error:

[<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]

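If I want a CSV file with the names of all the companies, how can I achieve that?

What am I doing wrong? I mean, is this website simply blocking my scraping attempts?

Edit: My best guess is that the site blocks all web crawlers. I tried using a different IP address, but it did not help.

By the way: if you downvote this question, please don't hesitate to write down why :)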

Answer 1

Yes, the site may be blocking you.

Try enabling Scrapy's AutoThrottle feature so that you don't hit the site too hard.
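As a rough sketch of what enabling AutoThrottle could look like, the dictionary below uses Scrapy's standard AutoThrottle setting names. The numeric values and the name AUTOTHROTTLE_SETTINGS are only illustrative assumptions, not values known to work for this site.

```python
# Illustrative values only; tune them for the target site.
AUTOTHROTTLE_SETTINGS = {
    # Turn the AutoThrottle extension on
    "AUTOTHROTTLE_ENABLED": True,
    # Initial download delay, in seconds
    "AUTOTHROTTLE_START_DELAY": 5,
    # Upper bound for the delay when the server responds slowly
    "AUTOTHROTTLE_MAX_DELAY": 60,
    # Average number of parallel requests to aim for per remote server
    "AUTOTHROTTLE_TARGET_CONCURRENCY": 1.0,
    # Minimum delay between requests; AutoThrottle never goes below it
    "DOWNLOAD_DELAY": 1,
}
```

These keys can go into the project's settings.py as plain variables, or be merged into a spider's custom_settings dictionary.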

You can also try setting the user agent to a different value. For example:

custom_settings = {
    'DEFAULT_REQUEST_HEADERS': {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36',
    }
}

If none of this helps, consider using a proxy or a VPN.
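For the proxy route, Scrapy's built-in HttpProxyMiddleware reads the proxy address from the "proxy" key of a request's meta dictionary. Here is a minimal sketch of building such a dictionary; the helper name with_proxy and the proxy address are hypothetical placeholders.

```python
def with_proxy(meta, proxy_url):
    """Return a copy of a request meta dict with the 'proxy' key set.

    Hypothetical helper: Scrapy's HttpProxyMiddleware picks up the
    'proxy' key when the resulting dict is passed as Request(meta=...).
    """
    new_meta = dict(meta or {})  # copy so the caller's dict is not mutated
    new_meta["proxy"] = proxy_url
    return new_meta


# Placeholder address; replace it with a proxy you actually control.
example_meta = with_proxy({"item": None}, "http://proxy.example:8080")
```

The result would then be passed along as scrapy.Request(url, meta=example_meta, callback=...).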
