Question

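I am trying to scrape a website that has a language dropdown button at the top; the default language is English, as shown in the screenshot below.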

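I thought I could crawl the data without trouble, but I then found that the JSON output file had, seemingly at random, started to contain Unicode Arabic characters (I only want the English version of the data).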

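It seems the spider ends up on the Arabic version of the pages: the URL changes and the string '/ar/' is inserted into it. I could manipulate the URL to get back to the English version, but my experiments on the website also show that as soon as any language is selected, a cookie remembers it and every page is then served in that language.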
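For illustration, the URL manipulation mentioned above could look roughly like the sketch below. The helper name `strip_language_segment` is hypothetical, and it assumes the language code shows up as a standalone path segment such as '/ar/'; the exact position of that segment on the real site is not confirmed. It also does nothing about the cookie, so on its own it may not be enough.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_language_segment(url, lang="ar"):
    """Return `url` with any path segment equal to `lang` removed.

    Hypothetical helper: assumes the language code is a whole path
    segment (e.g. '/ar/'), not part of a longer word such as '/arts/'.
    """
    parts = urlsplit(url)
    # Keep every path segment except the ones that are exactly the language code.
    segments = [segment for segment in parts.path.split("/") if segment != lang]
    path = "/".join(segments)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

A call such as `strip_language_segment(response.url)` inside a callback would give the presumed English URL for the same page, which could then be requested again.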


    import scrapy


    class MyExampleSpider(scrapy.Spider):
        name = "my_example"
        start_urls = [
            'https://www.example.org',
        ]

        def parse(self, response):
            # follow links to the individual case pages
            for href in response.xpath('//li/a[re:test(@href, "/.*/causes")]/@href'):
                yield response.follow(href, self.parse_case)

            # follow pagination alphabet links
            for href in response.xpath('//li/a[re:test(@href, "/.*/causes/.*letter=A")]/@href'):
                yield response.follow(href, self.parse)

        def parse_case(self, response):
            yield {
                # .get() returns None instead of raising IndexError when the heading is missing
                'case_name': response.xpath('//h1/a/text()').get(),
                'causes_names': response.xpath('//h2[text()="Causes"]/following-sibling::ul[1]/li/strong/text()').getall(),
            }

Below is a snapshot of the output JSON file.

My question is: why does this language switch happen, and how can I fix it?
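One guess, not verified against the site, is that the pattern "/.*/causes" also matches language-specific links such as '/ar/causes', and that visiting one of them sets the language cookie for every later request. Under that assumption, the sketch below shows two settings dictionaries built from Scrapy's `COOKIES_ENABLED` and `COOKIES_DEBUG` settings. The dictionary names are hypothetical, and whether disabling cookies is acceptable for this site is an assumption.

```python
# Hypothetical settings for illustration; not tested against the real site.

# Log every cookie sent and received, to confirm whether a language
# cookie is really what switches the pages to Arabic.
DIAGNOSE_SETTINGS = {
    "COOKIES_ENABLED": True,
    "COOKIES_DEBUG": True,
}

# Disable the cookie middleware so that a language cookie set by one
# response is never stored or sent with later requests.
NO_COOKIES_SETTINGS = {
    "COOKIES_ENABLED": False,
}
```

Either dictionary could be assigned to the spider's `custom_settings` class attribute, for example `custom_settings = NO_COOKIES_SETTINGS` inside `MyExampleSpider`. Tightening the XPath so that it no longer matches the '/ar/' links would be a complementary change.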

Scrapy crawler cookie problem: the website language changed

0 answers: