Scrapy does not keep cookies for some pages

Time: 2018-11-24 09:15:41

Tags: python cookies web-scraping scrapy scrapy-spider

I am trying to parse articles from the following site: https://derstandard.at/r2000026008978/Wirtschaftspolitik?_chron=t

When you first visit the site, you are prompted to accept cookies. They appear to store the consent as DSGVO_ZUSAGE_V1:true, because it works when I scrape like this:

def start_requests(self):
    urls = [
        'https://derstandard.at/r2000026008978/Wirtschaftspolitik?_chron=t'
    ]
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse,
                              cookies={'DSGVO_ZUSAGE_V1':'true'})

def parse(self, response):
    base_query = '//div[contains(@class,"contentLeft")]/ul[contains(@class, "stories")]/li'
    articles = response.xpath(base_query)
    for index, value in enumerate(articles):
        article_url = validateResponse(response,
                                       base_query + '/div[contains(@class,"text")]/h3/a/@href',
                                       index)
        request = scrapy.Request(response.urljoin(article_url), callback=self.parseSingleArticle, cookies={'DSGVO_ZUSAGE_V1':'true'})
        yield request

def parseSingleArticle(self, response):
    article_content = ''
    article_date = validateResponse(response, "//h6[contains(@class,'info')]/span[contains(@class,'date')]/text()")
    article_title = validateResponse(response, "//h1[contains(@itemprop,'headline')]/text()")
    query = "//div[contains(@class,'copytext')]//child::text()"
    article_content_response = response.xpath(query)
    for index, value in enumerate(article_content_response):
        article_content += " " + validateResponse(response, query, index)
    yield self.article_to_pipeline(article_content, response.url, article_date, article_title)

def article_to_pipeline(self, article_content, url, article_date, article_title):
    article_item = ArticleItem()
    # some other stuff
    return article_item

For some articles, this works perfectly. From the debug output:

2018-11-24 10:01:58 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://derstandard.at/2000091420120/60-Stunden-pro-Woche-Fuer-Unternehmer-ist-das-die-Realitaet>
Cookie: DSGVO_ZUSAGE_V1=true; MGUID=GUID=e2b4bb79-069f-4778-8062-f34d7b3d2b9d&Timestamp=2018-11-24T09:01:52&DetectedVersion=Web&Version=&Hash=4F3A498256857F21390A3DD636588B38

I get everything I want from this article.

However, some articles do not work. For example, this one returns nothing:

2018-11-24 10:02:07 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://derstandard.at/2000090136491/Lufthunderter-fuer-E-Autos-faellt>
Cookie: DSGVO_ZUSAGE_V1=true; MGUID=GUID=e2b4bb79-069f-4778-8062-f34d7b3d2b9d&Timestamp=2018-11-24T09:01:52&DetectedVersion=Web&Version=&Hash=4F3A498256857F21390A3DD636588B38
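To rule out a difference in what Scrapy sent, the two Cookie lines from the debug output can be parsed and compared. This is only a minimal sketch using plain string handling; the helper name parse_cookie_header is made up for illustration and is not part of Scrapy.

```python
def parse_cookie_header(line):
    """Split a 'Cookie: a=1; b=2' log line into a {name: value} dict."""
    line = line.strip()
    if line.startswith("Cookie:"):
        line = line[len("Cookie:"):]
    cookies = {}
    for pair in line.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first "=" only, because the MGUID value contains "=".
        name, _, value = pair.partition("=")
        cookies[name] = value
    return cookies


# Cookie lines copied from the two debug log entries above.
working_line = (
    "Cookie: DSGVO_ZUSAGE_V1=true; MGUID=GUID=e2b4bb79-069f-4778-8062-f34d7b3d2b9d"
    "&Timestamp=2018-11-24T09:01:52&DetectedVersion=Web&Version="
    "&Hash=4F3A498256857F21390A3DD636588B38"
)
failing_line = (
    "Cookie: DSGVO_ZUSAGE_V1=true; MGUID=GUID=e2b4bb79-069f-4778-8062-f34d7b3d2b9d"
    "&Timestamp=2018-11-24T09:01:52&DetectedVersion=Web&Version="
    "&Hash=4F3A498256857F21390A3DD636588B38"
)

same_cookies = parse_cookie_header(working_line) == parse_cookie_header(failing_line)
print(same_cookies)
```

Both requests carried the same two cookies, so whatever differs between the working and the failing article is not visible in the Cookie header that was sent.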

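When I visit this page in Chrome, I also get the "accept cookies" prompt, even though I had already accepted it on other pages of their site. When I accept, they again save the confirmation under DSGVO_ZUSAGE_V1:true and MGUID (deleting these from Chrome brings the "accept cookies" prompt back).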

Does anyone have any ideas? I tried other cookies, but MGUID and DSGVO_ZUSAGE_V1 are the only ones that make a difference.
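One way to narrow this down might be to check which cookies the server tries to set on a failing page and compare that with a working one. The sketch below is a debugging aid, not a fix: set_cookie_names is a hypothetical helper, and the sample header values are invented for illustration rather than captured from the site.

```python
def set_cookie_names(raw_headers):
    """Return the cookie names found in a list of Set-Cookie header values."""
    names = []
    for raw in raw_headers:
        # Scrapy returns header values as bytes; accept str as well.
        text = raw.decode("latin-1") if isinstance(raw, bytes) else raw
        # The first "name=value" segment comes before attributes such as Path.
        first_segment = text.split(";", 1)[0]
        name = first_segment.partition("=")[0].strip()
        if name:
            names.append(name)
    return names


# Hypothetical Set-Cookie values, for illustration only.
example_headers = [
    b"DSGVO_ZUSAGE_V1=true; Path=/",
    "MGUID=GUID=abc&Version=; Path=/",
]
print(set_cookie_names(example_headers))
```

Inside a spider callback the input would come from response.headers.getlist('Set-Cookie'). If the failing article answers with a different set of cookie names than the working one, that would suggest the consent is tracked differently for those pages.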

0 answers:

No answers