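I am trying to parse articles from the following site: https://derstandard.at/r2000026008978/Wirtschaftspolitik?_chron=t
When you first visit the site, you are prompted to accept cookies. They seem to store the consent in the cookie DSGVO_ZUSAGE_V1:true, because scraping works when I send it like this: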
    def start_requests(self):
        urls = [
            'https://derstandard.at/r2000026008978/Wirtschaftspolitik?_chron=t'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse,
                                 cookies={'DSGVO_ZUSAGE_V1': 'true'})

    def parse(self, response):
        base_query = '//div[contains(@class,"contentLeft")]/ul[contains(@class, "stories")]/li'
        articles = response.xpath(base_query)
        for index, value in enumerate(articles):
            article_url = validateResponse(response,
                                           base_query + '/div[contains(@class,"text")]/h3/a/@href',
                                           index)
            request = scrapy.Request(response.urljoin(article_url),
                                     callback=self.parseSingleArticle,
                                     cookies={'DSGVO_ZUSAGE_V1': 'true'})
            yield request

    def parseSingleArticle(self, response):
        article_content = ''
        article_date = validateResponse(response, "//h6[contains(@class,'info')]/span[contains(@class,'date')]/text()")
        article_title = validateResponse(response, "//h1[contains(@itemprop,'headline')]/text()")
        query = "//div[contains(@class,'copytext')]//child::text()"
        article_content_response = response.xpath(query)
        for index, value in enumerate(article_content_response):
            article_content += " " + validateResponse(response, query, index)
        yield self.article_to_pipeline(article_content, response.url, article_date, article_title)

    def article_to_pipeline(self, article_content, url, article_date, article_title):
        article_item = ArticleItem()
        # some other stuff
        return article_item
For some articles this works perfectly. From the debug log:
2018-11-24 10:01:58 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://derstandard.at/2000091420120/60-Stunden-pro-Woche-Fuer-Unternehmer-ist-das-die-Realitaet>
Cookie: DSGVO_ZUSAGE_V1=true; MGUID=GUID=e2b4bb79-069f-4778-8062-f34d7b3d2b9d&Timestamp=2018-11-24T09:01:52&DetectedVersion=Web&Version=&Hash=4F3A498256857F21390A3DD636588B38
For this article I get everything I want.
However, some articles do not work. For example, this one returns nothing:
2018-11-24 10:02:07 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://derstandard.at/2000090136491/Lufthunderter-fuer-E-Autos-faellt>
Cookie: DSGVO_ZUSAGE_V1=true; MGUID=GUID=e2b4bb79-069f-4778-8062-f34d7b3d2b9d&Timestamp=2018-11-24T09:01:52&DetectedVersion=Web&Version=&Hash=4F3A498256857F21390A3DD636588B38
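To rule out a difference in what Scrapy sends, the two Cookie headers from the log can be compared. Below is a minimal sketch in plain Python; parse_cookie_header is just a helper name I made up for this check, not part of Scrapy, and the header strings are copied from the two log entries above.

```python
def parse_cookie_header(header):
    """Split a Cookie header value into a {name: value} dict.

    Only the first '=' separates name and value, so values that
    themselves contain '=' or '&' (like MGUID) stay intact.
    """
    cookies = {}
    for pair in header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep:
            cookies[name] = value
    return cookies


# Cookie header logged for the article that works
working_header = (
    "DSGVO_ZUSAGE_V1=true; "
    "MGUID=GUID=e2b4bb79-069f-4778-8062-f34d7b3d2b9d"
    "&Timestamp=2018-11-24T09:01:52&DetectedVersion=Web&Version="
    "&Hash=4F3A498256857F21390A3DD636588B38"
)

# Cookie header logged for the article that returns nothing
failing_header = (
    "DSGVO_ZUSAGE_V1=true; "
    "MGUID=GUID=e2b4bb79-069f-4778-8062-f34d7b3d2b9d"
    "&Timestamp=2018-11-24T09:01:52&DetectedVersion=Web&Version="
    "&Hash=4F3A498256857F21390A3DD636588B38"
)

same_cookies = parse_cookie_header(working_header) == parse_cookie_header(failing_header)
print(same_cookies)
```

Both headers parse to the same two cookies, DSGVO_ZUSAGE_V1 and MGUID, with the same values, so from the client side the two requests look identical.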
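When I open that second article in Chrome, I also get the "accept cookies" prompt, even though I had already accepted it on another page of their site. When I accept it, the confirmation is again stored in DSGVO_ZUSAGE_V1:true and MGUID (deleting them from Chrome brings the "accept cookies" prompt back).
Does anyone have an idea? I tried other cookies, but MGUID and DSGVO_ZUSAGE_V1 are the only ones that make a difference.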