Why can't I send cookies to a website using Scrapy or Selenium?

Asked: 2018-05-05 17:28:27

Tags: python cookies web-scraping scrapy scrapy-spider

First of all, forgive me if my question has a very obvious solution. I'm new to web scraping and Scrapy. This would be the third website I scrape (if I can find a solution to the problem below).

What I'm trying to achieve:

I want to get product data from this website: https://www.sanalmarket.com.tr/kweb/sclist/30011-tum-meyveler

However, the products are loaded dynamically depending on the city/district you select after logging in.

So I thought maybe I could log in with my own account, grab the cookies from the request headers, and send them with a Scrapy Request. The problem, I think, is that the website does not accept the cookies I send.
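For reference, one way to turn a raw `Cookie` request header copied from the browser's dev tools into the `{name: value}` dict that `scrapy.Request(cookies=...)` expects is the standard library's `http.cookies` module. A minimal sketch; the header value below is a shortened, made-up example, not the site's real cookies:

```python
from http.cookies import SimpleCookie

def cookie_header_to_dict(header_value):
    """Parse a raw Cookie request header string into a {name: value} dict."""
    jar = SimpleCookie()
    jar.load(header_value)
    return {name: morsel.value for name, morsel in jar.items()}

# Hypothetical header copied from the browser's dev tools
raw = "JSESSIONID=abc123; _ga=GA1.3.219867582.1525198968; district=ac00a4"
cookies = cookie_header_to_dict(raw)
# cookies can then be passed along as Request(url, cookies=cookies)
```

Parsing the header instead of retyping each cookie by hand also avoids accidentally copying stray whitespace into the cookie names.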

I also tried the same procedure with Selenium:

  1. Opened the page

  2. Logged in

  3. Selected the city

  4. Got the cookies (I also pickled them to save for later use with Scrapy, but that didn't work either)

  5. Deleted all cookies on the site

  6. After refreshing the page, sent the cookies from step 4

  7. Again, the site did not accept the cookies.

    Note: since I need to scrape all categories on the site every day, I need a fast scraping solution like Scrapy, so scraping with Selenium is not an option for me.
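Steps 4 and 6 above can be sketched like this. The cookie values are made-up placeholders; in the real run `selenium_cookies` would come from Selenium's `driver.get_cookies()`:

```python
import pickle

# Hypothetical output of driver.get_cookies() in Selenium (step 4 above):
# a list of dicts, each with at least 'name' and 'value' keys
selenium_cookies = [
    {"name": "JSESSIONID", "value": "abc123", "domain": ".sanalmarket.com.tr"},
    {"name": "district", "value": "ac00a4", "domain": ".sanalmarket.com.tr"},
]

# Save them for later use with Scrapy (the "pickled them" part of step 4)
with open("cookies.pkl", "wb") as f:
    pickle.dump(selenium_cookies, f)

# Later, in the Scrapy spider: load and convert to the flat {name: value}
# shape that scrapy.Request(cookies=...) expects
with open("cookies.pkl", "rb") as f:
    saved = pickle.load(f)
scrapy_cookies = {c["name"]: c["value"] for c in saved}
```

Note that Scrapy's `Request` takes either this flat dict or the list-of-dicts form, but the list form must use Scrapy's key names, so converting explicitly keeps things unambiguous.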

    Below are some logs and screenshots illustrating my problem.

    [Screenshot: request URL and method]

    [Screenshot: request headers and cookie info]

    [Screenshot: data preview after I logged in and chose a city (note the 'sid:1885', this is the store ID I want to scrape)]

    [Screenshot: the output of the view(response) line from Scrapy]


    First lines of the log:

    scrapy shell https://www.sanalmarket.com.tr/kweb/sclist/30011-tum-meyveler
    from scrapy import Request
    mycookie = {'JSESSIONID': 'yndMqXswzQYeUw1CsLtp9A0GBI7ZZE0yI1W0zPk4u4JJxpZES8RF!-1577658491',
                'NSC_wjq_dt_iuuq_lbohvsvn_lxfc': '756ca3c16479c6cdde0681fa2edb1040d4786c1c0a6b2f3116d5fc7f605b4631d4d0f199',
                '_dc_gtm_UA-1547459-1': '1',
                '_ga': 'GA1.3.219867582.1525198968',
                '_gat_UA-1547459-1': '1',
                '_gid': 'GA1.3.1499846526.1525198968',
                'current-currency': 'TRY',
                'customer': 'ggB2MTVRWi76tWJwj2ZvbDa896G27N3YaH',
                'district': 'ac00a4001701ce63cc30626def',
                'first-permission-impression': '1',
                'ins-gaSSId': 'cbf3cd92-3c71-e321-30ac-b2d89dbf3826_1525528747',
                'insIsUserLoggedIn': '1',
                'insTotalCartAmount187': '194.96',
                'insUserDetails': '%22muharrem.akkaya96%40gmail.com%22',
                'insdrSV': '285',
                'scs': '%7B%22t%22%3A1%7D',
                'spUID': '15251989688268402d4dc11.7edd9701',
                'total-cart-amount': '120.78'}
    req = Request('https://www.sanalmarket.com.tr/kweb/getProductList.do?shopCategoryId=30011', cookies=mycookie)
    fetch(req)
    view(response)
    

    Remaining lines of the log:

    2018-05-05 19:11:02 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: seleniumcrawler)
    2018-05-05 19:11:03 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:19:30) [MSC v.1500 32 bit (Intel)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.16299
    2018-05-05 19:11:03 [scrapy.crawler] INFO: Overridden settings: {'COOKIES_DEBUG': True, 'NEWSPIDER_MODULE': 'seleniumcrawler.spiders', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'SPIDER_MODULES': ['seleniumcrawler.spiders'], 'BOT_NAME': 'seleniumcrawler', 'LOGSTATS_INTERVAL': 0, 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36', 'FEED_EXPORT_ENCODING': 'utf-8'}
    2018-05-05 19:11:03 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.corestats.CoreStats']
    2018-05-05 19:11:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'seleniumcrawler.middlewares.seleniumcrawlerDownloaderMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2018-05-05 19:11:03 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2018-05-05 19:11:03 [scrapy.middleware] INFO: Enabled item pipelines:
    ['seleniumcrawler.pipelines.JsonPipeline',
     'seleniumcrawler.pipelines.CsvPipeline']
    2018-05-05 19:11:03 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
    2018-05-05 19:11:03 [scrapy.core.engine] INFO: Spider opened
    2018-05-05 19:11:03 [migros] INFO: Spider opened: migros
    2018-05-05 19:11:04 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <200 https://www.sanalmarket.com.tr/kweb/sclist/30011-tum-meyveler>
    Set-Cookie: JSESSIONID=cMTfOnFTK1dPSPF2Qdi0d1EqqCXP3HW0S00BwxOwljYjaOMcAOqE!1083904106; path=/; HttpOnly
    Set-Cookie: NSC_wjq_dt_iuuq_lbohvsvn_lxfc=0933a3df2cf252c6b4bd9a5784157b04f2a0c6e4b29bff73d54a79d474fdc48e85bdc9ec;path=/;secure;httponly
    

    So how can I overcome this cookie situation?

1 Answer:

Answer 0 (score: 0)

It seems the cookies are being sent correctly by your Scrapy code; as far as I can tell, the problem is the cookie value for the key JSESSIONID.

When I created my own session, set my city to "AFYON-Akmescit", and took that session ID, I got sid 1885 for AFYON-Akmescit as expected. But when I tried with your session ID, or with any corrupted session ID (corrupted by randomly changing one character), I got sid 193. So it seems city ID 193 is the default, and what the site rejects is your JSESSIONID value, not the cookie information itself.

In any case, as the other side of the answer: you definitely should not rely on a session ID as a stable identifier while scraping, and you may also want to automate the authentication process.
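As a starting point for automating the login, here is a minimal standard-library sketch that posts a login form once and then carries the freshly issued session cookie into every later request. The form field names (`email`, `password`) and the login URL path are assumptions; check the site's actual login request in the browser's dev tools:

```python
import http.cookiejar
import urllib.parse
import urllib.request

def make_logged_in_opener(login_url, username, password):
    """POST the login form once; the cookie jar then carries the fresh
    session cookie into every later request made through the opener."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    # Field names below are assumed; verify them against the real form
    form = urllib.parse.urlencode(
        {"email": username, "password": password}).encode()
    opener.open(login_url, data=form)  # server answers with Set-Cookie
    return opener, jar
```

In a Scrapy spider the equivalent idea is to start with a login request (e.g. `FormRequest`) and let Scrapy's cookie middleware hold the session, instead of hard-coding a JSESSIONID that will eventually expire.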