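How to get past Cloudflare and reCAPTCHA to fetch page content

Time: 2018-07-13 09:43:23

Tags: python web-scraping beautifulsoup

I want to fetch a page through a proxy. I request the page with cfscrape and get past Cloudflare (the first "challenge"), but then the page asks me to pass a reCAPTCHA to check whether I am human. That is where I am stuck. I think I need to pass the user agent and cookies (or maybe there is a bug in my code), and I don't know how to do that.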

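    import cfscrape
    import requests
    from bs4 import BeautifulSoup

    def fetch_page(proxy_list, use_proxies=True):  # wrapper name is illustrative
        link = "https://www.oneblockdown.it/en/footwear-sneakers/adidas/men-unisex/adidas-originals-yeezy-boost-350-v2/9438"
        # get_proxy is my own helper: I get proxies from a file...
        proxies = get_proxy(proxy_list) if use_proxies else None
        scraper = cfscrape.create_scraper()  # returns a CloudflareScraper instance

        # Set the user agent on the session so that every request, including
        # the Cloudflare challenge, is sent with the same one.
        scraper.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36"
        })
        try:
            if use_proxies:
                print("[Proxy]: " + proxies['http'])
            # The request is made whether or not a proxy is used.
            r = scraper.get(link, proxies=proxies)
        except requests.exceptions.RequestException:
            print("Connection to URL <" + link + "> failed.")
            return

        soup = BeautifulSoup(r.text, 'html.parser')
        print(soup.prettify())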

The response printed by the last call looks like this:

'''

<script src="https://www.google.com/recaptcha/api.js?hl=" type="text/javascript">
  </script>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/1.11.3/jquery.js" type="text/javascript">
  </script>
 </head>
 <body>
  <div class="g-recaptcha" data-callback="getCaptchaResult" data-sitekey="6Le49hgUAAAAAIv3wrILeXIrOSdM3_5oxK4C6m48" data-size="invisible">
  </div>
  <script type="text/javascript">
   window.onload = function () { grecaptcha.execute(); };
function getCaptchaResult(response) {
    $.post("/index.php", {action: "captcha_verify", captcha: response, version: 37}, function(result){
        var timeout = result ? 0 : 2500;
        setTimeout(function() {
            window.location.reload();
        }, timeout);
    });
}
  </script>
  <script type="text/javascript">
   window.NREUM||(NREUM={});NREUM.info={"beacon":"bam.nr-data.net","licenseKey":"97b599ea8e","applicationID":"23522071","transactionName":"YFxXbENSCxEFUhVfWlkWdk1CRwoPS1cOWUFAXFRKHEALBwVaBERGGFhRUVVSFg==","queueTime":0,"applicationTime":54,"atts":"TBtUGgtIGB8=","errorBeacon":"bam.nr-data.net","agent":""}
  </script>
 </body>
</html>

'''
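
To make the symptom easier to check in code, here is a minimal sketch, assuming only the HTML shown above, that detects when the response is the reCAPTCHA interstitial rather than the product page. It uses the standard-library `html.parser`; `RecaptchaDetector` and `find_recaptcha` are hypothetical names of my own, not part of cfscrape or BeautifulSoup. It only recognises the challenge page and does not solve it.

```python
from html.parser import HTMLParser


class RecaptchaDetector(HTMLParser):
    """Collects the data-* attributes of the first g-recaptcha element."""

    def __init__(self):
        super().__init__()
        self.widget = None  # stays None when no reCAPTCHA widget is found

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Attribute values can be None (valueless attributes), hence the "or".
        classes = (attr_map.get("class") or "").split()
        if self.widget is None and "g-recaptcha" in classes:
            self.widget = {
                "sitekey": attr_map.get("data-sitekey"),
                "size": attr_map.get("data-size"),
                "callback": attr_map.get("data-callback"),
            }


def find_recaptcha(html):
    """Return the widget attributes, or None if the page has no reCAPTCHA."""
    detector = RecaptchaDetector()
    detector.feed(html)
    return detector.widget


# Trimmed copy of the response shown above.
sample = '''
<body>
  <div class="g-recaptcha" data-callback="getCaptchaResult" data-sitekey="6Le49hgUAAAAAIv3wrILeXIrOSdM3_5oxK4C6m48" data-size="invisible">
  </div>
</body>
'''

widget = find_recaptcha(sample)
if widget is not None:
    print("reCAPTCHA interstitial, sitekey:", widget["sitekey"])
```

In the script, `find_recaptcha(r.text)` could be called before parsing, so that a challenge page is reported as such instead of being treated as the product page.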

I need to confirm that I am human. How do I deal with this challenge?

0 answers:

No answers