Python requests module unable to get URL after verification process

Time: 2019-03-14 23:07:52

Tags: python http-headers python-requests

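I'm scraping data from the "National Sex Offender" website (https://www.nsopw.gov/en-US/Search/Verification), which requires a verification step before a search can be performed. I'm using the requests module together with a third-party API to solve the reCAPTCHA (that part currently works).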

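The payload variable contains the fields I believe are needed to complete the verification, but since I'm no expert on headers/forms/etc., I may be missing something.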
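
One thing worth checking in a payload like this is how repeated field names get encoded. A Python dict literal that repeats a key keeps only the last value, whereas a list of (name, value) tuples, which requests also accepts for `data`, keeps every entry. The sketch below uses only the standard library to show the difference; `TOKEN` is a placeholder, and whether the site actually expects both `acceptTerms` values is an assumption, not something confirmed here.

```python
from urllib.parse import urlencode

token = "TOKEN"  # placeholder standing in for the solved g-recaptcha-response

# A dict literal with a repeated key silently keeps only the last value.
as_dict = {"acceptTerms": "true", "acceptTerms": "false", "g-recaptcha-response": token}

# A list of (name, value) pairs preserves both entries and their order.
as_pairs = [
    ("acceptTerms", "true"),
    ("acceptTerms", "false"),
    ("g-recaptcha-response", token),
]

dict_body = urlencode(as_dict)
pairs_body = urlencode(as_pairs)

print(dict_body)   # acceptTerms=false&g-recaptcha-response=TOKEN
print(pairs_body)  # acceptTerms=true&acceptTerms=false&g-recaptcha-response=TOKEN
```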

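The problem is that when I GET the search page, the URL returned is still the verification page. What am I missing? Do I need to use something like mechanize to submit the form?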
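
One possible gap, offered as a guess rather than a diagnosis, is that the verification form carries hidden inputs that a browser submits automatically. A minimal sketch of collecting them with the standard library's `html.parser` follows. The markup is hypothetical (the field name `example_token` is invented and not taken from the real page), and `hidden_fields` is a helper name made up for this illustration.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect (name, value) pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.fields.append((attributes["name"], attributes.get("value") or ""))


def hidden_fields(html):
    """Return the hidden form fields found in an HTML string, in document order."""
    collector = HiddenInputCollector()
    collector.feed(html)
    return collector.fields


# Hypothetical markup, NOT copied from the real verification page.
sample_html = """
<form method="post" action="/en-US/Search/Verification">
  <input type="hidden" name="example_token" value="abc123">
  <input type="checkbox" name="acceptTerms" value="true">
  <input type="hidden" name="acceptTerms" value="false">
</form>
"""

print(hidden_fields(sample_html))
```

In a real run, the HTML would come from `s.get(url).text` on the same session, and the collected pairs could be merged into the payload before posting.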

from time import sleep  # needed for the polling loop below

import requests

API_KEY = '*****'  # Hidden API key
site_key = '6Lf3ew4UAAAAAFlPnGmxOJZjjZHSZBuHDIE0yidt' # g-site-key
url = 'https://www.nsopw.gov/en-US/Search/Verification' # verification url

# ========== Solve the reCAPTCHA via 2captcha and return its value ========== #
s = requests.Session()

captcha_id = s.post("http://2captcha.com/in.php?key={}&method=userrecaptcha&googlekey={}&pageurl={}".format(API_KEY, site_key, url)).text.split('|')[1]
recaptcha_answer = s.get("http://2captcha.com/res.php?key={}&action=get&id={}".format(API_KEY, captcha_id)).text

print("Solving reCAPTCHA...")
while 'CAPCHA_NOT_READY' in recaptcha_answer:
    sleep(5)
    recaptcha_answer = s.get("http://2captcha.com/res.php?key={}&action=get&id={}".format(API_KEY, captcha_id)).text

print('reCAPTCHA solved!')

recaptcha_answer = recaptcha_answer.split('|')[1]

# ========== Attempt to submit the form data to get to the search URL. The g-recaptcha-response value is used here. ========== #
headers = {'user-agent': 'Mozilla/5.0 Chrome/52.0.2743.116 Safari/537.36'}
# A dict literal would keep only the last 'acceptTerms' value, so both are sent as a list of tuples.
payload = [('acceptTerms', 'true'), ('acceptTerms', 'false'), ('g-recaptcha-response', recaptcha_answer)]

response = s.post(url, headers=headers, data=payload)

search = s.get("https://www.nsopw.gov/en-us/search/")
print(search.url)  # Expected the search URL, but this prints the verification page (https://www.nsopw.gov/en-us/search/verification/)
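
To see why the final URL is the verification page, it may help to print every hop that requests followed, since a `Response` keeps its earlier redirects in `.history`. The sketch below is illustrative only: `redirect_chain` is a hypothetical helper, and the stand-in objects (with invented status codes) replace real responses so the example runs without network access.

```python
from types import SimpleNamespace


def redirect_chain(response):
    """Return (status_code, url) for each redirect hop, ending with the final response."""
    hops = [(hop.status_code, hop.url) for hop in response.history]
    hops.append((response.status_code, response.url))
    return hops


# Stand-in objects mimicking the attributes of requests.Response;
# the 302 redirect shown here is invented for illustration.
first_hop = SimpleNamespace(status_code=302, url="https://www.nsopw.gov/en-us/search/")
final = SimpleNamespace(
    status_code=200,
    url="https://www.nsopw.gov/en-us/search/verification/",
    history=[first_hop],
)

for status, hop_url in redirect_chain(final):
    print(status, hop_url)
```

With the real session, calling `redirect_chain(search)` and `redirect_chain(response)` would show whether the POST was accepted and whether the later GET was redirected back to the verification page.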

0 answers:

No answers