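I am scraping the National Sex Offender Public Website (https://www.nsopw.gov/en-US/Search/Verification), which requires a verification step before it will run a search. I am using the requests module to talk to a third-party API that solves the reCAPTCHA (this part currently works).

The payload variable contains the fields I believe are needed to complete the verification, but I am no expert on headers, forms and the like, so I have probably missed something.

The problem is that when I GET the search page, the URL that comes back is still the verification page. What am I missing? Do I need to use something like mechanize to submit the form?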
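One way to check whether the payload is missing a field is to list every input the verification form actually submits, hidden ones included. The following is a minimal sketch using only the standard library. FormFieldCollector and list_form_fields are hypothetical helper names, and the sample HTML is made up for illustration; it is not the real NSOPW markup.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect (name, value) pairs for every named <input> tag in a page."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        # Inputs without a name are never submitted, so skip them.
        if attr.get("name"):
            self.fields.append((attr["name"], attr.get("value") or ""))


def list_form_fields(html):
    """Return the named inputs found in an HTML string, in document order."""
    collector = FormFieldCollector()
    collector.feed(html)
    return collector.fields


# Made-up markup for illustration only, NOT the real verification page.
SAMPLE_HTML = """
<form method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="checkbox" name="acceptTerms" value="true">
  <input type="hidden" name="acceptTerms" value="false">
</form>
"""

fields = list_form_fields(SAMPLE_HTML)
print(fields)
```

Against the live site the input would be the text of a GET on the verification URL made with the same session. Any name that shows up in the list but not in payload is a candidate for what the server is rejecting.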
from time import sleep  # needed by the polling loop below

import requests
API_KEY = '*****' # Hidden API key
site_key = '6Lf3ew4UAAAAAFlPnGmxOJZjjZHSZBuHDIE0yidt' # g-site-key
url = 'https://www.nsopw.gov/en-US/Search/Verification' # verification url
# ========== Submit the reCAPTCHA to 2captcha and poll until the answer is ready ========== #
s = requests.Session()
captcha_id = s.post("http://2captcha.com/in.php?key={}&method=userrecaptcha&googlekey={}&pageurl={}".format(API_KEY, site_key, url)).text.split('|')[1]
recaptcha_answer = s.get("http://2captcha.com/res.php?key={}&action=get&id={}".format(API_KEY, captcha_id)).text
print("Solving reCAPTCHA...")
# 'CAPCHA_NOT_READY' is the literal status string 2captcha returns (their spelling)
while 'CAPCHA_NOT_READY' in recaptcha_answer:
    sleep(5)
    recaptcha_answer = s.get("http://2captcha.com/res.php?key={}&action=get&id={}".format(API_KEY, captcha_id)).text
print('reCAPTCHA solved!')
recaptcha_answer = recaptcha_answer.split('|')[1]
# ========== Attempt to submit the form data to reach the search URL. The g-recaptcha-response value is used here. ========== #
headers = { 'user-agent': 'Mozilla/5.0 Chrome/52.0.2743.116 Safari/537.36'}
payload = { 'acceptTerms': 'true', 'acceptTerms': 'false', 'g-recaptcha-response': recaptcha_answer }
response = s.post(url, headers=headers, data=payload)
search = s.get("https://www.nsopw.gov/en-us/search/")
print(search.url)  # Expected the search URL, but this prints the verification page (https://www.nsopw.gov/en-us/search/verification/)
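One thing that stands out in payload: a Python dict literal keeps only the last value for a repeated key, so the POST above carries acceptTerms=false alone. If the form really does submit that field twice (an assumption, not verified against the site), a list of tuples preserves both values, and requests accepts such a list for its data argument. The sketch below uses urllib.parse.urlencode from the standard library to show the difference in the encoded body; the token is a placeholder, not a real reCAPTCHA answer.

```python
from urllib.parse import urlencode

# A dict literal silently drops the first of two identical keys.
as_dict = {'acceptTerms': 'true', 'acceptTerms': 'false'}

# A list of (name, value) tuples keeps every pair, in order.
as_pairs = [
    ('acceptTerms', 'true'),
    ('acceptTerms', 'false'),
    ('g-recaptcha-response', 'PLACEHOLDER_TOKEN'),  # placeholder value
]

dict_body = urlencode(as_dict)
pairs_body = urlencode(as_pairs)

print(dict_body)
print(pairs_body)
```

Passing as_pairs as data in s.post(url, headers=headers, data=as_pairs) would send both acceptTerms values. Whether that alone gets past the verification page is untested here.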