I am trying to parse data from the following URL: http://134.209.71.24/ui/attacks/, but I can't because there is a login page at http://134.209.71.24/ui/login/?next=%2F. I am using Python's requests module together with BeautifulSoup.
nikhilh@ubuntu:~/combine$ python -V
Python 2.7.15rc1
I wrote the following code:
import re
import sys
import requests
from bs4 import BeautifulSoup

url = "http://134.209.71.24/ui/attacks/"
url_login = re.sub('attacks', 'login/?next=%2F', url)
print('Need to login into ' + url_login)
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0'}

with requests.Session() as client:
    soup = BeautifulSoup(client.get(url_login).text, 'lxml')
    # Find csrf token value
    csrftoken_field = soup.find_all("input", type="hidden")
    csrftoken_value = csrftoken_field[0]['value']
    login_data = {"email": "valid_email",
                  "passwd": "valid_passwd",
                  "_csrf_token": csrftoken_value}
    # login
    post_result = client.post(url_login, data=login_data, headers=headers)
    status_code = post_result.status_code
    if status_code == 502:
        print("Failed to login into " + url_login + ". Exiting...")
        sys.exit()
    print("Status code: " + str(status_code) + ". Login successful")
    # Get required data from URL
    read_data = client.get(url)
    print(read_data.text)
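After logging in I get a response code of 200, but when I then try to parse http://134.209.71.24/ui/attacks/, I still get the HTML document of the login page. Here is the relevant part of the output: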
Need to login into http://134.209.71.24/ui/login/?next=%2F/
Status code: 200. Login successful
<!doctype html>
...
...
<input id="_csrf_token" name="_csrf_token" type="hidden" value="valid_csrf_token">
<fieldset>
<legend>Log In</legend>
<label>Email</label>
<input id="email" name="email" type="text" />
<label>Password</label>
<input id="passwd" name="passwd" type="password" />
...
...
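Since the login page itself comes back with status 200, the status code alone probably does not show whether the login worked. Below is a minimal sketch of a check on the response body instead. It assumes that a page containing both the `email` and `passwd` inputs (as in the output above) is the login form, and it reads the token from the input named `_csrf_token` rather than from the first hidden input. The names `LoginFormDetector` and `inspect_page` are made up for this illustration, and the standard library parser is used only so the sketch runs without extra packages; the same check could be written with BeautifulSoup.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class LoginFormDetector(HTMLParser):
    """Collects the names of <input> fields and the CSRF token value."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.input_names = set()
        self.csrf_token = None

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <input ... />.
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            self.input_names.add(name)
        if name == "_csrf_token":
            self.csrf_token = attributes.get("value")


def inspect_page(html):
    """Return (is_login_page, csrf_token) for an HTML document."""
    detector = LoginFormDetector()
    detector.feed(html)
    detector.close()
    # Assumption: a page with both of these fields is the login form.
    is_login_page = {"email", "passwd"} <= detector.input_names
    return is_login_page, detector.csrf_token


# Sample built from the fragment of the login page shown in the output.
sample = (
    '<input id="_csrf_token" name="_csrf_token" type="hidden" value="valid_csrf_token">'
    '<input id="email" name="email" type="text" />'
    '<input id="passwd" name="passwd" type="password" />'
)
print(inspect_page(sample))
```

In the script above, calling `inspect_page(post_result.text)` right after the POST and getting `True` as the first element would suggest that the server rejected the login and served the form again, even though the status code was 200.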