Unable to parse a protected page behind a login portal - Python requests module

Time: 2019-03-22 05:31:59

Tags: python python-2.7 beautifulsoup python-requests html-parsing

I am trying to parse data from the URL http://134.209.71.24/ui/attacks/, but I cannot, because there is a login page at http://134.209.71.24/ui/login/?next=%2F. I am using Python's requests module together with BeautifulSoup.

nikhilh@ubuntu:~/combine$ python -V
Python 2.7.15rc1

I wrote the following code:

import re
import sys
import requests
from bs4 import BeautifulSoup

url = "http://134.209.71.24/ui/attacks/"
url_login = re.sub('attacks', 'login/?next=%2F', url)
print('Need to login into ' + url_login)

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:66.0) Gecko/20100101 Firefox/66.0'}

with requests.Session() as client:
    soup = BeautifulSoup(client.get(url_login).text, 'lxml')

    # Find csrf token value
    csrftoken_field = soup.find_all("input", type="hidden")
    csrftoken_value = csrftoken_field[0]['value']
    login_data = {"email": "valid_email",
                  "passwd": "valid_passwd",
                  "_csrf_token": csrftoken_value}

    # login
    post_result = client.post(url_login, data=login_data, headers=headers)

    status_code = post_result.status_code
    if status_code == 502:
        print("Failed to login into " + url_login + ". Exiting...")
        sys.exit()
    print("Status code: " + str(status_code) + ". Login successful")

    # Get required data from URL
    read_data = client.get(url)
    print(read_data.text)
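
One detail in the code above is worth checking: the re.sub call replaces only the word "attacks" and keeps the trailing slash of the original URL, so the login URL it builds ends in "%2F/" rather than "%2F". Whether this is what makes the login fail is not established here. The sketch below, which assumes the login page lives next to the attacks page under /ui/, shows one way to build the URL without the stray slash using urljoin from the standard library:

```python
import re

try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2.7

url = "http://134.209.71.24/ui/attacks/"

# Original approach: the slash after "attacks" survives the substitution.
url_login_resub = re.sub('attacks', 'login/?next=%2F', url)

# Alternative: resolve "../login/" relative to /ui/attacks/.
url_login = urljoin(url, "../login/?next=%2F")

print(url_login_resub)
print(url_login)
```

The second URL matches the login address given at the top of the question; the first does not.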

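After logging in I get a response code of 200, but when I then try to parse http://134.209.71.24/ui/attacks/, I still get the HTML of the login page. Here is the relevant part of the output: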

Need to login into http://134.209.71.24/ui/login/?next=%2F/
Status code: 200. Login successful
<!doctype html>
...
...
    <input id="_csrf_token" name="_csrf_token" type="hidden" value="valid_csrf_token">
    <fieldset>
        <legend>Log In</legend>
        <label>Email</label>
        <input id="email" name="email" type="text" />
        <label>Password</label>
        <input id="passwd" name="passwd" type="password" />
...
...
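
Since the output shows the login form again, a status code of 200 evidently does not prove that the login worked: a server may answer a rejected login with 200 and simply re-render the form. A more direct check is to look at the returned HTML itself. The following is a minimal sketch, not a confirmed fix. The helper names LoginFormDetector and is_login_page are hypothetical, and it assumes that the presence of an input named "passwd" (as in the output above) identifies the login page. It uses only the standard library so that it runs without lxml:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class LoginFormDetector(HTMLParser):
    """Records whether the page contains an <input name="passwd"> element."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.found = False

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <input ... /> also end up here,
        # because the default handle_startendtag calls handle_starttag.
        if tag == "input" and dict(attrs).get("name") == "passwd":
            self.found = True


def is_login_page(html):
    """Return True if the HTML still looks like the login form."""
    detector = LoginFormDetector()
    detector.feed(html)
    return detector.found


sample = '<form><input id="passwd" name="passwd" type="password" /></form>'
print(is_login_page(sample))
```

In the script above, this could be called as is_login_page(post_result.text) right after the POST, replacing the status_code == 502 test. Inspecting post_result.url and post_result.history (both standard attributes of a requests response) may also show whether the server redirected back to the login page.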

0 answers:

No answers yet