Logging in to a website and scraping pages with Python

Time: 2018-12-13 17:09:50

Tags: python beautifulsoup python-requests

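The page I need to scrape data from sits behind a login page. I have tried many ways to do this, but none of them seems to work. Can anyone help? My code is below...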

import requests

from bs4 import BeautifulSoup

headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/70.0.3538.110 Safari/537.36')}
login_data = {
    'appname': 'unknown',
    'appversion': 'unknown',
    'ostype': ('mozilla/5.0 (windows nt 10.0; win64; x64) '
               'applewebkit/537.36 (khtml, like gecko) '
               'chrome/70.0.3538.110 safari/537.36'),
    'type': 'null',
    'ssobypass': 'true',
    'dirlogin': 'true',
    'inch': 'true',
    'scrWidth': '1920',
    'scrHeight': '1040',
    'username': 'TA_KAITM_B_4a',
    'userpassword': ''}

with requests.Session() as s:
    url = "http://cmis.ittdublin.ie"
    r = s.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    r = s.post(url, data=login_data, headers=headers)
    print(r.content)

I am not allowed to add the login screen's HTML here... Below is code that, when run, returns the HTML of the login page...

import requests
from lxml import html

session_requests = requests.session()
login_url = "http://cmis.ittdublin.ie/eportal/index.jsp"
result = session_requests.get(login_url)
payload = {
    "username": "TA_KAITM_B_4a"
}
result = session_requests.post(
    login_url, 
    data = payload, 
    headers = dict(referer=login_url)
)
print(result.text)
url = 'http://cmis.ittdublin.ie/eportal/index.jsp'
result = session_requests.get(
    url, 
    headers = dict(referer = url)
)

1 answer:

Answer 0: (score: 0)

The URL you need to POST to is

http://cmis.ittdublin.ie/eportal/PortalServ?reqtype=login

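I'm fairly confident about this. Whether it gets you anywhere useful depends on what setAdminLoginLocation() does, but it doesn't appear to do anything other than handle admin logins.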
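
As a rough sketch of what that could look like with requests: load the portal page first so the session picks up any cookies, then POST the credentials to the PortalServ URL instead of index.jsp. The field names username and userpassword are taken from the question's login_data. Whether the server also needs the other fields, and the helper names build_login_payload and log_in, are assumptions made for illustration only.

```python
LOGIN_URL = "http://cmis.ittdublin.ie/eportal/PortalServ?reqtype=login"
START_URL = "http://cmis.ittdublin.ie/eportal/index.jsp"


def build_login_payload(username, password):
    """Return the form fields to send (names copied from the question's login_data).

    Assumption: the server may require more fields than these two.
    """
    return {"username": username, "userpassword": password}


def log_in(session, username, password):
    """Fetch the portal page for cookies, then POST the credentials.

    `session` is expected to be a requests.Session; any object with
    compatible get/post methods works as well.
    """
    # The first request lets the session store whatever cookies the portal sets.
    session.get(START_URL)
    response = session.post(
        LOGIN_URL,
        data=build_login_payload(username, password),
        headers={"Referer": START_URL},
    )
    # Raise on 4xx/5xx so a failed request is not mistaken for a login.
    response.raise_for_status()
    return response


# Usage (needs the third-party requests package):
#     import requests
#     with requests.Session() as s:
#         page = log_in(s, "TA_KAITM_B_4a", "")
#         print(page.text[:500])
```

A 200 status alone does not prove the login worked, since the server may simply return the login page again. Check the returned HTML for something that only appears after logging in.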