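Unable to scrape a page that requires login

Time: 2019-07-14 10:21:00

Tags: python web-scraping

I have to scrape data from a website that requires login.

This is the code I am currently using, but it does not return the HTML of the logged-in page.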

from requests import Session
from bs4 import BeautifulSoup as bs

with Session() as s:
    site = s.get("https://www.valueresearchonline.com/membership/getin.asp?ref=%2Fport_v1%2Fdefault%2Easp%3Fselv%3D8%26poid%3D1443091")
    bs_content = bs(site.content, "html.parser")
    token = bs_content.find("input", {"name":"ref"})["value"]
    login_data = {"username":"<username>","password":"<password>","ref":token}
    p = s.post("https://www.valueresearchonline.com/membership/getin.asp?ref=%2Fport_v1%2Fdefault%2Easp%3Fselv%3D8%26poid%3D1443091",login_data)
    print(p.text)

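The HTML I get back is the same as the HTML before logging in. I am also not sure whether this site needs the token part, so I tried once with it and once without it, and in both cases the result was the same as described above.

2 answers:

Answer 0: (score: 0)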

Add one more parameter, allow_redirects=True, to this call:

p = s.post("https://www.valueresearchonline.com/membership/getin.asp?ref=%2Fport_v1%2Fdefault%2Easp%3Fselv%3D8%26poid%3D1443091",login_data)

and change the URL to https://www.valueresearchonline.com/registration/loginprocess.asp:

p = s.post("https://www.valueresearchonline.com/registration/loginprocess.asp", data=login_data, allow_redirects=True)

Check whether that works for you.
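
One way to confirm which URL a login form actually submits to is to read the action attribute of the form on the login page. The sketch below uses only the standard library. The sample markup is hypothetical and not copied from the real site, and FormActionParser and form_targets are illustrative names. Note also that requests already follows redirects for POST by default, so the URL change is probably the part that matters.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormActionParser(HTMLParser):
    """Collect the action attribute of every <form> tag in a page."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # attrs is a list of (name, value) pairs. A missing or empty
            # action means the form submits to the page's own URL.
            self.actions.append(dict(attrs).get("action") or "")


def form_targets(page_url, html):
    """Return the absolute URLs that the forms on the page submit to."""
    parser = FormActionParser()
    parser.feed(html)
    return [urljoin(page_url, action) for action in parser.actions]


# Hypothetical markup for illustration only, not taken from the real site.
sample_html = (
    '<form method="post" action="/registration/loginprocess.asp">'
    '<input name="username"><input name="password" type="password">'
    "</form>"
)
page_url = "https://www.valueresearchonline.com/membership/getin.asp"
targets = form_targets(page_url, sample_html)
print(targets)
```

With the session from the question, the same helper could be called as form_targets(site.url, site.text) to see where the real form posts.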

Answer 1: (score: 0)

Put your email and password in as the values of payload['username'] and payload['password']; I think that will log you in.

import requests
from bs4 import BeautifulSoup

url = "https://www.valueresearchonline.com/membership/getin.asp"
post_url = "https://www.valueresearchonline.com/registration/loginprocess.asp"

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0'
    site = s.get(url)
    soup = BeautifulSoup(site.text, "lxml")
    payload = {item['name']:item.get('value','') for item in soup.select('input[name]')}
    payload['username'] = 'your email'
    payload['password'] = 'your password'
    p = s.post(post_url,data=payload)
    print(p.text)
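
To tell whether the POST really logged you in, rather than comparing HTML by eye, one rough heuristic is to check whether the response still contains a password field. This is a minimal sketch under that assumption: it presumes the login form has an input of type password and that pages served after a successful login do not. PasswordFieldFinder and still_shows_login_form are illustrative names, not part of requests or BeautifulSoup.

```python
from html.parser import HTMLParser


class PasswordFieldFinder(HTMLParser):
    """Set self.found to True if the page has an <input type="password">."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; the value is
        # lowercased here so that TYPE="PASSWORD" is matched as well.
        if tag == "input" and (dict(attrs).get("type") or "").lower() == "password":
            self.found = True


def still_shows_login_form(html):
    """Heuristic (assumption): a password field means we are not logged in."""
    finder = PasswordFieldFinder()
    finder.feed(html)
    return finder.found


# Hypothetical pages for illustration only.
login_page = '<form><input name="username"><input type="password" name="password"></form>'
member_page = "<h1>My portfolio</h1><a href='/logout'>Log out</a>"
print(still_shows_login_form(login_page))
print(still_shows_login_form(member_page))
```

With the code above, still_shows_login_form(p.text) gives a quick check, and p.url together with p.history shows which redirects the session followed after the POST.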