Python web scraping with BeautifulSoup: am I using the correct URL?

Asked: 2018-06-17 17:21:03

Tags: python web-scraping beautifulsoup python-requests

Beginner here. So far I have this code:

import requests
from bs4 import BeautifulSoup

logurl = "https://login.flash.co.za/apex/f?p=pwfone:login"
posturl = 'https://login.flash.co.za/apex/wwv_flow.accept'

with requests.Session() as s:
    s.headers = {"User-Agent":"Mozilla/5.0"}
    res = s.get(logurl)
    soup = BeautifulSoup(res.text,"lxml")

    values = {
        'p_flow_id': soup.select_one("[name='p_flow_id']")['value'],
        'p_flow_step_id': soup.select_one("[name='p_flow_step_id']")['value'],
        'p_instance': soup.select_one("[name='p_instance']")['value'],
        'p_page_submission_id': soup.select_one("[name='p_page_submission_id']")['value'],
        'p_request': 'LOGIN',
        'p_arg_names': soup.select_one("[name='p_arg_names']")['value'],
        'p_t01': 'username',
        'p_arg_names': soup.select_one("[name='p_arg_names']")['value'],
        'p_t02': 'password',
        'p_md5_checksum': soup.select_one("[name='p_md5_checksum']")['value'],
        'p_page_checksum': soup.select_one("[name='p_page_checksum']")['value']
    }

    r = s.post(posturl, data=values)
    print(r.content)

logurl = the URL of the login page; posturl = the form action URL, to which the login data is posted.

However, when I run this, the response is the "incorrect password" page even though the credentials I enter are correct.

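When I log in manually to reach the page that contains the data I need, I noticed that its URL is actually the Location URL listed below (from the Network tab in Chrome DevTools, see the screenshot below), which contains the flow_id and instance values from the code: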

Location: https://login.flash.co.za/apex/f?p=1500:1:9004571425464

Request URL: https://login.flash.co.za/apex/wwv_flow.accept

Referer: https://login.flash.co.za/apex/f?p=pwfone:login

[Screenshot: Chrome Network tab]

Shouldn't I be trying to POST to this URL instead of the Request URL?

Edit 1:

import requests
from bs4 import BeautifulSoup

logurl = "https://login.flash.co.za/apex/f?p=pwfone:login"
posturl = 'https://login.flash.co.za/apex/wwv_flow.accept'

with requests.Session() as s:
    s.headers = {
            "Host": "login.flash.co.za",
            "Connection": "keep-alive",
            "Origin": "https://login.flash.co.za",
            "Upgrade-Insecure-Requests": "1",
            "Content-Type": "application/x-www-form-urlencoded",
            "User-Agent": "Mozilla/5.0 (Windows NT x.y; rv:10.0) Gecko/20100101 Firefox/10.0",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "Referer": "https://login.flash.co.za/apex/f?p=pwfone:login",
            "Accept-Encoding": "gzip, deflate, br",
            "Accept-Language": "en-US,en;q=0.9",
    }
    res = s.get(logurl)
    soup = BeautifulSoup(res.text,"html.parser")

    values = {
        'p_flow_id': soup.select_one("[name='p_flow_id']")['value'],
        'p_flow_step_id': soup.select_one("[name='p_flow_step_id']")['value'],
        'p_instance': soup.select_one("[name='p_instance']")['value'],
        'p_page_submission_id': soup.select_one("[name='p_page_submission_id']")['value'],
        'p_request': 'LOGIN',
        'p_arg_names': soup.select_one("[name='p_arg_names']")['value'],
        'p_t01': 'solar',
        'p_arg_names': soup.select_one("[name='p_arg_names']")['value'],
        'p_t02': 'password',
        'p_md5_checksum': soup.select_one("[name='p_md5_checksum']")['value'],
        'p_page_checksum': soup.select_one("[name='p_page_checksum']")['value']
    }

    r = s.post(posturl, data=values)
    print(r.content)

2 Answers:

Answer 0 (score: 0)

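I intercepted the request in Fiddler.

The URL you are posting to is correct; just set the request headers to match the intercepted request and try logging in again.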
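
To expand on why the POST URL is right: the Location value from the Network tab is presumably the redirect that the server returns after a successful POST to wwv_flow.accept, so it is where you end up, not where you submit. Assuming the usual Oracle APEX convention of f?p=App:Page:Session, here is a minimal sketch for pulling those pieces out of such a URL. parse_apex_url is a hypothetical helper written for this illustration, not part of requests or BeautifulSoup.

```python
from urllib.parse import urlsplit, parse_qs


def parse_apex_url(url):
    """Split the p= query parameter of an APEX-style URL into its parts.

    Assumes the f?p=App:Page:Session layout; missing parts come back as None.
    """
    query = parse_qs(urlsplit(url).query)
    parts = query.get("p", [""])[0].split(":")
    # Pad so that short forms such as "pwfone:login" still unpack cleanly.
    parts += [None] * (3 - len(parts))
    app, page, session = parts[:3]
    return {"app": app, "page": page, "session": session}


location = "https://login.flash.co.za/apex/f?p=1500:1:9004571425464"
print(parse_apex_url(location))
```

With requests, the final URL after any redirects is available as r.url, so comparing it against the login page URL should be an easier success check than reading through r.content.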

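Answer 1 (score: 0)

"p_arg_names" is sent with the same value twice; it should carry two different values. In your code select_one returns only the first matching element both times, and a dict keeps only one entry per key, so only a single p_arg_names value is actually posted. Try passing it as a list like this (completely untested code, since I don't have a username or password):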

import requests
from bs4 import BeautifulSoup

logurl = "https://login.flash.co.za/apex/f?p=pwfone:login"
posturl = 'https://login.flash.co.za/apex/wwv_flow.accept'

with requests.Session() as s:
    s.headers = {"User-Agent":"Mozilla/5.0"}
    res = s.get(logurl)
    soup = BeautifulSoup(res.text,"lxml")

    # Collect the value of every p_arg_names field, not just the first one.
    arg_names = [tag['value'] for tag in soup.select("[name='p_arg_names']")]

    values = {
        'p_flow_id': soup.select_one("[name='p_flow_id']")['value'],
        'p_flow_step_id': soup.select_one("[name='p_flow_step_id']")['value'],
        'p_instance': soup.select_one("[name='p_instance']")['value'],
        'p_page_submission_id': soup.select_one("[name='p_page_submission_id']")['value'],
        'p_request': 'LOGIN',
        'p_t01': 'username',
        'p_arg_names': arg_names,
        'p_t02': 'password',
        'p_md5_checksum': soup.select_one("[name='p_md5_checksum']")['value'],
        'p_page_checksum': soup.select_one("[name='p_page_checksum']")['value']
    }
    s.headers.update({'Referer': logurl})
    r = s.post(posturl, data=values)
    print(r.content)
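
To see why the duplicate key matters, here is a small sketch that uses only the standard library's urlencode to show what ends up in the form body in each case. The values 111 and 222 are made-up placeholders, not the real p_arg_names values from the page. As far as I know requests encodes a list value as repeated fields in much the same way, and it also accepts a list of tuples for data, which keeps the field order if the server turns out to care about it (an assumption on my part; I have not checked this site).

```python
from urllib.parse import urlencode

# A dict literal keeps only the last value assigned to a repeated key.
as_dict = {
    "p_arg_names": "111",   # placeholder value
    "p_t01": "username",
    "p_arg_names": "222",   # silently replaces the first entry
    "p_t02": "password",
}

# A list value keeps both entries, grouped under the one key.
as_list = {
    "p_arg_names": ["111", "222"],
    "p_t01": "username",
    "p_t02": "password",
}

# A list of tuples keeps both entries and the original field order.
as_pairs = [
    ("p_arg_names", "111"),
    ("p_t01", "username"),
    ("p_arg_names", "222"),
    ("p_t02", "password"),
]

encoded_dict = urlencode(as_dict)
encoded_list = urlencode(as_list, doseq=True)
encoded_pairs = urlencode(as_pairs)

print(encoded_dict)
print(encoded_list)
print(encoded_pairs)
```

If the list version still gets rejected, passing the pairs form as data=as_pairs to s.post would be the next thing I would try.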