Question

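I'm still learning Python, and this is my first attempt at accessing a website and scraping some information for myself. I'm trying to get a feel for the language, so any input is welcome.

The data below is what I see from the page source. I have to visit one page to enter my login name. Once that succeeds, I am redirected to another page to enter my password. I am trying to make these POSTs with Python requests. I have to get through two pages before I can scrape the information on the third page, but I can only get past the first page of the login.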

Below are the headers and POST information sent for the username page.

For the USERNAME PAGE:

(Request-Line)  
POST /client/factor2UserId.recip;jsessionid=15AD9CDEB48362372EFFC268C146BBFC HTTP/1.1
Host    www.card.com
User-Agent  Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0
Accept  text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language en-US,en;q=0.5
Accept-Encoding gzip, deflate
DNT 1
Referer https://www.card.com/client/
Cookie  JSESSIONID=15AD9CDEB48362372EFFC268C146BBFC
Connection  keep-alive
Content-Type    application/x-www-form-urlencoded
Content-Length  13

Post Data: 
login,  USERLOGIN

Below are the headers and POST information sent for the password page:

For the PASSWORD PAGE:
(Request-Line)  
POST /client/siteLogonClient.recip HTTP/1.1
Host    www.card.com
User-Agent  Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0
Accept  text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language en-US,en;q=0.5
Accept-Encoding gzip, deflate
DNT 1
Referer https://www.card.com/client/factor2UserId.recip;jsessionid=15AD9CDEB48362372EFFC268C146BBFC
Cookie  JSESSIONID=15AD9CDEB48362372EFFC268C146BBFC
Connection  keep-alive
Content-Type    application/x-www-form-urlencoded
Content-Length  133

Post Data: 
org.apache.struts.taglib.html.TOKEN,     583ed0aefe4b04b
login,  USERLOGIN
password, PASSWORD

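This is what I came up with, but I can only get to the first page. After I call second_pass(), I am redirected back to the first page.

With first_pass() I receive response code 200. I receive the same code from second_pass(), but if I print the text of the page, it has sent me back to the first page. I never successfully reach the third page.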

import requests
import re

response = None
r = None

payload = {'login' : 'USERLOGIN'}
# accesses the username screen and submits the username
# give login name
def first_pass():
    global response
    global payload
    url = 'https://www.card.com/client/factor2UserId.recip'
    s = requests.Session()
    response = s.post(url, payload)
    return response.status_code


# merges payload with x that contains user password
def second_pass():
    global payload
    global r
    # global response
    x = {'password' : 'PASSWORD'} # I add the Password in this function cause I am not sure if it will fail the first_pass()
    payload.update(x)
    url = 'https://www.card.com/client/siteLogonClient.recip'
    r = requests.post(url, payload)
    return payload
    return r.status_code



# searches response for Token!
# if token found merges key:value into payload
def token_search():
    global response
    global payload
    f = response.text

    # uses regex to find the Token from the HTML
    patFinder2 = re.compile(r"name=\"(org.apache.struts.taglib.html.TOKEN)\"\s+value=\"(.+)\"",re.I)
    findPat2 = re.search(patFinder2, f)

    # if the Token is found it turns it into a dictionary and prints the dictionary
    # if no Token is found it prints "No Token Found"
    if(findPat2):
        newdict = dict(zip(findPat2.group(1).split(), findPat2.group(2).split()))
        payload.update(newdict)
        print payload
    else:
        print "No Token Found"

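For now I call my functions from the shell, in this order: first_pass(), token_search(), second_pass().

When I call token_search(), it updates the dictionary with unicode values. I'm not sure whether that is what causes my error.

Any suggestions on the code are most welcome. I enjoy learning, but at this point I'm hitting a wall.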

Answer 1

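If you are scraping data, I'd suggest looking into a library such as lxml or BeautifulSoup, which collect data from web pages more effectively than regular expressions do.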
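As a rough illustration of parsing instead of pattern matching, here is a minimal sketch that pulls the token out of a hidden form field. It uses the standard library's html.parser rather than lxml or BeautifulSoup, only so that it runs without installing anything; the idea is the same. It is written for Python 3. The names TokenParser and find_token and the sample markup are assumptions made for this example, and the real page may look different.

```python
from html.parser import HTMLParser

TOKEN_NAME = "org.apache.struts.taglib.html.TOKEN"


class TokenParser(HTMLParser):
    """Collects the value of the first <input> whose name is TOKEN_NAME."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # Only look at <input> tags, and stop once a token has been found.
        if tag != "input" or self.token is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == TOKEN_NAME:
            self.token = attributes.get("value")


def find_token(html):
    """Return {TOKEN_NAME: value}, or None if the field is absent."""
    parser = TokenParser()
    parser.feed(html)
    if parser.token is None:
        return None
    return {TOKEN_NAME: parser.token}


# Hypothetical markup for illustration; the real page may differ.
sample = (
    '<form action="/client/siteLogonClient.recip" method="post">'
    '<input type="hidden" value="583ed0aefe4b04b" '
    'name="org.apache.struts.taglib.html.TOKEN">'
    '<input type="text" name="login" value="USERLOGIN">'
    '</form>'
)

if __name__ == "__main__":
    print(find_token(sample))
```

Note that the sample puts value before name. A parser does not care about attribute order, whereas a regex that expects name followed by value would miss this field.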

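If the token-finding code works, my suggestion is to rearrange the code like this. It avoids global variables and keeps each variable in the scope where it belongs. It also sends both POSTs through the same requests.Session, so any cookie set by the first response (such as JSESSIONID) goes out with the second request. second_pass() in the question calls requests.post directly, which does not carry that cookie, and that is a likely reason for being sent back to the first page.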

import re
import requests


def login(username, password):
    loginPayload = {'login' : username}
    passPayload = {'password' : password}
    # one Session for both requests, so cookies from the first are sent with the second
    s = requests.Session()

    # POST the username
    url = 'https://www.card.com/client/factor2UserId.recip'
    postData = loginPayload.copy()
    response = s.post(url, postData)
    if response.status_code != requests.codes.ok:
        raise ValueError("Bad response in first pass %s" % response.status_code)
    postData.update(passPayload)
    tokenParam = token_search(response.text)
    if tokenParam is not None:
        postData.update(tokenParam)
    else:
        raise ValueError("No token value found!")
    # POST with password and the token
    url = 'https://www.card.com/client/siteLogonClient.recip'
    r = s.post(url, postData)
    return r


def token_search(resp_text):
    # uses regex to find the Token from the HTML
    # [^"]+ stops at the closing quote instead of running on to a later one
    patFinder2 = re.compile(r'name="(org\.apache\.struts\.taglib\.html\.TOKEN)"\s+value="([^"]+)"', re.I)
    findPat2 = patFinder2.search(resp_text)

    # if the Token is found, return it as a one-item dictionary
    # if no Token is found, print "No Token Found" and return None
    if findPat2:
        return {findPat2.group(1): findPat2.group(2)}
    else:
        print("No Token Found")
        return None


# call login() only after both functions have been defined
login('USERLOGIN', 'PASSWORD')
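
Why both POSTs should share one session can be seen without touching the real site. The sketch below is a stand-in built on assumptions: it starts a throwaway local server (the hypothetical TwoStepHandler, with made-up /step1 and /step2 paths and a made-up cookie value) that sets JSESSIONID on the first POST and accepts the second POST only if that cookie comes back. It uses the standard library's cookie jar in place of requests.Session so that it runs on Python 3 with no extra packages. requests.Session keeps cookies between calls in a similar way, while a bare requests.post starts with none.

```python
import threading
import urllib.request
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer


class TwoStepHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the two-page login, not the real site."""

    def do_POST(self):
        # Read the request body so nothing is left unread on the socket.
        self.rfile.read(int(self.headers.get("Content-Length", 0)))
        if self.path == "/step1":
            body = b"username accepted"
            self.send_response(200)
            self.send_header("Set-Cookie", "JSESSIONID=abc123; Path=/")
        else:
            has_cookie = "JSESSIONID=abc123" in self.headers.get("Cookie", "")
            # Like the site in the question, answer 200 either way.
            body = b"logged in" if has_cookie else b"back to login page"
            self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), TwoStepHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = "http://127.0.0.1:%d" % server.server_port
    no_proxy = urllib.request.ProxyHandler({})  # talk to localhost directly
    try:
        # One opener with a cookie jar plays the role of a single Session.
        session_like = urllib.request.build_opener(
            no_proxy, urllib.request.HTTPCookieProcessor(CookieJar()))
        session_like.open(base + "/step1", data=b"login=USERLOGIN").read()
        with_jar = session_like.open(
            base + "/step2", data=b"password=PASSWORD").read()

        # An opener with no jar behaves like calling requests.post directly.
        bare = urllib.request.build_opener(no_proxy)
        without_jar = bare.open(
            base + "/step2", data=b"password=PASSWORD").read()
    finally:
        server.shutdown()
        server.server_close()
    return with_jar.decode(), without_jar.decode()


if __name__ == "__main__":
    print(run_demo())
```

Both second-step responses have status 200, which matches what the question describes: the status code alone does not show whether the login went through, so it is worth checking the body or the final URL as well.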

Python POST request on a secure website.
