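I'm still learning Python, and this is my first attempt at logging in to a website and scraping some information for myself. I'm trying to get a feel for the language, so any input is welcome.
The data below is what I see in the page source. I have to visit one page to enter my login name. Once that succeeds, I am redirected to another page to enter my password. I'm trying to make these POSTs with Python requests. I have to get through two pages before I can scrape information from the third, but I can only get past the first page of the login.
Below are the headers and POST data sent for the username page.
For the USERNAME PAGE: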
(Request-Line)
POST /client/factor2UserId.recip;jsessionid=15AD9CDEB48362372EFFC268C146BBFC HTTP/1.1
Host www.card.com
User-Agent Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0
Accept text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language en-US,en;q=0.5
Accept-Encoding gzip, deflate
DNT 1
Referer https://www.card.com/client/
Cookie JSESSIONID=15AD9CDEB48362372EFFC268C146BBFC
Connection keep-alive
Content-Type application/x-www-form-urlencoded
Content-Length 13
Post Data:
login, USERLOGIN
Below are the headers and POST data sent for the password page:
For the PASSWORD PAGE:
(Request-Line)
POST /client/siteLogonClient.recip HTTP/1.1
Host www.card.com
User-Agent Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0
Accept text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language en-US,en;q=0.5
Accept-Encoding gzip, deflate
DNT 1
Referer https://www.card.com/client/factor2UserId.recip;jsessionid=15AD9CDEB48362372EFFC268C146BBFC
Cookie JSESSIONID=15AD9CDEB48362372EFFC268C146BBFC
Connection keep-alive
Content-Type application/x-www-form-urlencoded
Content-Length 133
Post Data:
org.apache.struts.taglib.html.TOKEN, 583ed0aefe4b04b
login, USERLOGIN
password, PASSWORD
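This is what I have come up with, but I can only reach the first page. After I call second_pass(), I am redirected back to the first page.
With first_pass() I get a response code of 200. I get the same code from second_pass(), but when I print the page text, it shows I have been sent back to the first page. I never make it to the third page.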
import requests
import re

response = None
r = None
payload = {'login' : 'USERLOGIN'}

# accesses the username screen and adds username
# give login name
def first_pass():
    global response
    global payload
    url = 'https://www.card.com/client/factor2UserId.recip'
    s = requests.Session()
    response = s.post(url, payload)
    return response.status_code

# merges payload with x that contains user password
def second_pass():
    global payload
    global r
    # global response
    x = {'password' : 'PASSWORD'} # I add the password in this function because I am not sure if it will fail the first_pass()
    payload.update(x)
    url = 'https://www.card.com/client/siteLogonClient.recip'
    r = requests.post(url, payload)
    return payload
    return r.status_code

# searches response for Token!
# if token found merges key:value into payload
def token_search():
    global response
    global payload
    f = response.text
    # uses regex to find the Token from the HTML
    patFinder2 = re.compile(r"name=\"(org.apache.struts.taglib.html.TOKEN)\"\s+value=\"(.+)\"",re.I)
    findPat2 = re.search(patFinder2, f)
    # if the Token is found it turns it into a dictionary and prints the dictionary
    # if no Token is found it prints "No Token Found"
    if(findPat2):
        newdict = dict(zip(findPat2.group(1).split(), findPat2.group(2).split()))
        payload.update(newdict)
        print payload
    else:
        print "No Token Found"
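For now I am calling my functions from the shell, in this order: first_pass(), token_search(), second_pass().
When I call token_search(), it updates the dictionary with unicode strings. I'm not sure whether that is what causes my error.
Any suggestions on the code are most welcome. I love learning, but at this point I'm hitting a wall.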
Answer 0: (score: 1)
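If you are scraping data, I'd suggest getting to know a library such as lxml or BeautifulSoup, which lets you collect data from web pages more effectively than regular expressions do.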
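To give a rough idea of what parser-based extraction looks like, here is a minimal sketch that uses the standard library's html.parser in place of lxml or BeautifulSoup (those offer a friendlier API for the same job). It assumes Python 3 and assumes the token lives in an input element with a value attribute; the TokenFinder and find_token names and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

TOKEN_NAME = "org.apache.struts.taglib.html.TOKEN"


class TokenFinder(HTMLParser):
    """Remembers the value of the first <input> whose name is TOKEN_NAME."""

    def __init__(self):
        super().__init__()
        self.token = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so attribute order
        # and extra attributes do not matter here
        if tag != "input" or self.token is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == TOKEN_NAME:
            self.token = attributes.get("value")


def find_token(page_text):
    """Return the token value found in page_text, or None if absent."""
    finder = TokenFinder()
    finder.feed(page_text)
    return finder.token


# Hypothetical markup, with value placed before name on purpose
sample = ('<form><input type="hidden" value="583ed0aefe4b04b" '
          'name="org.apache.struts.taglib.html.TOKEN"></form>')
token = find_token(sample)
```

Unlike a regex that expects name to come right before value, this does not depend on the order or spacing of the attributes.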
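If the token lookup code works, then my suggestion is to rearrange the code like this. It avoids global variables and keeps each variable in the scope where it belongs. It also sends both POSTs through the same requests.Session, so cookies set by the first response (such as JSESSIONID) go out with the second request; the original second_pass() called requests.post directly, which starts without those cookies.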
import re
import requests

def login(username, password):
    loginPayload = {'login' : username}
    passPayload = {'password' : password}
    # one Session for both requests
    s = requests.Session()

    # POST the username
    url = 'https://www.card.com/client/factor2UserId.recip'
    postData = loginPayload.copy()
    response = s.post(url, postData)
    if response.status_code != requests.codes.ok:
        raise ValueError("Bad response in first pass %s" % response.status_code)

    postData.update(passPayload)
    tokenParam = token_search(response.text)
    if tokenParam is not None:
        postData.update(tokenParam)
    else:
        raise ValueError("No token value found!")

    # POST with password and the token
    url = 'https://www.card.com/client/siteLogonClient.recip'
    r = s.post(url, postData)
    return r

def token_search(resp_text):
    # uses regex to find the Token from the HTML
    patFinder2 = re.compile(r"name=\"(org.apache.struts.taglib.html.TOKEN)\"\s+value=\"(.+)\"",re.I)
    findPat2 = re.search(patFinder2, resp_text)
    # if the Token is found, return it as a one-item dictionary
    # if no Token is found, print "No Token Found" and return None
    if findPat2:
        newdict = dict(zip(findPat2.group(1).split(), findPat2.group(2).split()))
        return newdict
    else:
        print("No Token Found")
        return None

# call login() only after both functions are defined
login('USERLOGIN', 'PASSWORD')
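One thing worth checking in token_search is the `(.+)` group. It is greedy, so if anything else containing a double quote follows the value on the same line, it captures up to the last quote on that line. Whether the real page is laid out that way is unknown, so the markup below is a made-up fragment; the sketch only shows how a character class that stops at the first closing quote behaves differently.

```python
import re

# Hypothetical fragment: another quoted attribute follows the value
html = ('<input type="hidden" name="org.apache.struts.taglib.html.TOKEN" '
        'value="583ed0aefe4b04b" id="tok">')

# Pattern as used in token_search: .+ runs to the last quote on the line
greedy = re.compile(
    r'name="(org\.apache\.struts\.taglib\.html\.TOKEN)"\s+value="(.+)"', re.I)

# Variant: [^"]+ stops at the first closing quote
strict = re.compile(
    r'name="(org\.apache\.struts\.taglib\.html\.TOKEN)"\s+value="([^"]+)"', re.I)

greedy_value = greedy.search(html).group(2)
strict_value = strict.search(html).group(2)

print(repr(greedy_value))
print(repr(strict_value))
```

If the two printed values differ for your page, the token posted in the second request is not the one the server issued, which would be one possible reason for being sent back to the first page.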