Question

The following code logs in to a page and fetches its source.

import requests
import sys
import urllib, urllib2, cookielib

USERNAME = ''
PASSWORD = ''

URL = 'http://coned.com'

def main():
    # Start a session so we can have persistent cookies
    session = requests.session()
    cj = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

    # This is the form data that the page sends when logging in
    login_data = {
        'TxtUser': USERNAME,
        'TxtPwd': PASSWORD,
        'submit': 'Sign In',
    }

    # Authenticate
    r = session.post(URL, data=login_data)

    # Try accessing a page that requires you to be logged in
    r = session.get('https://apps1.coned.com/cemyaccount/MemberPages/MyAccounts.aspx?lang=eng')
    resp = opener.open('https://apps1.coned.com/cemyaccount/MemberPages/MyAccounts.aspx?lang=eng')
    print resp
    print r.text


if __name__ == '__main__':
    main()

Here r.text does not work; I need the HTML of the page after logging in. Can anyone help me with what to do here?

Answer 1

Opening http://coned.com in Chrome with the Developer Tools pane open, I was able to trace the login attempt shown below. I used testtesttest as the username and test as the password.

Headers:

Request URL: https://apps2.coned.com/cemyaccount/NonMemberPages/Login.aspx?lang=eng
Request Method: POST
Status Code: 200 OK

Data:

TxtUser:testtesttest
UserName:VALUE
UserName:0
TxtPwd:test
UserName2:VALUE
UserName2:0
ctl00$Main$Login1$LoginButton:Sign In

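Knowing this, you should build the form data with the additional parameters and post it to that login URL rather than to the home page: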

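import requests

USERNAME = ''
PASSWORD = ''

URL = 'https://apps2.coned.com/cemyaccount/NonMemberPages/Login.aspx?lang=eng'

# Start a session so that cookies persist between requests
session = requests.Session()

# This is the form data that the page sends when logging in.
# A list of tuples is used instead of a dict because the form repeats
# the 'UserName' and 'UserName2' fields, and a dict would silently keep
# only the last value for each repeated key.
login_data = [
    ('TxtUser', USERNAME),
    ('UserName', 'VALUE'),
    ('UserName', '0'),
    ('TxtPwd', PASSWORD),
    ('UserName2', 'VALUE'),
    ('UserName2', '0'),
    ('ctl00$Main$Login1$LoginButton', 'Sign In'),
]

# Authenticate to the login page
r = session.post(URL, data=login_data)

# r.text now contains the HTML of the page you just requested; in this
# case, the response that the login page redirects to.
# Print every line in which the word "success" appears...
print([line for line in r.text.splitlines() if 'success' in line.lower()])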
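
A note on the repeated fields: the captured form data sends UserName and UserName2 twice each, and a Python dict literal keeps only the last value given for a repeated key. The sketch below uses only the standard library, and `encode_form` is a made-up helper name for illustration. It shows how the encoded body differs between a dict and a list of (name, value) pairs; requests accepts a list of tuples for `data` in the same way.

```python
from urllib.parse import urlencode  # Python 3; on Python 2 use: from urllib import urlencode


def encode_form(fields):
    """Hypothetical helper: URL-encode form fields given as a dict or a list of pairs."""
    return urlencode(fields)


# A dict literal keeps only the last value for a repeated key.
as_dict = {
    'UserName': 'VALUE',
    'UserName': '0',
}

# A list of tuples preserves every pair, in order.
as_pairs = [
    ('UserName', 'VALUE'),
    ('UserName', '0'),
]

dict_body = encode_form(as_dict)
pairs_body = encode_form(as_pairs)
print(dict_body)   # UserName=0
print(pairs_body)  # UserName=VALUE&UserName=0
```

Whether the server actually needs both values for each repeated field is not something the trace tells us; sending them as captured is simply the closest match to what the browser did.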

The site appears to show you the login page again if your login is invalid, with one extra piece of HTML in it:

<span id="ctl00_Main_FailureMsg">Your sign In attempt was not successful. Please try again.  If you have not created your registry information you can register now.</span>
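
Instead of scanning for the word "success", you could look for that element's id in the response. Here is a minimal sketch using only the standard library's HTML parser. The names `FailureMsgFinder` and `login_failed` are made up for illustration, and it assumes the site keeps using the id ctl00_Main_FailureMsg for the failure message and only emits it on a failed login.

```python
from html.parser import HTMLParser  # Python 3; on Python 2 the module is named HTMLParser


class FailureMsgFinder(HTMLParser):
    """Records whether any element with the failure-message id appears."""

    FAILURE_ID = 'ctl00_Main_FailureMsg'  # id taken from the HTML shown above

    def __init__(self):
        HTMLParser.__init__(self)
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if dict(attrs).get('id') == self.FAILURE_ID:
            self.found = True


def login_failed(html):
    """Hypothetical helper: True if the page contains the failure-message element."""
    finder = FailureMsgFinder()
    finder.feed(html)
    return finder.found


sample = '<span id="ctl00_Main_FailureMsg">Your sign In attempt was not successful.</span>'
print(login_failed(sample))
print(login_failed('<span id="other">Welcome</span>'))
```

With the session from the login snippet, the check would be `login_failed(r.text)` on the response to the POST.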

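Finally, you should also consider mechanize or scrapy. Both tools are well documented and are built specifically for the kind of work you are trying to do.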

Hope this points you in the right direction.
