How to scrape a password-protected website

Posted: 2019-05-24 08:52:26

Tags: python web-scraping python-requests

I'm having trouble scraping a password-protected website. I know there are plenty of questions about this already, but none of them solves my problem.

The trouble is that I don't even know what the problem is. I do get a 200 response from their server, but it's not the content I expect. It is a large HTML structure, but it contains words like "Access", "RequestURLDenied", "password", "help" and "login", which suggests my login attempt isn't working. I don't know what I should change. Does anyone have experience scraping sites like this?
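Just to make it concrete, this is the kind of rough check I have in mind for deciding whether the login worked; the marker words below are only an assumption based on what I see in my own response, not anything official from the site:

# Rough heuristic: if the response body still contains login-related words,
# assume the authentication did not go through. The marker list is only a
# guess based on the words I saw in my own response.
FAILURE_MARKERS = ["RequestURLDenied", "password", "login"]

def looks_logged_in(response_text):
    return not any(marker in response_text for marker in FAILURE_MARKERS)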

This is my code so far (taken from here):

import requests
from lxml import html

USERNAME = "XXX"
PASSWORD = "XXX"
LOGIN_URL = "https://signin.lexisnexis.com/lnaccess/app/signin?back=https%3A%2F%2Fadvance.lexis.com%3A443%2Fnexis-uni%2Flaapi%2Fpermalink%2F35a8b8d7-925d-4219-b89d-af27c10a7a31%2F%3Fcontext%3D1516831&aci=nu"
LOGIN_URL2 = "https://signin.lexisnexis.com:443/lnaccess/Transition?aci=nu"
URL = "https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:7XM6-WXH0-Y9M6-H1V0-00000-00&context=1516831"

def main():
    # Create session
    session = requests.session()

    # Get login cookies
    session.get(LOGIN_URL)

    # Create payload - used to log into password protected area
    login_data = {
        "rmtoken": "dummy", 
        "request_id": "null", 
        "OAM_REQ": "null", 
        "userid": USERNAME,
        "password": PASSWORD,  
        "rmflag": "0", 
        "aci": "nu"
    }

    # Perform login
    session.post(LOGIN_URL, data=login_data)

    # Scrape url
    result = session.get(URL)

    # Content
    print(result.content)


if __name__ == '__main__':
    main()
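To narrow down where it goes wrong, I could also print some metadata right after the login POST. This is only a diagnostic sketch; the attributes used (status_code, url, history, cookies) are all standard on requests objects:

# Diagnostic sketch: inspect what the login POST actually returned.
def debug_login(session, response):
    print("Status:", response.status_code)
    print("Final URL:", response.url)                      # where redirects ended up
    print("Redirect chain:", [r.url for r in response.history])
    print("Session cookies:", session.cookies.get_dict())

In main() this could be called with the response returned by the login POST.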

This is what the response looks like when I run the script:

[screenshot: script output]

Another question: assuming I can log in from my code and then make thousands of requests to their server to extract text, could that cause problems for their server? :D
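If it does, I would probably add a small delay between requests, roughly like this (the one-second pause is an arbitrary value on my part, not a documented limit):

import time

# Sketch of polite scraping: pause between requests so thousands of calls
# do not hammer the server. The 1-second delay is an arbitrary choice.
def fetch_all(session, urls, delay=1.0):
    results = []
    for url in urls:
        results.append(session.get(url))
        time.sleep(delay)
    return results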

1 Answer:

Answer 0 (score: 0):

All of your code looks correct; you just made a couple of mistakes with the URL you send the POST request to, and the payload you used was incomplete.

Try the following code:

import requests
from lxml import html
from lxml.etree import tostring

USERNAME = "XXX"
PASSWORD = "XXX"
LOGIN_URL = "https://signin.lexisnexis.com/lnaccess/app/signin?back=https%3A%2F%2Fadvance.lexis.com%3A443%2Fnexis-uni%2Flaapi%2Fpermalink%2F35a8b8d7-925d-4219-b89d-af27c10a7a31%2F%3Fcontext%3D1516831&aci=nu"
URL = "https://advance.lexis.com/api/document?collection=news&id=urn:contentItem:7XM6-WXH0-Y9M6-H1V0-00000-00&context=1516831"

def main():
    session_requests = requests.session()

    # Get login cookies
    session_requests.get(LOGIN_URL)

    # Create payload - used to log into the password-protected area
    payload = {
        "rmtoken": "dummy", 
        "request_id": "null", 
        "OAM_REQ": "null", 
        "userid": USERNAME,
        "password": PASSWORD,  
        "rmflag": "0", 
        "aci": "nu"
    }

    # Perform login
    result = session_requests.post("https://signin.lexisnexis.com:443/lnaccess/Transition?aci=nu", data=payload)

    # Scrape url
    result = session_requests.get(URL)
    tree = html.fromstring(result.content)
    # bucket_names = tree.xpath("//div[@class='repo-list--repo']/a/text()")

    print(tostring(tree))

if __name__ == '__main__':
    main()
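Once the page parses correctly, you can pull text out of the tree with XPath. The selector below is only a placeholder, since I don't know the real structure of the document page; swap it for one that matches the logged-in HTML:

# Placeholder XPath: replace "//p/text()" with a selector matching the
# actual page structure once you can see the logged-in HTML.
paragraphs = tree.xpath("//p/text()")
for p in paragraphs:
    print(p.strip())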

Hope this helps.