How to scrape a website that requires logging in, using Python and BeautifulSoup?

Date: 2014-04-16 07:33:30

Tags: python web-scraping beautifulsoup

If I want to scrape a website that requires logging in with a password first, how can I start scraping it with Python using the beautifulsoup4 library? Below is what I do for websites that do not require a login.

from bs4 import BeautifulSoup
import urllib2

url = urllib2.urlopen("http://www.python.org")
content = url.read()
soup = BeautifulSoup(content, "html.parser")    # explicit parser avoids a bs4 warning

How should the code change to accommodate the login? Suppose the website I want to scrape is a forum that requires logging in. An example is http://forum.arduino.cc/index.php

5 Answers:

Answer 0 (score: 44)

You can use mechanize:

import mechanize
import cookielib
from bs4 import BeautifulSoup

# a browser with a cookie jar, so the session cookie set at login
# is sent with every later request
cj = cookielib.CookieJar()
br = mechanize.Browser()
br.set_cookiejar(cj)
br.open("https://id.arduino.cc/auth/login/")

# fill in and submit the first form on the login page
br.select_form(nr=0)
br.form['username'] = 'username'
br.form['password'] = 'password.'
br.submit()

# the response is now the logged-in page; parse it as usual
html = br.response().read()
soup = BeautifulSoup(html, "html.parser")
print soup.title

Or with urllib: Login to website using urllib2
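
For reference, a minimal sketch of that urllib2 approach. This is a sketch only: the /login path and the "username"/"password" field names are assumptions, so inspect the site's actual form before using it.

import urllib
import urllib2
import cookielib

# a cookie jar so the session cookie from the login is reused
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# POSTing the form data logs us in (placeholder URL and field names)
login_data = urllib.urlencode({'username': 'username', 'password': 'password'})
opener.open('https://example.com/login', login_data)

# later requests through the same opener carry the session cookie
print opener.open('https://example.com/protected').read()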

Answer 1 (score: 7)

From my point of view, there is a simpler way to get there without selenium, mechanize, or other third-party tools, although it is semi-automated.

Basically, when you log in to a site in the normal way, you identify yourself in a unique way using your credentials; thereafter, every other interaction uses that same identity, which is stored in cookies and headers for a brief period of time.

What you need to do is use those same cookies and headers when you make your HTTP requests.

To replicate that, follow these steps:

  1. In your browser, open the developer tools
  2. Go to the site and log in
  3. Once logged in, go to the Network tab and then refresh the page.
     At this point, you should see a list of requests, the top one being the actual site. That one will be our focus, because it contains the data with the identity we can use to scrape with Python and BeautifulSoup
  4. Right-click the site request (the top one), hover over Copy, and then click Copy as cURL

  5. Then go to a site that converts cURL commands into Python requests code: https://curl.trillworks.com/
  6. Take that Python code and use the generated headers and cookies to do the scraping, as in the sketch below
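
The converter's output is essentially a requests call with the copied values inlined. A trimmed sketch of the shape it takes (the header and cookie values below are placeholders; the real ones come out of your copied cURL command):

import requests
from bs4 import BeautifulSoup

# placeholders: real values come from the cURL-to-requests conversion
cookies = {'session_id': 'PASTE-YOUR-SESSION-COOKIE'}
headers = {'User-Agent': 'Mozilla/5.0', 'Referer': 'https://example.com/'}

# same identity the browser used, so the site serves the logged-in page
response = requests.get('https://example.com/', headers=headers, cookies=cookies)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.title)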

Answer 2 (score: 3)

You can use selenium to log in and retrieve the page source, which you can then pass to Beautiful Soup to extract the data you want.
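
A minimal sketch of that hand-off (the URL and element IDs here are placeholders, not taken from any particular site):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://example.com/login')    # placeholder URL

# log in through the rendered form (placeholder element IDs)
driver.find_element_by_id('username').send_keys('your_username')
driver.find_element_by_id('password').send_keys('your_password')
driver.find_element_by_id('submit_btn').click()

# hand the logged-in page source to Beautiful Soup for parsing
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.title)
driver.quit()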

Answer 3 (score: 3)

If you go for selenium, you can do something like the following:

from selenium import webdriver

# open Chrome...
driver = webdriver.Chrome()
# ...or Firefox instead:
# driver = webdriver.Firefox()

# load the login page first (URL is a placeholder)
driver.get("https://example.com/login")

# fill in the credentials and submit the form
username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")
username.send_keys("YourUsername")
password.send_keys("YourPassword")
driver.find_element_by_id("submit_btn").click()

However, if you are set on only using BeautifulSoup for the parsing, you can do the login with a library such as requests or urllib. Basically, all you have to do is POST the login data as a payload along with the URL.

import requests
from bs4 import BeautifulSoup

login_url = 'http://example.com/login'
data = {
    'username': 'your_username',
    'password': 'your_password'
}

with requests.Session() as s:
    # POST through the session so the login cookie is kept for later requests
    response = s.post(login_url, data=data)
    print(response.text)
    index_page = s.get('http://example.com')
    soup = BeautifulSoup(index_page.text, 'html.parser')
    print(soup.title)

Answer 4 (score: 2)

Since no Python version was specified, here is my take on it for Python 3, done without any external libraries (StackOverflow). After logging in, use BeautifulSoup as usual, or do any other kind of scraping.

Likewise, the script is on my GitHub here.

The whole script is replicated below, as per StackOverflow guidelines:

# Login to website using just Python 3 Standard Library
import urllib.parse
import urllib.request
import http.cookiejar

def scraper_login():
    ####### change variables here, like URL, action URL, user, pass
    # your base URL here, will be used for headers and such, with and without https://
    base_url = 'www.example.com'
    https_base_url = 'https://' + base_url

    # here goes URL that's found inside form action='.....'
    #   adjust as needed, can be all kinds of weird stuff
    authentication_url = https_base_url + '/login'

    # username and password for login
    username = 'yourusername'
    password = 'SoMePassw0rd!'

    # we will use this string to confirm a login at end
    check_string = 'Logout'

    ####### the rest of the script is logic
    # but you may need to tweak a couple of things regarding the "token" logic
    #   (it can be _token or token or _token_ or secret ... etc.)

    # big thing! you need a referer for most pages! and correct headers are the key
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "User-agent": "Mozilla/5.0 Chrome/81.0.4044.92",    # Chrome 80+ as per web search
        "Host": base_url,
        "Origin": https_base_url,
        "Referer": https_base_url
    }

    # initiate the cookie jar (using : http.cookiejar and urllib.request)
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
    urllib.request.install_opener(opener)

    # first a simple request, just to get login page and parse out the token
    #       (using : urllib.request)
    request = urllib.request.Request(https_base_url)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # parse the page; we look for the token, e.g. on my page it was something like this:
    #    <input type="hidden" name="_token" value="random1234567890qwertzstring">
    #       this could probably be done better with regex and similar,
    #       but I'm a newb, so bear with me
    html = contents.decode("utf-8")
    # text just before start and just after end of your token string
    mark_start = '<input type="hidden" name="_token" value="'
    mark_end = '">'
    # index of those two points
    start_index = html.find(mark_start) + len(mark_start)
    end_index = html.find(mark_end, start_index)
    # and text between them is our token, store it for second step of actual login
    token = html[start_index:end_index]

    # here we craft our payload, it's all the form fields, including HIDDEN fields!
    #   that includes the token we scraped earlier, as that's usually in hidden fields
    #   make sure left side is from "name" attributes of the form,
    #       and right side is what you want to post as "value"
    #   and for hidden fields make sure you replicate the expected answer,
    #       eg. "token" or "yes I agree" checkboxes and such
    payload = {
        '_token':token,
    #    'name':'value',    # make sure this is the format of all additional fields !
        'login':username,
        'password':password
    }

    # now we prepare all we need for login
    #   data - with our payload (user/pass/token) urlencoded and encoded as bytes
    data = urllib.parse.urlencode(payload)
    binary_data = data.encode('UTF-8')
    # and put the URL + encoded data + correct headers into our POST request
    #   btw, despite what I thought, it is automatically treated as POST;
    #   I guess because of the byte-encoded data field, you don't need to spell it out as:
    #       urllib.request.Request(authentication_url, binary_data, headers, method='POST')
    request = urllib.request.Request(authentication_url, binary_data, headers)
    response = urllib.request.urlopen(request)
    contents = response.read()

    # just for kicks, we confirm some element in the page that's secure behind the login
    #   we use a particular string we know only occurs after login,
    #   like "logout" or "welcome" or "member", etc. I found "Logout" is pretty safe so far
    contents = contents.decode("utf-8")
    index = contents.find(check_string)
    # if we find it
    if index != -1:
        print(f"We found '{check_string}' at index position : {index}")
    else:
        print(f"String '{check_string}' was not found! Maybe we did not login ?!")

scraper_login()
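
From there, parsing works as usual. For instance, a sketch that assumes you change the end of scraper_login() to return contents instead of only printing:

from bs4 import BeautifulSoup

# assumes scraper_login() was modified to end with: return contents
html = scraper_login()
soup = BeautifulSoup(html, 'html.parser')
print(soup.title)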