I'm new to Python and I'm trying to build an automatic file downloader, but to do that I have to scrape the "Downloads" section of a site that sits behind a login.
The problem is that this "Downloads" page is dynamically generated, and I'm having trouble picking a tool: Selenium is just impractical here, because I want to avoid the whole process of launching a browser and logging in through it.
I used Requests to open a session and post the payload to the login form. After logging in successfully, I tried to bring in requests_html
so it could render the HTML without Scrapy, Splash, or Docker, since those modules are too advanced for me right now.
The Requests session started conflicting with the requests_html module, so I commented that part out in my code to disable it. Before I added requests_html I wasn't getting any errors, and I managed to print the page's HTML, but the JavaScript wasn't rendered. Here's what it looks like:
# importing the modules we are going to need for scraping
import requests
import pyppdf.patch_pyppeteer  # patches pyppeteer's Chromium download, which requests_html rendering depends on
from bs4 import BeautifulSoup as bs
from requests_html import HTMLSession
# importing the username and password from the .env file (in the same folder)
# using the dotenv module
import os
from dotenv import load_dotenv
load_dotenv()
user = os.environ.get("USER")
password = os.environ.get("PASSWORD")
# for the login, we need to create a dictionary that contains the necessary input information
# (username and password that was imported from the .env file)
# with the "name" values defined in the HTML file of the site
payload = {
    "usr": user,         # "usr" = the input's name attribute in the login form, followed by its value
    "passwd": password   # "passwd" = the input's name attribute, followed by its value
}
# defining the URL we want to scrape and the pre-login page as well
# (these are not the real URLs)
login_url = "https://www.example_site.com/"
real_url = "https://www.example_site.com/modules/user/DownloadsCentral"
# defining functions to start and close sessions
def open_session():
    # defining function attributes so we can reuse the session and manually close it afterward
    open_session.session = requests.Session()
    open_session.session.post(login_url, data=payload)  # send (post) the login payload to the site
    open_session.response = open_session.session.get(real_url)  # fetch the page we want to scrape
    # HTML part (haven't figured out what went wrong, so I switched back to a plain Requests session)
    # html_session = HTMLSession()
    # html_session.post(login_url, data=payload)
    # open_session.get_session = html_session.get(real_url)
# close the session (required) by calling the function close_session() below
def close_session():
    open_session.session.close()
# calling the function to open a session
open_session()
# HTML render (didn't work)
# open_session.get_session.html.render(sleep=2, keep_page=True, scrolldown=0)
# files = open_session.get_session.html.find("css selector")
# print(files)
# once we're able to establish a connection with the internal URL
# we can parse the HTML using the lxml parser
soup = bs(open_session.response.text, "lxml")  # parse the fetched page with BeautifulSoup using lxml
print(soup.prettify())
# closing the session after everything is done
close_session()
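In case it helps to see where I was headed: below is a minimal sketch (untested) of a single-session version, based on my possibly wrong assumption that requests_html's HTMLSession subclasses requests.Session and can therefore do the login POST itself, so the two session objects never conflict. It reuses login_url, real_url, and payload from the code above, and "css selector" is just a placeholder.
# minimal sketch (untested): one HTMLSession handles both the login and the rendering
from requests_html import HTMLSession

session = HTMLSession()
session.post(login_url, data=payload)  # log in with the same payload as above
response = session.get(real_url)  # fetch the dynamically generated page
response.html.render(sleep=2)  # render the JavaScript (downloads Chromium on first run)
files = response.html.find("css selector")  # "css selector" is a placeholder
print(response.html.html)  # the rendered HTML
session.close()
(I'm not sure whether render() keeps the session's cookies when it reloads the page in Chromium, so that may be part of what's going wrong.)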
(This is my first time posting a question here, so apologies if I've included some unnecessary information.)