Web scraping a dynamic JavaScript website with Requests and bs4 (not Selenium)

Date: 2020-06-27 18:23:38

Tags: javascript python web-scraping python-requests python-requests-html

I'm new to Python and trying to build an automatic file downloader, but to do that I need to scrape the "Downloads" section of a website that sits behind a login.

The problem is that this "Downloads" page is generated dynamically, and I'm having trouble choosing the right tool: Selenium is simply impractical here, because I want to avoid the whole process of launching a browser and logging in through it.

I use Requests to open a session and post the payload to the login form. After a successful login, I'm trying to bring in requests_html so it can render the HTML without resorting to Scrapy, Splash, or Docker, since those tools are too advanced for me at the moment.
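
To sanity-check that the login post itself works before worrying about rendering, something like this can be used (a minimal sketch reusing the payload and login_url from the code below; "Log out" is just a guess at a string that only appears after a successful login, since I can't share the real site):

import requests

session = requests.Session()
resp = session.post(login_url, data = payload)
print(resp.status_code)        # often 200 even when the login fails, so check the body too
print("Log out" in resp.text)  # True only if the page shows some logged-in marker
session.close()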

The Requests session started to conflict with the requests_html module, so I commented that part out in my code to disable it. Before I added requests_html I got no errors and managed to print the page HTML, but the JavaScript wasn't rendered. Here's what it looks like:


# importing the modules we are going to need for scraping

import requests
import pyppdf.patch_pyppeteer # patches pyppeteer's Chromium download so requests_html's render() can work
from bs4 import BeautifulSoup as bs
from requests_html import HTMLSession

# importing the username and password from the .env file (in the same folder)
# using the dotenv module

import os
from dotenv import load_dotenv
load_dotenv()

user = os.environ.get("USER") # careful: "USER" is usually already set by the OS, and load_dotenv() does not override existing variables by default
password = os.environ.get("PASSWORD")

# for the login, we need to create a dictionary that contains the necessary input information
# (username and password that was imported from the .env file)
# with the "name" values defined in the HTML file of the site

payload = {
    "usr": user, # "usr" = the input's name attribute; the value comes from the .env file
    "passwd": password # "passwd" = the password input's name attribute
}

# defining the URL we want to scrape and the pre-login page as well
# (these are not the real URLs)

login_url = "https://www.example_site.com/"
real_url = "https://www.example_site.com/modules/user/DownloadsCentral"

# defining functions to start and close sessions

def open_session():

    # defining function attributes so we can close the session and parse the response later

    open_session.session = requests.Session()
    open_session.session.post(login_url, data = payload) # send (post) the payload to the login form
    open_session.response = open_session.session.get(real_url) # fetch the page we want to scrape and keep the response

    # HTML part (haven't figured out what went wrong, so I switched back to open a session with Requests)

    # html_session = HTMLSession()
    # html_session.post(login_url, data = payload)
    # open_session.get_session = html_session.get(real_url)

# close the session (required) by calling the function close_session() below

def close_session():
    open_session.session.close()



# calling the function to open a session

open_session()

# HTML render (didn't work)

# open_session.get_session.html.render(sleep = 2, keep_page = True, scrolldown = 0) # render() mutates .html in place and returns None
# files = open_session.get_session.html.find("css selector")
# print(files)

# once we're able to establish a connection with the internal URL
# we can parse the HTML using the lxml parser

soup = bs(open_session.response.text, "lxml") # parse the fetched HTML with BeautifulSoup using the lxml parser
print(soup.prettify())


# closing the session after everything is done

close_session()
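
What I suspect might work is doing the whole flow with requests_html's HTMLSession instead of mixing it with requests.Session; as far as I can tell, HTMLSession is a subclass of requests.Session, so the login cookies should carry over to the render step. A minimal sketch of what I mean (untested, and the CSS selector is made up):

from requests_html import HTMLSession

session = HTMLSession()
session.post(login_url, data = payload)  # the login cookies are stored on the session
r = session.get(real_url)                # same session, so the login should carry over
r.html.render(sleep = 2)                 # downloads Chromium on first use and executes the page's JavaScript
files = r.html.find("a.download-link")   # hypothetical selector for the download links
print(files)
session.close()

(If the conflict turns out to be an event-loop error, requests_html also ships an AsyncHTMLSession that might avoid it, but I haven't tried that either.)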

(This is my first time posting a question here, so apologies if I've included any unnecessary information.)

An image of the data I'm trying to scrape is here.

0 Answers:

No answers yet.