I am trying to web-scrape a website that has multiple pages rendered by JavaScript. I am using BeautifulSoup and Selenium. I have a script that works only for the website's first page. Is it possible to web-scrape multiple JavaScript-rendered pages, or do I need to handle each one separately? Here is my script:
jsp
Thanks.
Answer 0 (score: 0)
The problems are here:

- requests.get() is mixed with browser.get(). The requests module is not needed at all here, since you are fetching the pages through the headless browser.
- time.sleep() should go between browser.get() and the parsing, to allow the page to load completely before handing it to BeautifulSoup.
- Write data to the JSON file outside the for loop, with open("js-webscrape.json", "w", encoding="utf-8").
- Quit the browser outside the for loop, not after a single iteration.
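The write-outside-the-loop point can be shown in isolation. This is a minimal sketch with stub dictionaries standing in for the scraped items (only the file name comes from the answer); dumping inside the loop would overwrite the file on every page:

```python
import json

# stand-in data for what the scraping loop would collect per page
data = []
for page in range(1, 3):
    data.append({"page": page, "title": "item %d" % page})

# dump once, after the loop finishes, so earlier pages are not overwritten
with open("js-webscrape.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)

# read it back to confirm both pages landed in one file
with open("js-webscrape.json", encoding="utf-8") as f:
    print(len(json.load(f)))
```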
Here is a working implementation that scrapes all 7 pages:
import time
import json

from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# The path to where you have your chrome webdriver stored:
webdriver_path = '/Users/Gebruiker/Downloads/chromedriver_win32/chromedriver'

# Add arguments telling Selenium to not actually open a window
chrome_options = Options()
chrome_options.add_argument('--headless')

# Fire up the headless browser
browser = webdriver.Chrome(executable_path=webdriver_path, options=chrome_options)

# Load each results page in turn
url = "https://cnx.org/search?q=subject:Arts"
data = []
n = 7
for i in range(1, n + 1):
    browser.get(url + "&page=" + str(i))
    time.sleep(5)  # give the JavaScript time to render before parsing

    # Parse HTML
    page_soup = soup(browser.page_source, 'lxml')
    containers = page_soup.findAll("tr")
    for container in containers:
        item = dict()
        item['type'] = "Course Material"
        if container.find('td', {'class': 'title'}):
            item['title'] = container.find('td', {'class': 'title'}).h4.text.strip()
        else:
            item['title'] = ""
        if container.find('td', {'class': 'authors'}):
            item['author'] = container.find('td', {'class': 'authors'}).text.strip()
        else:
            item['author'] = ""
        if container.find('td', {'class': 'title'}):
            item['link'] = "https://cnx.org/" + container.find('td', {'class': 'title'}).a["href"]
        else:
            item['link'] = ""
        if container.find('td', {'class': 'title'}):
            item['description'] = container.find('td', {'class': 'title'}).span.text
        else:
            item['description'] = ""
        item['subject'] = "Arts"
        item['source'] = "OpenStax CNX"
        item['base_url'] = "https://cnx.org/browse"
        item['license'] = "Attribution"
        data.append(item)  # add the item to the list

# write data to file and quit browser when done
print(data)
with open("js-webscrape.json", "w", encoding="utf-8") as writeJSON:
    json.dump(data, writeJSON, ensure_ascii=False)
browser.quit()
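As a side note, the repeated container.find('td', {'class': 'title'}) if/else blocks in the script above could be collapsed into one small helper. A minimal sketch, using a made-up HTML row in place of a real cnx.org result (the markup below is illustrative, not the site's actual structure):

```python
from bs4 import BeautifulSoup

def cell_text(container, class_name, default=""):
    # Return the stripped text of the <td> with the given class,
    # or a default when that cell is absent from the row.
    cell = container.find("td", {"class": class_name})
    return cell.text.strip() if cell else default

# hypothetical row mimicking the title/authors cells the script reads
html = """
<table><tr>
  <td class="title"><h4>Intro to Arts</h4><a href="/contents/abc">link</a>
      <span>A short description</span></td>
  <td class="authors">Jane Doe</td>
</tr></table>
"""
row = BeautifulSoup(html, "html.parser").find("tr")
print(cell_text(row, "authors"))      # Jane Doe
print(cell_text(row, "nonexistent"))  # falls back to ""
```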