Unable to identify a web page in BeautifulSoup by its URL

Date: 2017-01-09 02:39:49

Tags: python selenium proxy-server

I am using Python and Selenium to try to scrape all of the links from the results page of a particular search. No matter what I search for on the previous screen, the URL of any results page is: "https://chem.nlm.nih.gov/chemidplus/ProxyServlet". If I run the search automatically with Selenium and then try to read that URL into BeautifulSoup, I get HTTPError: HTTP Error 404: Not Found.

Here is my code:

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.by import By
from urllib.request import urlopen
from bs4 import BeautifulSoup
import csv


# create a new Firefox session
driver = webdriver.Firefox()
# set an implicit wait: poll up to 3 seconds when locating elements
driver.implicitly_wait(3)

# navigate to ChemIDPlus Website
driver.get("https://chem.nlm.nih.gov/chemidplus/")
#implicit wait 10 seconds for drop-down menu to load
driver.implicitly_wait(10)

#open drop-down menu QV7 ("Route:")
select=Select(driver.find_element_by_name("QV7"))
#select "inhalation" in QV7
select.select_by_visible_text("inhalation")
#identify submit button

search = "/html/body/div[2]/div/div[2]/div/div[2]/form/div[1]/div/span/button[1]"

#click submit button
driver.find_element_by_xpath(search).click()

#increase the number of results per page
select=Select(driver.find_element_by_id("selRowsPerPage"))
select.select_by_visible_text("25")
#reset the implicit wait to 3 seconds for the remaining lookups
driver.implicitly_wait(3)

#identify current search page...HERE IS THE ERROR, I THINK
url1="https://chem.nlm.nih.gov/chemidplus/ProxyServlet"
page1=urlopen(url1)   #this call raises HTTPError: HTTP Error 404: Not Found
#read the search page
soup=BeautifulSoup(page1.read(), 'html.parser')

I suspect this has something to do with the proxy server and that Python is not receiving the information it needs to identify the website, but I don't know how to work around it. Thanks in advance!
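One quick way to test that suspicion (a minimal diagnostic sketch, added here for illustration) is to compare the hard-coded URL with what the browser actually sees after the search, since the real results URL may carry a query string, and the session may hold cookies, that a bare urlopen call never sends:

#diagnostic sketch: run after the Selenium steps above
print(driver.current_url)   #full results URL, including any query string
print(driver.get_cookies()) #session cookies that urlopen does not send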

1 Answer:

Answer 0: (score: 0)

As a workaround, I used Selenium to identify the new URL of the correct search page: url1 = driver.current_url. Next, I used requests to fetch the content and fed it to BeautifulSoup. In short, I added:

#Added to the top of the script
import requests
...
#identify the current search page with Selenium
url1=driver.current_url
#scrape the content of the results page
r=requests.get(url1)
soup=BeautifulSoup(r.content, 'html.parser')
...
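As a possible alternative (a sketch on my part, not what the answer above uses): since Selenium has already rendered the results page, its HTML can be handed straight to BeautifulSoup via driver.page_source, which skips the second HTTP request and any cookie or session mismatch between the browser and requests:

#alternative sketch: reuse the page already loaded in the Selenium session
soup=BeautifulSoup(driver.page_source, 'html.parser')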