Question

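I have opened a web page and logged in using webdriver code. I use webdriver because the page requires a login and various other actions before it is ready to be scraped.

The goal is to scrape data from this already open page. I need to find links and open them, so there will be a lot of back-and-forth between selenium webdriver and BeautifulSoup.

I looked at the bs4 documentation, but the BeautifulSoup(open("ccc.html")) pattern raises an error when I try it with my URL: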

soup = bs4.BeautifulSoup(open("https://m/search.mp?ss=Pr+Dn+Ts"))

OSError: [Errno 22] Invalid argument: 'https://m/search.mp?ss=Pr+Dn+Ts'

I think this is because it is not an .html file?

Answer 1

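You are trying to open a web page by its URL. open() does not do that: it only opens local files, so it treats the URL as a file path and fails. Use urlopen() instead: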

import bs4
from urllib.request import urlopen  # Python 3
# from urllib2 import urlopen  # Python 2

url = "your target url here"
soup = bs4.BeautifulSoup(urlopen(url), "html.parser")
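
To make the difference between open() and urlopen() concrete, here is a minimal sketch using only the standard library. The helper names is_web_url and read_markup are hypothetical and not part of bs4 or urllib; the sketch assumes that anything with an http or https scheme should be fetched over the network and that everything else is a local file path.

```python
from urllib.parse import urlparse
from urllib.request import urlopen


def is_web_url(target):
    """Return True if target has an http(s) scheme rather than being a file path."""
    return urlparse(target).scheme in ("http", "https")


def read_markup(target):
    """Hypothetical helper: return raw markup bytes from a URL or a local file."""
    if is_web_url(target):
        # Network fetch: the built-in open() cannot do this.
        with urlopen(target) as response:
            return response.read()
    # Local file: this is the case that open("ccc.html") covers.
    with open(target, "rb") as handle:
        return handle.read()
```

The bytes returned by read_markup can be passed to bs4.BeautifulSoup together with a parser name, in the same way as the urlopen() result above.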

Alternatively, use "HTTP for Humans", the requests library:

import requests

response = requests.get(url)
soup = bs4.BeautifulSoup(response.content, "html.parser")

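Also note that it is strongly recommended to specify a parser explicitly. In this case I used html.parser, which ships with Python; other parsers such as lxml and html5lib are also available.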
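
As a small illustration of passing the parser explicitly, the sketch below parses an inline HTML string. The markup and the class name msg are made up for the example; only the "html.parser" argument comes from the answer.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded page.
html = "<html><body><p class='msg'>Hello</p></body></html>"

# Naming the parser keeps the result consistent across machines,
# regardless of which optional parsers happen to be installed.
soup = BeautifulSoup(html, "html.parser")

text = soup.find("p", class_="msg").get_text()
print(text)
```

Switching to "lxml" or "html5lib" only changes the second argument, but those parsers have to be installed separately.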

"I want to use the exact same page (the same instance)"

The usual way to do this is to get driver.page_source and pass it to BeautifulSoup for further parsing:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get(url)

# wait for page to load..

source = driver.page_source
driver.quit()  # remove this line to leave the browser open

soup = BeautifulSoup(source, "html.parser")
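
Since the question also mentions finding links and opening them, here is a hedged sketch of that step. The function extract_links and the sample markup are hypothetical; the sketch assumes the links of interest are ordinary a elements with an href attribute, and it resolves relative links against the page URL with urljoin.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(source, base_url):
    """Return absolute URLs for every <a href=...> found in the page source."""
    soup = BeautifulSoup(source, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        # Relative hrefs are resolved against the URL of the current page.
        links.append(urljoin(base_url, anchor["href"]))
    return links


# Made-up markup standing in for driver.page_source.
sample_source = (
    '<a href="/item/1">One</a>'
    '<a href="https://example.com/item/2">Two</a>'
    '<a name="x">no href</a>'
)
links = extract_links(sample_source, "https://example.com/search")
print(links)
```

With a live session, the same function could be called as extract_links(driver.page_source, driver.current_url), and each result opened with driver.get(link) so that the logged-in browser instance is reused. In that case the driver.quit() call has to wait until all links have been visited.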

Use an already open web page (with Selenium) in BeautifulSoup?
