I use selenium for both automation and scraping. Lately I have found that some websites are too slow to scrape this way. I can scrape them much faster with BeautifulSoup, but then the automation part cannot be done.

Is there any way I can both automate a website (button click events and so on) and also scrape it with BeautifulSoup?

Could you give me an example of button/search automation with bs4 + selenium?

Any help would be appreciated...
Answer 0 (score: 1)

Example:
from bs4 import BeautifulSoup as Soup
from selenium import webdriver

# Let selenium load and render the page...
driver = webdriver.Chrome()
driver.get("https://stackoverflow.com/questions/tagged/beautifulsoup+selenium")

# ...then hand the rendered HTML to BeautifulSoup for parsing.
page = Soup(driver.page_source, features='html.parser')
questions = page.select("#questions h3 a[href]")
for question in questions:
    print(question.text.strip())
Or simply:
import requests
from bs4 import BeautifulSoup as Soup

# Plain requests is enough when the page does not need JavaScript rendering.
url = 'https://stackoverflow.com/questions/tagged/beautifulsoup+selenium'
response = requests.get(url=url)
page = Soup(response.text, features='html.parser')
questions = page.select("#questions h3 a[href]")
for question in questions:
    print(question.text.strip())
Answer 1 (score: 0)
Absolutely. You can let selenium do all the rendering and then pass the page source to BeautifulSoup, like this:
from bs4 import BeautifulSoup as bs

# 'driver' is your existing selenium webdriver with the target page loaded
soup = bs(driver.page_source, 'html.parser')
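To cover the button/search automation the question asks about, here is a minimal sketch of that combination: selenium drives the search box and submits the query, then BeautifulSoup parses the rendered results. The name="q" search-box locator and the result selector are assumptions about the page markup, so adjust them for the site you are actually automating (the element calls use the same pre-Selenium-4 style as the other answers in this thread):

from bs4 import BeautifulSoup as Soup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://stackoverflow.com/")

# Automation part: type a query into the search box and submit it.
# name="q" is an assumption about the page's markup.
search_box = driver.find_element_by_name("q")
search_box.send_keys("beautifulsoup selenium")
search_box.send_keys(Keys.RETURN)

# Scraping part: hand the rendered page to BeautifulSoup.
page = Soup(driver.page_source, features='html.parser')
for title in page.select("h3 a[href]"):  # illustrative selector only
    print(title.text.strip())

driver.quit()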
Answer 2 (score: 0)
How about this? It gives you the live DOM with the JavaScript already executed, which saves you time searching. The idea is to grab the whole body; if you also want the head, replace "body" accordingly. The result is exactly the same as what selenium sees. I hope you enjoy it.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
dri = webdriver.Chrome(options=options)
dri.get("https://stackoverflow.com/questions/tagged/beautifulsoup+selenium")  # load the target page (URL reused from the example above)

# Grab the rendered <body>, i.e. the live DOM after JavaScript has run
html = dri.find_element_by_tag_name("body").get_attribute('innerHTML')
soup = BeautifulSoup(html, features="lxml")
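Because the innerHTML of body is read after JavaScript has executed, this captures content that a plain requests call would miss. If you go this route, the resulting soup can be queried the same way as in the earlier answers, for example (the selector below is purely illustrative and depends on the page you actually loaded):

for link in soup.select("a[href]"):
    print(link.get("href"))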