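I am using requests and bs4 to extract the first preview snippet from the link http://duckduckgo.com/?q=who+is+harry+potter
However, when I use bs4's find method to look for the div with the class "result__snippet", it returns None. Yet when I save the whole page to disk, open it directly, and parse it with bs4, soup.find('div', class_='result__snippet').get_text() returns the expected output.
Any help?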
Answer 0 (score: 0)
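The site you linked to appears to use JavaScript to build its search results, so the page you download with requests and hand to BeautifulSoup does not yet contain any results.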
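To illustrate the point, here is a minimal sketch using a made-up page (the RAW_HTML string is a hypothetical stand-in, not DuckDuckGo's actual markup). BeautifulSoup only parses the markup it is given and never runs the script, so the element the script would have created is simply absent from the parsed tree:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a page whose results are injected by a script.
RAW_HTML = """
<html><body>
  <div id="links"></div>
  <script>
    document.getElementById('links').innerHTML =
      '<div class="result__snippet">Harry Potter is ...</div>';
  </script>
</body></html>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")

# The script is never executed, so no div with this class exists in the tree.
snippet = soup.find("div", class_="result__snippet")
print(snippet)  # None
```

A browser would run the script and fill in the div, which is why a page saved from the browser can contain the snippet while the raw download does not.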
If you look at the content of the retrieved page (print(soup.text)), you will see that it suggests using http://duckduckgo.com/html/?q=who+is+harry+potter if you do not have JavaScript enabled.
Scraping that URL should give you what you need.
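A rough sketch of that approach with requests and bs4 might look like the following. The helper names first_snippet and search_first_snippet are made up for illustration, the browser-like User-Agent header is an assumption (the default one may be rejected), and the lookup matches on the class only, because the tag that carries result__snippet on the /html/ page may not be a div:

```python
import requests
from bs4 import BeautifulSoup

HTML_ENDPOINT = "http://duckduckgo.com/html/"  # non-JavaScript page named in the answer


def first_snippet(html):
    """Return the text of the first element with class result__snippet, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Match on the class only; which tag carries it is an assumption.
    node = soup.find(class_="result__snippet")
    return node.get_text().strip() if node else None


def search_first_snippet(query, timeout=10):
    """Fetch the non-JavaScript results page and extract the first snippet."""
    response = requests.get(
        HTML_ENDPOINT,
        params={"q": query},  # requests handles the URL encoding
        # Assumption: a browser-like User-Agent avoids being served an error page.
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=timeout,
    )
    response.raise_for_status()
    return first_snippet(response.text)
```

Calling search_first_snippet("who is harry potter") should then return the first preview text, or None if the page layout differs from what is assumed here. Keeping the parsing in its own function makes it easy to check against a saved copy of the page.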
Answer 1 (score: 0)
One approach is to combine Selenium with BeautifulSoup. Try the following, which worked for me.
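from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup as bs
from fake_useragent import UserAgent

url = 'https://duckduckgo.com/?q=who+is+harry+potter&ia=web'
delay = 10  # seconds to wait for the results (undefined in the original snippet; value is an assumption)

# Override the browser's user agent with a random one from fake_useragent.
# Preferences are set through FirefoxOptions here, since recent Selenium
# releases no longer accept a FirefoxProfile as the first argument.
options = webdriver.FirefoxOptions()
options.set_preference('general.useragent.override', str(UserAgent().random))
driver = webdriver.Firefox(options=options)

try:
    driver.get(url)
    try:
        # Block until JavaScript has rendered at least one result snippet.
        WebDriverWait(driver, delay).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'result__snippet')))
        print('Page is ready!')
    except TimeoutException:
        print('Loading took too much time!')
    # Grab the DOM as rendered by the browser, not the raw downloaded HTML.
    html = driver.execute_script('return document.body.innerHTML')
finally:
    driver.quit()  # always shut the browser down, even on errors

b_html = bs(html, 'html.parser')
snippet = b_html.find('div', class_='result__snippet')
x = snippet.get_text() if snippet else None  # None if the wait timed out
print(x)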
Output:
Harry Potter is a series of fantasy novels written by British author J. K. Rowling. The novels chronicle the life of a young wizard, Harry Potter, ...