Question

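I'm using requests and bs4 to extract the first result preview (snippet) from the link http://duckduckgo.com/?q=who+is+harry+potter

However, when I use bs4's find method to look for the div with the class "result__snippet", it returns None. Yet when I save the whole page to disk, open that file directly, and parse it with bs4, soup.find('div', class_='result__snippet').get_text() returns exactly the output I want.

Any help?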

Answer 1

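The site you linked to appears to use JavaScript to build the search results, so the page you retrieve and pass to BeautifulSoup doesn't actually contain the search results yet.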
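One way to confirm this is to check whether the class is present in the HTML at all. The sketch below is only illustrative: `has_snippet` is a hypothetical helper, and the two HTML strings are made-up stand-ins for the raw response and for a page saved from a browser after the scripts have run.

```python
from bs4 import BeautifulSoup


def has_snippet(html, class_name="result__snippet"):
    """Return True if any element with the given class exists in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find(class_=class_name) is not None


# Made-up stand-in for what requests receives: only a script and a
# noscript hint, with no rendered results.
raw_response = (
    "<html><body><noscript>Enable JavaScript</noscript>"
    "<script>/* builds the results in the browser */</script></body></html>"
)

# Made-up stand-in for a page saved from the browser, after the scripts ran.
saved_from_browser = (
    '<html><body><div class="result__snippet">'
    "Harry Potter is a series of fantasy novels</div></body></html>"
)

print(has_snippet(raw_response))        # False
print(has_snippet(saved_from_browser))  # True
```

This would also explain why the copy saved to disk parses fine: the browser had already run the scripts before the page was saved.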

If you look at the contents of the retrieved page (print(soup.text)), you'll see that it suggests using http://duckduckgo.com/html/?q=who+is+harry+potter if you don't have JavaScript enabled.

Scraping that URL instead should give you the content you're after.
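A minimal sketch of that approach, under a few assumptions: the function names are hypothetical, the User-Agent header and timeout are arbitrary choices, and the lookup matches on the class alone because the HTML version may not use the same tag for snippets as the JavaScript version.

```python
import requests
from bs4 import BeautifulSoup

# JavaScript-free endpoint mentioned in the retrieved page.
HTML_ENDPOINT = "http://duckduckgo.com/html/"


def extract_first_snippet(html, class_name="result__snippet"):
    """Return the text of the first element with the given class, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Match on the class only; the tag name is not assumed.
    node = soup.find(class_=class_name)
    return node.get_text(strip=True) if node is not None else None


def fetch_first_snippet(query):
    """Fetch the HTML-only results page for `query` and return the first snippet."""
    response = requests.get(
        HTML_ENDPOINT,
        params={"q": query},                    # sent as ?q=who+is+harry+potter
        headers={"User-Agent": "Mozilla/5.0"},  # assumption: a browser-like UA
        timeout=10,                             # assumption: arbitrary timeout
    )
    response.raise_for_status()
    return extract_first_snippet(response.text)
```

Calling fetch_first_snippet("who is harry potter") should return the first snippet's text, or None if no element with that class is found in the response.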

Answer 2

One approach is to combine Selenium with BeautifulSoup. Try this; it works.

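from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup as bs
from fake_useragent import UserAgent

url = 'https://duckduckgo.com/?q=who+is+harry+potter&ia=web'
delay = 10  # seconds to wait per attempt (undefined in the original; this value is an assumption)

# Override the browser's user agent with a random one. Preferences are set
# through Options, which recent Selenium releases expect instead of a
# FirefoxProfile argument.
options = Options()
ua1 = UserAgent()
options.set_preference('general.useragent.override', str(ua1.random))
driver = webdriver.Firefox(options=options)
driver.get(url)

# Keep waiting until the JavaScript-rendered snippet is present in the DOM.
while True:
    try:
        WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'result__snippet')))
        print('Page is ready!')
        break
    except TimeoutException:
        print('Loading took too much time!')

# Grab the rendered DOM, then shut the browser down.
html = driver.execute_script('return document.body.innerHTML')
driver.quit()

# Parse the rendered HTML and print the first snippet.
b_html = bs(html, 'html.parser')
x = b_html.find_all('div', class_='result__snippet')[0].get_text()
print(x)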

Output:

Harry Potter is a series of fantasy novels written by British author J. K. Rowling. The novels chronicle the life of a young wizard, Harry Potter, ...

BeautifulSoup4 cannot find the appropriate element
