Scraping: can't extract content from a webpage

Date: 2020-08-26 14:35:23

Tags: python selenium screen-scraping

I'm trying to scrape the news items from the following page, without success: https://www.business-humanrights.org/en/latest-news/?&search=nike

I tried BeautifulSoup:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.business-humanrights.org/en/latest-news/?&search=nike")
soup = BeautifulSoup(r.content, 'lxml')
soup

But the content I'm looking for, the news snippets marked up as div class='card__content', does not appear in the soup output.
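A quick way to confirm that the cards are injected by JavaScript rather than present in the server-rendered HTML is to check whether the class name appears anywhere in the raw response at all. Here is a self-contained sketch of that check; the HTML string is a made-up stand-in for what the server actually returns:

```python
from bs4 import BeautifulSoup

# Stand-in for r.content: server-rendered markup where the news cards
# are filled in later by client-side JavaScript, so the 'card__content'
# class never appears in the static HTML.
raw_html = """
<html><body>
  <div id="app"><!-- cards rendered client-side --></div>
</body></html>
"""

soup = BeautifulSoup(raw_html, "html.parser")
cards = soup.find_all("div", class_="card__content")
print(cards)  # → [] : there is nothing to find in the static markup
```

If `"card__content" not in r.text` holds for the real response, no amount of parsing will surface the cards; a JavaScript-capable client (or the site's API) is needed.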

I also checked for an iframe to switch to, but couldn't find one.

Finally, I tried PhantomJS with the following code, also without success:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://www.business-humanrights.org/en/latest-news/?&search=nike"
driver = webdriver.PhantomJS(executable_path= '~\Chromedriver\phantomjs-2.1.1-windows\bin\phantomjs.exe')

driver.get(url)
time.sleep(7)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
container = soup.find_all('div', attrs={
    'class':'card__content'})
print(container)

I'm running out of options. Can anyone help?

3 Answers:

Answer 0 (score: 0)

driver.page_source returns the initial HTML document content no matter how long you wait, so time.sleep(7) has no effect here.

Try the following approach instead:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(url)
cards = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located(
        (By.XPATH, "//div[@class='card__content' and normalize-space(.)]")
    )
)
texts = [card.text for card in cards]
print(texts)
driver.quit()

Answer 1 (score: 0)

Use the API:

import requests

r = requests.get("https://www.business-humanrights.org/en/api/internal/explore/?format=json&search=nike")
print(r.json())

Answer 2 (score: 0)

I don't see why you're facing this problem. I tried the same page, but without requests and bs4. I used requests_html instead; XPath expressions work directly in this library without any additional dependencies.

import requests_html

session = requests_html.HTMLSession()
URL = 'https://www.business-humanrights.org/en/latest-news/?&search=nike'
res = session.get(URL)


divs_with_required_class = res.html.xpath(r'//div[@class="card__content"]')

# enumerate avoids the O(n) list.index() lookup on every iteration
for i, item in enumerate(divs_with_required_class, start=1):
    print(f'Div {i}:\n', item.text, end='\n\n')