I am trying to scrape news content from the following page, but without success: https://www.business-humanrights.org/en/latest-news/?&search=nike
I tried BeautifulSoup:
```python
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.business-humanrights.org/en/latest-news/?&search=nike")
soup = BeautifulSoup(r.content, 'lxml')
soup
```
But the content I am looking for, the news snippets marked with div class='card__content', does not appear in the soup output.
I also checked for an iframe to switch into, but could not find one.
Finally, I tried PhantomJS with the following code, again without success:
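One quick way to confirm that the snippets are injected by JavaScript is to search the raw response body for the class name: if it is absent from the HTML that requests receives, no parser will find it either. A minimal sketch (the helper name and the sample HTML are mine, for illustration):

```python
def contains_class(html: str, class_name: str) -> bool:
    """Return True if the given CSS class name appears anywhere in the raw HTML string."""
    return class_name in html

# A static document without the dynamically injected cards:
initial_html = "<html><body><div id='app'></div></body></html>"
print(contains_class(initial_html, "card__content"))  # False

# Live check against the real page (requires network access):
# import requests
# r = requests.get("https://www.business-humanrights.org/en/latest-news/?&search=nike")
# print(contains_class(r.text, "card__content"))
```

If the live check prints False, the cards are rendered client-side and a plain requests fetch can never see them.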
```python
import time

from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.business-humanrights.org/en/latest-news/?&search=nike"

# A raw string avoids backslash escapes (\b, \p) in the Windows path.
driver = webdriver.PhantomJS(executable_path=r'~\Chromedriver\phantomjs-2.1.1-windows\bin\phantomjs.exe')
driver.get(url)
time.sleep(7)
page = driver.page_source
driver.quit()

soup = BeautifulSoup(page, 'html.parser')
container = soup.find_all('div', attrs={'class': 'card__content'})
print(container)
```
I am out of options; can anyone help?

Answer 0 (score: 0)
driver.page_source returns the initial HTML document content here, no matter how long you wait (the time.sleep(7) has no effect).
Try the following instead:
```python
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(url)
cards = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located(
        (By.XPATH, "//div[@class='card__content' and normalize-space(.)]")))
texts = [card.text for card in cards]
print(texts)
driver.quit()
```
Answer 1 (score: 0)

Use the API instead:
```python
import requests

r = requests.get("https://www.business-humanrights.org/en/api/internal/explore/?format=json&search=nike")
print(r.json())
```
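The exact shape of that JSON payload is not shown in the answer, so rather than guess at field names, a generic helper can pull out every value stored under a given key anywhere in the nested structure. A sketch (the helper, the sample payload, and the "title" key are all assumptions for illustration):

```python
def find_values(obj, key):
    """Recursively collect every value stored under `key` in a nested dict/list structure."""
    found = []
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                found.append(v)
            found.extend(find_values(v, key))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(find_values(item, key))
    return found

# Made-up payload; the real response shape may differ:
sample = {"results": [{"title": "Nike story A"}, {"title": "Nike story B"}], "count": 2}
print(find_values(sample, "title"))  # ['Nike story A', 'Nike story B']
```

With the real response you would call find_values(r.json(), "some_key") once you have inspected which keys the API actually returns.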
Answer 2 (score: 0)

I do not understand why you are facing this problem. I tried the same page, but without requests and bs4; I used requests_html instead. XPath expressions work directly in this library without any additional packages.
```python
import requests_html

session = requests_html.HTMLSession()
URL = 'https://www.business-humanrights.org/en/latest-news/?&search=nike'
res = session.get(URL)
divs_with_required_class = res.html.xpath(r'//div[@class="card__content"]')
# enumerate avoids the repeated list.index() lookup of the original loop:
for i, item in enumerate(divs_with_required_class, start=1):
    print(f'Div {i}:\n', item.text, end='\n\n')
```