Scraping: can't extract content from a webpage

Date: 2020-08-26 14:35:23

Tags: python selenium screen-scraping

I'm trying to scrape the news items from the following page, without success: https://www.business-humanrights.org/en/latest-news/?&search=nike

I tried BeautifulSoup:

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.business-humanrights.org/en/latest-news/?&search=nike")
soup = BeautifulSoup(r.content, 'lxml')
soup

But the content I'm looking for, the news snippets marked up as div class='card__content', does not appear in the soup output.
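A quick way to confirm that the cards are injected by JavaScript rather than present in the server-rendered HTML is to check whether the class name appears anywhere in the raw response at all. Here is a self-contained sketch of that check; the HTML string is a made-up stand-in for what the server actually returns:

```python
from bs4 import BeautifulSoup

# Stand-in for r.content: server-rendered markup where the news cards
# are filled in later by client-side JavaScript, so the 'card__content'
# class never appears in the static HTML.
raw_html = """
<html><body>
  <div id="app"><!-- cards rendered client-side --></div>
</body></html>
"""

soup = BeautifulSoup(raw_html, "html.parser")
cards = soup.find_all("div", class_="card__content")
print(cards)  # → [] : there is nothing to find in the static markup
```

If `"card__content" not in r.text` holds for the real response, no amount of parsing will surface the cards; a JavaScript-capable client (or the site's API) is needed.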

I also checked for an iframe to switch to, but couldn't find one.

Finally, I tried PhantomJS with the following code, also without success:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://www.business-humanrights.org/en/latest-news/?&search=nike"
driver = webdriver.PhantomJS(executable_path= '~\Chromedriver\phantomjs-2.1.1-windows\bin\phantomjs.exe')

driver.get(url)
time.sleep(7)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
container = soup.find_all('div', attrs={
    'class':'card__content'})
print(container)

I'm running out of options. Can anyone help?

3 Answers:

Answer 0 (score: 0)

driver.page_source returns the initial HTML document content no matter how long you wait, so time.sleep(7) has no effect here.

Try the following approach instead:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(url)
cards = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located(
        (By.XPATH, "//div[@class='card__content' and normalize-space(.)]")
    )
)
texts = [card.text for card in cards]
print(texts)
driver.quit()

Answer 1 (score: 0)

Use the API:

import requests

r = requests.get("https://www.business-humanrights.org/en/api/internal/explore/?format=json&search=nike")
print(r.json())

Answer 2 (score: 0)

I don't see why you're facing this problem. I tried the same page, but without requests and bs4. I used requests_html instead; XPath expressions work directly in this library without any additional dependencies.

import requests_html

session = requests_html.HTMLSession()
URL = 'https://www.business-humanrights.org/en/latest-news/?&search=nike'
res = session.get(URL)


divs_with_required_class = res.html.xpath(r'//div[@class="card__content"]')

# enumerate avoids the O(n) list.index() lookup on every iteration
for i, item in enumerate(divs_with_required_class, start=1):
    print(f'Div {i}:\n', item.text, end='\n\n')