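Web scraping with requests does not work as expected

Time: 2019-12-12 20:46:50

Tags: python web-scraping beautifulsoup python-requests

I'm trying to fetch HTML from CNN for a personal project. I'm using the requests library and I'm new to it. I followed a basic tutorial to get the HTML from CNN with requests, but I keep getting a response that differs from the HTML I see when I inspect the page in my browser. Here is my code: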

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.cnn.com/'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())

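I'm trying to get article headlines from CNN, but this is the first problem I've run into. Thanks for your help!

Update: It seems I know less than I originally thought. My real question is: how do I extract the headlines from the CNN homepage? I have tried both answers, but the HTML returned by requests does not contain the headline information. How can I get the headline information shown in this picture (a screenshot from my browser)? [Screenshot of a CNN article title with the accompanying HTML side by side]

3 answers:

Answer 0 (score: 1)

I tried the following code and it worked for me.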

import requests
from bs4 import BeautifulSoup

base_url = 'https://www.cnn.com/'
# Send a browser-like User-Agent instead of the default python-requests one.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36'
}
r = requests.get(base_url, headers=headers)
soup = BeautifulSoup(r.text, "html.parser")
print(soup.prettify())

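Note that I passed a headers argument to requests.get(). All it does is try to mimic a real browser, so that anti-scraping checks are less likely to flag the request.
Hope this helps; if not, feel free to ask me in the comments. Cheers :)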
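
A quick way to check whether the header makes any difference is to compare the status code and body size of the two responses. The following is only a sketch: `BROWSER_HEADERS`, `summarize`, and `compare` are names invented for this example, it assumes `requests` is installed, and whether CNN treats the two requests differently may change over time.

```python
# Browser-like User-Agent copied from the code above.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/79.0.3945.79 Safari/537.36"
    )
}


def summarize(response):
    """Return (status code, body length in characters) for a response-like object."""
    return response.status_code, len(response.text)


def compare(url):
    """Fetch url with the default and the browser-like User-Agent (hypothetical helper)."""
    import requests  # third-party; imported here so summarize() works without it

    plain = requests.get(url, timeout=10)
    spoofed = requests.get(url, headers=BROWSER_HEADERS, timeout=10)
    return {"default": summarize(plain), "browser-like": summarize(spoofed)}
```

Calling `compare('https://www.cnn.com/')` returns a small dict. A 404 or a much shorter body under "default" would suggest the plain request is being treated differently.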

Answer 1 (score: 1)

You can scrape https://cnn.com with Selenium and ChromeDriver:

import bs4 as bs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

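chrome_options = Options()
# Selenium 3 style: the first positional argument is the ChromeDriver path.
# Newer Selenium 4 releases no longer accept it and expect a Service object
# (selenium.webdriver.chrome.service.Service) passed as service=... instead.
driver = webdriver.Chrome("---CHROMEDRIVER-PATH---", options=chrome_options)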

driver.get('https://cnn.com/')
soup = bs.BeautifulSoup(driver.page_source, 'lxml')

# Get Titles from HTML.
titles = soup.find_all('span', {'class': 'cd__headline-text'})
print(titles)

# Close the browser and end the ChromeDriver session.
driver.quit()

Output:

[<span class="cd__headline-text"><strong>The West turned Aung San Suu Kyi into a saint. She was always going to disappoint </strong></span>, <span class="cd__headline-text"><strong>In Hindu-nationalist India, Muslims risk being branded infiltrators</strong></span>, <span class="cd__headline-text">Johnson may have stormed to victory, but he's got a problem</span>, <span class="cd__headline-text">Impeachment heads to full House after historic vote</span>, <span class="cd__headline-text">Supreme Court to decide on Trump's financial records</span>, <span class="cd__headline-text">Michelle Obama's message for Thunberg after Trump mocks her</span>, <span class="cd__headline-text">Actor Danny Aiello dies at 86</span>, <span class="cd__headline-text">The biggest risk at the North Pole isn't what you think</span>, <span class="cd__headline-text">US city declares state of emergency after cyberattack </span>, <span class="cd__headline-text">Reality TV show host arrested</span>, <span class="cd__headline-text">Big names in 2019 you may have mispronounced</span>, <span class="cd__headline-text"><strong>Morocco has Africa's 'first fully solar village'</strong></span>]

You can download ChromeDriver from here.
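
The printed list contains Tag objects, and some headlines carry trailing spaces. If plain strings are wanted, something along these lines could work. It is a sketch, not part of the original answer: `clean_headlines` and `headline_texts` are hypothetical helper names, and the `cd__headline-text` class is simply the one used above, which CNN may rename at any time.

```python
def clean_headlines(raw_texts):
    """Strip surrounding whitespace and drop entries that end up empty."""
    stripped = (text.strip() for text in raw_texts)
    return [text for text in stripped if text]


def headline_texts(html, css_class="cd__headline-text"):
    """Return the headline strings found in html (e.g. driver.page_source)."""
    from bs4 import BeautifulSoup  # third-party; imported here so clean_headlines() works alone

    soup = BeautifulSoup(html, "html.parser")
    spans = soup.find_all("span", {"class": css_class})
    return clean_headlines(span.get_text() for span in spans)
```

While the driver is still open, `headline_texts(driver.page_source)` would give a list of strings instead of tags.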

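Answer 2 (score: 0)

I just checked. CNN appears to recognize that you are trying to scrape the site programmatically and serves a 404 / missing page (with no content) instead of the homepage.

Try a browser automation tool such as Selenium (which can also run the browser headless), for example: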

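from selenium import webdriver

# Launches a regular (visible) Firefox window; geckodriver must be available.
driver = webdriver.Firefox()
driver.get('https://cnn.com')
# page_source is the HTML of the page as the browser currently has it.
html = driver.page_source
driver.quit()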
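
The snippet above opens a visible Firefox window. To run without a window, Firefox can be started with its `-headless` flag. The sketch below assumes Selenium and geckodriver are installed; `fetch_rendered_html` is a name made up for this example.

```python
def fetch_rendered_html(url):
    """Load url in headless Firefox and return the page source."""
    # Third-party imports are kept inside the function so that merely defining
    # it does not require Selenium to be installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # Firefox command-line flag: no visible window

    driver = webdriver.Firefox(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always end the browser session, even if loading the page fails.
        driver.quit()
```

The result of `fetch_rendered_html('https://cnn.com')` can then be handed to BeautifulSoup as in the other answers.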