Simple web scraper does not print anything. What is the problem?

Time: 2019-11-18 01:12:12

Tags: python web-scraping beautifulsoup

import requests
from bs4 import BeautifulSoup as bs

results = requests.get("https://www.cnn.com")
src = results.content
soup = bs(src, 'lxml')

urls = []

for h3_tag in soup.find_all("h3"):
    a_tag = h3_tag.find("a")
    urls.append(a_tag.attrs["href"])

for url in urls:
    print(url + "\n")
print(urls)

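For some reason my program prints an empty list, and I can't figure out where the problem is. I'm fairly sure the error is in the first for loop, but I'm not certain.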

1 answer:

Answer 0 (score: 2)

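requests downloads only the HTML that the server initially sends; it does not run JavaScript. Many elements on this page are rendered by JavaScript, so the h3 tags do not exist yet in the response you are parsing, and find_all("h3") returns nothing. You can work around this with browser automation, for example Selenium.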
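One way to confirm this diagnosis before reaching for a browser is to count the h3 tags in the raw HTML. The sketch below is only an illustration: `TagCounter`, `count_tags`, and the two sample pages are hypothetical names and data, not part of the question, and it uses the standard library's `html.parser` so that it runs without extra dependencies.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags of one kind in an HTML string (hypothetical helper)."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == self.tag_name:
            self.count += 1


def count_tags(html, tag_name):
    """Return how many <tag_name> start tags appear in the markup itself."""
    parser = TagCounter(tag_name.lower())
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical page whose headline is present in the HTML itself.
static_page = '<html><body><h3><a href="/story">Story</a></h3></body></html>'

# Hypothetical page whose headline is only created by a script at run time.
scripted_page = (
    "<html><body><script>"
    "document.body.appendChild(document.createElement('h3'));"
    "</script></body></html>"
)

print(count_tags(static_page, "h3"))    # 1
print(count_tags(scripted_page, "h3"))  # 0
```

Applied to the question, passing `results.text` to `count_tags` and getting 0 would suggest that the headlines are not in the HTML that requests received.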

In the Selenium example below I use Mozilla's geckodriver, which can be downloaded from its releases page.

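from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# load the driver (Selenium 4 style; on Selenium 3 use
# webdriver.Firefox(executable_path=...) instead)
driver = webdriver.Firefox(service=Service('Development/webdrivers/geckodriver'))

# get the rendered content and pass it to BS
driver.get('https://www.cnn.com')
html = driver.page_source
soup = bs(html, 'lxml')

# get links, skipping any h3 tag that has no link inside it
anchors = (h3_tag.find("a") for h3_tag in soup.find_all("h3"))
urls = [a_tag["href"] for a_tag in anchors if a_tag is not None and a_tag.has_attr("href")]

# result
print(urls)

# end the browser session
driver.quit()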

Output

['/2019/11/18/politics/ukraine-zelensky-pressure-trump-investigations/index.html',
 '/2019/11/18/politics/house-investigating-trump-lying-to-mueller/index.html',
 '/2019/11/18/politics/trump-tax-documents-supreme-court/index.html',
 '/2019/11/18/politics/house-ways-means-irs-whistleblower/index.html',
 '/videos/politics/2019/11/18/trump-walter-reed-visit-jonathan-reiner-nr-vpx.cnn',
 '/2019/11/18/asia/hong-kong-poly-university-protest-police-intl-hnk/index.html',
 '/2019/11/18/asia/south-china-sea-intl-hnk/index.html',
 '/2019/11/18/politics/pompeo-west-bank-settlements-announcement/index.html',
 '/2019/11/18/uk/prince-andrew-has-thrown-a-fireblanket-over-the-brexit-election-intl-ge19-gbr/index.html',
 '/2019/11/18/uk/jennifer-arcuri-boris-johnson-interview-ge19-gbr-intl/index.html',
 '/2019/11/18/asia/north-korea-us-meeting-intl/index.html',
 '/2019/11/18/africa/france-returns-stolen-sword-to-senegal/index.html',
 '/2019/11/18/us/fresno-mass-shooting-football-party/index.html',
 '/2019/11/18/football/ahmad-mendes-moreira-racist-abuse-fc-den-bosch-excelsior-spt-intl/index.html',
 '/2019/11/18/health/china-bubonic-plague-intl-hnk-scn-scli/index.html',
 '/travel/article/will-i-am-qantas-racism-row-intl-scli/index.html',
 '/2019/11/18/uk/blind-student-oxford-union-scli-intl-gbr/index.html',
 '/2019/11/18/middleeast/iran-protests-explained-intl/index.html',
 '/2019/11/18/business/coty-kylie-cosmetics-deal/index.html',
 '/2019/11/18/sport/israel-folau-bushfires-intl-spt/index.html',
 '/2019/11/18/business/airbus-emirates-dubai-air-show/index.html',
 '/2019/11/18/us/oklahoma-walmart-shooting/index.html',
 '/2019/11/18/us/minnesota-twins-prospect-ryan-costello-dead-trnd/index.html',
 '/travel/article/unruly-airplane-passengers/index.html',
 '/style/article/china-beijing-silvermine-negatives/index.html',
 '/2019/11/18/world/bizarre-basking-shark-scn-trnd/index.html',
 '/style/article/banksy-drinker-sale-intl-scli/index.html',
 '/2019/11/18/business/marie-kondo-online-shop/index.html',
 '/2019/11/18/tennis/tsitsipas-atp-finals-tennis-spt-intl/index.html',
 '/2019/11/18/health/samoa-measles-emergency-intl-scli/index.html',
 '/2019/11/18/africa/bogaletch-gebre-obit-trnd/index.html',
 '/style/article/the-crown-royal-fashion/index.html',
 '/travel/article/thailand-bullet-trains/index.html',
 '/2019/11/18/entertainment/prince-philips-mother-princess-alice-interesting-facts-intl-scli/index.html',
 '/2019/11/18/opinions/trump-assault-weapons-export-abramson/index.html',
 '/2019/11/17/opinions/donald-trump-magic-evaporating-campaign-trail-obeidallah/index.html',
 '/2019/11/16/opinions/this-is-life-counterterrorism-la-bau/index.html',
 '/2019/11/18/perspectives/andrew-yang-technology/index.html']
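
The hrefs in this output are site-relative paths, so they need the site's base URL before they can be requested. A minimal sketch of that step, assuming the base is the page that was scraped; `BASE_URL`, `to_absolute`, and `sample` are illustrative names, and the two sample paths are copied from the output above:

```python
from urllib.parse import urljoin

# Assumption: the links are resolved against the page that was scraped.
BASE_URL = "https://www.cnn.com"


def to_absolute(hrefs, base=BASE_URL):
    """Resolve each href against the base URL (hypothetical helper)."""
    return [urljoin(base, href) for href in hrefs]


# Two paths taken from the output above.
sample = [
    "/2019/11/18/asia/south-china-sea-intl-hnk/index.html",
    "/travel/article/thailand-bullet-trains/index.html",
]

for url in to_absolute(sample):
    print(url)
```

`urljoin` leaves hrefs that are already absolute unchanged, so the helper can be applied to the whole list.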