Getting a link's URL with Python web scraping: requests, requests_html, Selenium

Time: 2020-10-28 02:55:13

Tags: python selenium-webdriver web-scraping python-requests python-requests-html

I'm new to web scraping and have run into a problem getting a data link from a USGS earthquake page. The page I'm trying to get data from is: https://earthquake.usgs.gov/earthquakes/eventpage/us7000biji/dyfi/intensity

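I'm trying to automate collecting this data so I don't have to do it by hand after every earthquake. The URL of the data I want to pull is consistent except for the earthquake ID, which I have, and a number that doesn't seem to correlate with anything, so I figured I could get that URL through web scraping.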
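For illustration, here is a minimal sketch that splits one of these data URLs into the parts that vary. The path layout (`/archive/product/dyfi/<event id>/<source>/<number>/<file>`) is inferred only from the example data URL given at the end of this question, and the names `ProductUrl` and `parse_product_url` are made up for this sketch.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class ProductUrl(NamedTuple):
    """Parts of a DYFI product URL (hypothetical helper type)."""
    event_id: str
    source: str
    number: str    # the opaque number that differs between events
    filename: str


def parse_product_url(url: str) -> ProductUrl:
    """Split a URL shaped like /archive/product/dyfi/<id>/<source>/<number>/<file>.

    The layout is an assumption based on a single example URL.
    """
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 7 or parts[:3] != ["archive", "product", "dyfi"]:
        raise ValueError(f"unexpected product URL layout: {url}")
    event_id, source, number, filename = parts[3:]
    return ProductUrl(event_id, source, number, filename)


example = parse_product_url(
    "https://earthquake.usgs.gov/archive/product/dyfi/"
    "us7000biji/us/1601214674370/dyfi_geo_10km.geojson"
)
print(example)
```

Only the event ID is known in advance; the number still has to be read from the page, which is why the link needs to be scraped.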

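If you look at the page, there is a drop-down menu called Downloads with different data products. I'm trying to get the URL for DYFI Geospatial Data, UTM aggregated (10 km spacing), so that I can pull the geojson file with curl.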
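Once the geojson URL is known, the download step can also be done from Python rather than curl. This is a minimal sketch using only the standard library; the function name `download_geojson` and the destination path are illustrative, and it assumes the server returns a UTF-8 encoded JSON document.

```python
import json
import urllib.request
from pathlib import Path


def download_geojson(url: str, dest: str) -> dict:
    """Fetch `url`, save the raw bytes to `dest`, and return the parsed JSON."""
    with urllib.request.urlopen(url, timeout=30) as response:
        raw = response.read()
    data = json.loads(raw.decode("utf-8"))  # raises if the body is not valid JSON
    Path(dest).write_bytes(raw)             # only written after parsing succeeded
    return data
```

A call would look like `download_geojson(geojson_url, "dyfi_geo_10km.geojson")`, where `geojson_url` is whatever link was found on the event page.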

I know next to nothing about web scraping or HTML, and most of what I've tried is based on things I found here and on YouTube.

Things I've tried:

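I tried using requests to fetch the HTML and parsing it with Beautiful Soup, but the page is generated dynamically, so the HTML that comes back doesn't contain what I want.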

import requests
import bs4 #beautiful soup

res = requests.get('https://earthquake.usgs.gov/earthquakes/eventpage/us7000bi0e/dyfi/intensity')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all('a'):
    print(link)

This prints three links, but not the one I need:

<a href="/earthquakes/feed/">Real-time Notifications, Feeds, and Web Services</a>
<a href="https://angular.io/guide/browser-support">view supported
            browsers</a>
<a href="/earthquakes/feed/">Real-time Notifications, Feeds, and
            Web Services</a>

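I think the USGS site populates the Downloads drop-down menu with JavaScript, which is why the plain requests approach doesn't work, so I thought I'd try Selenium. I was hoping it would give me the HTML I can see with the Inspect Element tool, but I haven't had any luck.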

import bs4  # Beautiful Soup
from selenium import webdriver
path = "/Users/jon/Desktop/selenium_webdriver/chromedriver" #path to chromedriver on my machine
driver = webdriver.Chrome(executable_path=path)
driver.get('https://earthquake.usgs.gov/earthquakes/eventpage/us7000bi0e/dyfi/intensity')
html_eq = driver.page_source
soup = bs4.BeautifulSoup(html_eq, 'html.parser')
for link in soup.find_all('a'):
    print(link) 

This prints many more links than my first attempt, but still not the one I need. Here is the output of my Selenium attempt:

<a _ngcontent-fgi-c8="" class="hazdev-site-logo" href="/" title="U.S. Geological Survey"><img _ngcontent-fgi-c8="" alt="U.S. Geological Survey logo" src="assets/usgs-logo.svg"/></a>
<a _ngcontent-fgi-c8="" class="hazdev-jumplink-navigation" href="#site-sectionnav">Jump to Navigation</a>
<a _ngcontent-fgi-c5="" class="up-one-level ng-star-inserted" href="/earthquakes/map/" templatesidenavigation=""> Latest Earthquakes </a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/executive" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Overview </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/map" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Interactive Map </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/region-info" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Regional Information </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/impact" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Impact </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/tellus" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Felt Report - Tell Us! </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted active-link" href="/earthquakes/eventpage/us7000bi0e/dyfi" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Did You Feel It? </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/technical" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Technical </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/origin" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Origin </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/waveforms" mat-list-item="" routerlinkactive="active-link"><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Waveforms </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/feed/v1.0/detail/us7000bi0e.kml" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Download Event KML </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/map/#%7B%22autoUpdate%22%3Afalse%2C%22basemap%22%3A%22terrain%22%2C%22event%22%3A%22us7000bi0e%22%2C%22feed%22%3A%22us7000bi0e%22%2C%22mapposition%22%3A%5B%5B6.104279985601153%2C-85.06432001439885%5D%2C%5B10.603920014398849%2C-80.56467998560115%5D%5D%2C%22search%22%3A%7B%22id%22%3A%22us7000bi0e%22%2C%22isSearch%22%3Atrue%2C%22name%22%3A%22Search%20Results%22%2C%22params%22%3A%7B%22endtime%22%3A%222020-09-25T17%3A46%3A43.975Z%22%2C%22latitude%22%3A8.3541%2C%22longitude%22%3A-82.8145%2C%22maxradiuskm%22%3A250%2C%22minmagnitude%22%3A2%2C%22starttime%22%3A%222020-08-14T17%3A46%3A43.975Z%22%7D%7D%7D" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> View Nearby Seismicity </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/earthquakes/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Earthquakes </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/hazards/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Hazards </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/data/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Data &amp; Products </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/learn/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Learn </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/monitoring/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Monitoring </div></a>
<a _ngcontent-fgi-c15="" class="mat-list-item ng-star-inserted" href="/research/" mat-list-item=""><div class="mat-list-item-content"><div class="mat-list-item-ripple mat-ripple" mat-ripple=""></div><div class="mat-list-text"></div> Research </div></a>
<a _ngcontent-fgi-c18="" class="tell-us-link" href="/earthquakes/eventpage/us7000bi0e/tellus" queryparamshandling="preserve"> Felt Report - Tell Us! </a>
<a _ngcontent-fgi-c22=""> View all dyfi products (1 total) </a>
<a _ngcontent-fgi-c20="" href="/earthquakes/eventpage/us7000bi0e/dyfi/intensity"> US </a>
<a _ngcontent-fgi-c18="" aria-current="true" aria-disabled="false" class="mat-tab-link ng-star-inserted mat-tab-label-active" href="/earthquakes/eventpage/us7000bi0e/dyfi/intensity" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> Intensity </a>
<a _ngcontent-fgi-c18="" aria-current="false" aria-disabled="false" class="mat-tab-link ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/dyfi/zip" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> ZIP Map </a>
<a _ngcontent-fgi-c18="" aria-current="false" aria-disabled="false" class="mat-tab-link ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/dyfi/intensity-vs-distance" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> Intensity Vs. Distance </a>
<a _ngcontent-fgi-c18="" aria-current="false" aria-disabled="false" class="mat-tab-link ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/dyfi/responses-vs-time" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> Responses Vs. Time </a>
<a _ngcontent-fgi-c18="" aria-current="false" aria-disabled="false" class="mat-tab-link ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/dyfi/responses" mat-tab-link="" queryparamshandling="preserve" routerlinkactive="" tabindex="0"> DYFI Responses </a>
<a _ngcontent-fgi-c28="" class="ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/map?dyfi-responses-10km=true&amp;shakemap-intensity=false"><img _ngcontent-fgi-c28="" alt="DYFI intensity map" src="https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/us7000bi0e_ciim_geo.jpg"/></a>
<a _ngcontent-fgi-c23="" href="/earthquakes/eventpage/us7000bi0e">Overview</a>
<a _ngcontent-fgi-c32="" class="ng-star-inserted" href="/earthquakes/eventpage/us7000bi0e/impact"> Impact Summary </a>
<a _ngcontent-fgi-c18="" href="https://earthquake.usgs.gov/data/dyfi/">Scientific Background for Did You Feel It?</a>
<a href="https://earthquake.usgs.gov/data/comcat/contributor/us/">USGS National Earthquake Information Center, PDE</a>
<a _ngcontent-fgi-c7="" href="/data/comcat/"> ANSS Comprehensive Earthquake Catalog (ComCat) Documentation </a>
<a _ngcontent-fgi-c7="" href="/data/comcat/data-eventterms.php"> Technical terms used on event pages </a>
<a _ngcontent-fgi-c11="" href="mailto:lisa%2Behpweb@usgs.gov">Questions or comments?</a>
<a _ngcontent-fgi-c11="" class="facebook ng-star-inserted" href="https://www.facebook.com/sharer.php?u=https%3A%2F%2Fearthquake.usgs.gov%2Fearthquakes%2Feventpage%2Fus7000bi0e%2Fdyfi%2Fintensity" title="Share using Facebook">Facebook</a>
<a _ngcontent-fgi-c11="" class="twitter ng-star-inserted" href="https://twitter.com/intent/tweet?url=https%3A%2F%2Fearthquake.usgs.gov%2Fearthquakes%2Feventpage%2Fus7000bi0e%2Fdyfi%2Fintensity&amp;text=USGS%20%7C%20M 5.3 - 1 km NNW of Manaca Norte, Panama" title="Share using Twitter">Twitter</a>
<a _ngcontent-fgi-c11="" class="email ng-star-inserted" href="mailto:lisa%2Behpweb@usgs.gov?to=&amp;subject=M 5.3 - 1 km NNW of Manaca Norte, Panama&amp;body=https%3A%2F%2Fearthquake.usgs.gov%2Fearthquakes%2Feventpage%2Fus7000bi0e%2Fdyfi%2Fintensity" title="Share using Email">Email</a>
<a _ngcontent-fgi-c13="" class="ng-star-inserted" href="/"> Home </a>
<a _ngcontent-fgi-c13="" class="ng-star-inserted" href="/aboutus/"> About Us </a>
<a _ngcontent-fgi-c13="" class="ng-star-inserted" href="/contactus/"> Contacts </a>
<a _ngcontent-fgi-c13="" class="ng-star-inserted" href="/legal.php"> Legal </a>

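I found a YouTube tutorial that scrapes with requests_html, which I thought might work: https://www.youtube.com/watch?v=MeBU-4Xs2RU I can get the example from the video working with the beer website he uses, but I haven't been able to apply it to my case.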

Here is the code I tried:

from requests_html import HTMLSession

url_usgs = 'https://earthquake.usgs.gov/earthquakes/eventpage/us7000biji/dyfi/intensity'

s = HTMLSession()  # create the session used for the request below
r_usgs = s.get(url_usgs)

r_usgs.html.render(sleep=1)

downloads = r_usgs.html.xpath('//*[@id="mat-expansion-panel-header-0"]', first=True)
print(downloads.absolute_links)

This returned nothing. I don't know HTML, so I probably picked the XPath of the wrong element.

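If anyone has any ideas on how to get the URL for the 10 km DYFI data from the Downloads menu (https://earthquake.usgs.gov/archive/product/dyfi/us7000biji/us/1601214674370/dyfi_geo_10km.geojson), or can point me to more in-depth material on web scraping, I'd appreciate it.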

1 Answer:

Answer 0: (score: 1)

You need to click on the Downloads menu to expand its content.

from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time


driver = webdriver.Chrome()
driver.get('https://earthquake.usgs.gov/earthquakes/eventpage/us7000bi0e/dyfi/intensity')

# get a reference to the download menu. This will run before the page has 
# finished loading, so we stick it in a while loop and just keep looping
# until we're successful.
while True:
    try:
        download_menu = driver.find_element_by_id('mat-expansion-panel-header-0')
    except NoSuchElementException:
        time.sleep(0.2)
        continue
    else:
        break

# click on the download menu to expand the content
download_menu.click()

while True:
    try:
        downloads = driver.find_element_by_id('cdk-accordion-child-0')
    except NoSuchElementException:
        time.sleep(0.2)
        continue
    else:
        break

links = downloads.find_elements_by_css_selector('a')
geojson = [link for link in links if 'geojson' in link.text.lower()]

for link in geojson:
    print(link.text, ':', link.get_attribute('href'))


driver.close()

Which produces:

GEOJSON 645.0 B : https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/dyfi_zip.geojson
GEOJSON 844.0 B : https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/dyfi_geo_1km.geojson
GEOJSON 1.0 KB : https://earthquake.usgs.gov/archive/product/dyfi/us7000bi0e/us/1601053020563/dyfi_geo_10km.geojson

...and, of course, you can inspect the value of the href attribute to find the 10 km data (by looking for the link whose URL contains 10km).
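As a follow-up sketch (not part of the original answer), the same steps can be written with Selenium's explicit waits instead of hand-written polling loops, which also puts an upper bound on how long the script waits. It uses the `find_element(By...)` style because the `find_element_by_*` helpers were removed in later Selenium 4 releases. The element IDs are the ones from the code above and are assumed to still match the page; `pick_by_substring` and `fetch_download_hrefs` are illustrative names.

```python
from typing import Iterable, List


def pick_by_substring(hrefs: Iterable[str], needle: str = "10km") -> List[str]:
    """Return the hrefs that contain `needle`, e.g. the 10 km geojson link."""
    return [href for href in hrefs if href and needle in href]


def fetch_download_hrefs(url: str, timeout: float = 20.0) -> List[str]:
    """Open `url`, expand the Downloads panel, and return the link hrefs in it."""
    # Imported here so pick_by_substring stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        wait = WebDriverWait(driver, timeout)  # raises TimeoutException on expiry
        # IDs taken from the code above; they may change if the site is rebuilt.
        header = wait.until(
            EC.element_to_be_clickable((By.ID, "mat-expansion-panel-header-0"))
        )
        header.click()
        panel = wait.until(
            EC.presence_of_element_located((By.ID, "cdk-accordion-child-0"))
        )
        links = panel.find_elements(By.CSS_SELECTOR, "a")
        hrefs = (link.get_attribute("href") for link in links)
        return [href for href in hrefs if href]  # drop anchors without an href
    finally:
        driver.quit()  # always shut the browser down
```

With a working Chrome driver, `pick_by_substring(fetch_download_hrefs(page_url))` should then return the `dyfi_geo_10km.geojson` link, provided the page structure is unchanged.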