BeautifulSoup can't parse the content because the page loads too slowly

Time: 2016-05-17 10:18:47

Tags: python parsing web-scraping request beautifulsoup

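I have been trying to parse a table from the URL http://podaac.jpl.nasa.gov/ws/search/granule/index.html, but the table is huge and takes a few milliseconds to load after the site itself has loaded. Since Beautiful Soup works on the first response the website returns, it does not get the complete table, only the table's header.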

from bs4 import BeautifulSoup as bs 
import requests

datasetIds = []
html = requests.get('http://podaac.jpl.nasa.gov/ws/search/granule/index.html')
soup = bs(html.text, 'html.parser')

table = soup.find("table", {"id": "tblDataset"})
print(table)
rows = table.find_all('tr')
rows.remove(rows[0])

for row in rows:
   x = row.find_all('td')
   datasetIds.append(x[1].text.encode('utf-8'))

print(datasetIds)

The code should return the datasetIds from the first table, but it only returns the table's header. Thanks in advance for your help! :)

2 Answers:

Answer 0: (score: 1)

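The page retrieves its data with an ajax request. You can call that request yourself and read the data from the well-formed JSON it returns: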

json = requests.get("http://podaac.jpl.nasa.gov/dmasSolr/solr/dataset/select/?q=*:*&fl=Dataset-PersistentId,Dataset-ShortName-Full&rows=2147483647&fq=DatasetPolicy-AccessType-Full:(OPEN+OR+PREVIEW+OR+SIMULATED+OR+REMOTE)+AND+DatasetPolicy-ViewOnline:Y&wt=json").json()
print(json)

We only need to use a couple of keys:

from pprint import pprint as pp

pp(json["response"]["docs"])

Output snippet:

[{'Dataset-PersistentId': 'PODAAC-MODST-M8D9N',
  'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_8DAY_9KM_NIGHTTIME'},
 {'Dataset-PersistentId': 'PODAAC-MODST-MAN4N',
  'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_ANNUAL_4KM_NIGHTTIME'},
 {'Dataset-PersistentId': 'PODAAC-MODSA-MMO9N',
  'Dataset-ShortName-Full': 'MODIS_AQUA_L3_SST_MID-IR_MONTHLY_9KM_NIGHTTIME'},
 {'Dataset-PersistentId': 'PODAAC-MODST-M1D9N',
  'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_DAILY_9KM_NIGHTTIME'},
 {'Dataset-PersistentId': 'PODAAC-GHMTG-2PN01',
  'Dataset-ShortName-Full': 'NAVO-L2P-AVHRRMTA_G'},
 {'Dataset-PersistentId': 'PODAAC-GHBDM-4FD01',
  'Dataset-ShortName-Full': 'DMI-L4UHfnd-NSEABALTIC-DMI_OI'},
 {'Dataset-PersistentId': 'PODAAC-GHGOY-4FE01',
  'Dataset-ShortName-Full': 'EUR-L4HRfnd-GLOB-ODYSSEA'},
 {'Dataset-PersistentId': 'PODAAC-GHMED-4FE01',
  'Dataset-ShortName-Full': 'EUR-L4UHFnd-MED-v01'},
 {'Dataset-PersistentId': 'PODAAC-NSGDR-L2X02',
  'Dataset-ShortName-Full': 'NSCAT_LEVEL_2_V2'},
 {'Dataset-PersistentId': 'PODAAC-MODST-M1D4N',
  'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_DAILY_4KM_NIGHTTIME'},
 {'Dataset-PersistentId': 'PODAAC-MODSA-MMO4N',
  'Dataset-ShortName-Full': 'MODIS_AQUA_L3_SST_MID-IR_MONTHLY_4KM_NIGHTTIME'},
 {'Dataset-PersistentId': 'PODAAC-MODST-MMO4N',
  'Dataset-ShortName-Full': 'MODIS_TERRA_L3_SST_MID-IR_MONTHLY_4KM_NIGHTTIME'},
 {'Dataset-PersistentId': 'PODAAC-MODSA-MAN9N',
  'Dataset-ShortName-Full': 'MODIS_AQUA_L3_SST_MID-IR_ANNUAL_9KM_NIGHTTIME'},
 {'Dataset-PersistentId': 'PODAAC-MODSA-M8D4N',
  'Dataset-ShortName-Full': 'MODIS_AQUA_L3_SST_MID-IR_8DAY_4KM_NIGHTTIME'},
 {'Dataset-PersistentId': 'PODAAC-MODSA-M1D4N',
  'Dataset-ShortName-Full': 'MODIS_AQUA_L3_SST_MID-IR_DAILY_4KM_NIGHTTIME'},
 {'Dataset-PersistentId': 'PODAAC-GOES3-24HOR',
  'Dataset-ShortName-Full': 'GOES_L3_SST_6km_NRT_SST_24HOUR'},

This gives you every dataset ID and short name pair from the table, with no need for bs4.

To get the IDs, just access each dict with the key Dataset-PersistentId:

for d in json["response"]["docs"]:
    print("ID for {Dataset-ShortName-Full} is {Dataset-PersistentId}".format(**d) )

Some of the output:

ID for OSTM_L2_OST_OGDR_GPS is PODAAC-J2ODR-GPS00
ID for JPL-L4UHblend-NCAMERICA-RTO_SST_Ad is PODAAC-GHRAD-4FJ01
ID for SEAWINDS_BYU_L3_OW_SIGMA0_ANTARCTICA_POLAR-STEREOGRAPHIC_BROWSE_IMAGES is PODAAC-SEABY-ANBIM
ID for SEAWINDS_BYU_L3_OW_SIGMA0_ANTARCTICA_POLAR-STEREOGRAPHIC_BROWSE_MAPS_LITE is PODAAC-SEABY-ANBML
ID for CCMP_MEASURES_ATLAS_L4_OW_L3_5A_5DAY_WIND_VECTORS_FLK is PODAAC-CCF35-01AD5
ID for QSCAT_BYU_L3_OW_SIGMA0_ARCTIC_POLAR-STEREOGRAPHIC_BROWSE_MAPS_LITE is PODAAC-QSBYU-ARBML
ID for MODIS_AQUA_L3_SST_MID-IR_ANNUAL_4KM_NIGHTTIME is PODAAC-MODSA-MAN4N
ID for UCLA_DEALIASED_SASS_L3 is PODAAC-SASSX-L3UCD
ID for NSCAT_LEVEL_1.7_V2 is PODAAC-NSSDR-17X02
ID for NSCAT_LEVEL_3_V2 is PODAAC-NSJPL-L3X02
ID for AVHRR_NAVOCEANO_L3_18km_MCSST_DAYTIME is PODAAC-NAVOC-318DY
ID for QSCAT_L3_OW_JPL_BROWSE_IMAGES is PODAAC-QSXXX-L3BI0
ID for QSCAT_BYU_L3_OW_SIGMA0_ANTARCTICA_POLAR-STEREOGRAPHIC_BROWSE_IMAGES is PODAAC-QSBYU-ANBIM
ID for NAVO-L4HR1m-GLOB-K10_SST is PODAAC-GHK10-41N01
ID for NCDC-L4LRblend-GLOB-AVHRR_AMSR_OI is PODAAC-GHAOI-4BC01
ID for SEAWINDS_LEVEL_3_V2 is PODAAC-SEAXX-L3X02
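
If you want to end up with the datasetIds list from the question, a minimal sketch could look like the following. It assumes the parsed JSON has the response/docs shape used above; the two sample entries are copied from the output snippet so the example runs without a network call, and id_by_short_name is just an illustrative name.

```python
# Two entries copied from the output snippet above, in the same shape as the
# parsed JSON. In practice `data` would be the result of requests.get(...).json().
data = {
    "response": {
        "docs": [
            {"Dataset-PersistentId": "PODAAC-MODST-M8D9N",
             "Dataset-ShortName-Full": "MODIS_TERRA_L3_SST_MID-IR_8DAY_9KM_NIGHTTIME"},
            {"Dataset-PersistentId": "PODAAC-GHMTG-2PN01",
             "Dataset-ShortName-Full": "NAVO-L2P-AVHRRMTA_G"},
        ]
    }
}

docs = data["response"]["docs"]

# Plain list of IDs, as the question asked for.
datasetIds = [d["Dataset-PersistentId"] for d in docs]

# Optional lookup from short name to ID (illustrative helper name).
id_by_short_name = {d["Dataset-ShortName-Full"]: d["Dataset-PersistentId"] for d in docs}

print(datasetIds)
```

The list comprehension replaces the row-by-row td lookup from the question, since each dict already carries the ID under a named key.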

There is a second ajax request that returns more data:

json = requests.get("http://podaac.jpl.nasa.gov/dmasSolr/solr/granule/select/?q=*&fq=Granule-AccessType:(OPEN+OR+PREVIEW+OR+SIMULATED+OR+REMOTE)+AND+Granule-Status:ONLINE&facet=true&facet.field=Dataset-ShortName-Full&rows=0&facet.limit=-1&facet.mincount=1&wt=json").json()
from pprint import pprint as pp

pp(json)
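
The response to this request is not shown here, so treat the following as a sketch under an assumption: that the server uses Solr's default JSON facet layout, where each facet field is a flat list alternating name and count. The helper name facet_counts and the counts in the sample are made up for illustration.

```python
def facet_counts(data, field="Dataset-ShortName-Full"):
    """Turn a flat [name, count, name, count, ...] facet list into a dict.

    Assumes Solr's default flat facet layout; adjust if the real response differs.
    """
    flat = data["facet_counts"]["facet_fields"][field]
    # Even positions hold the names, odd positions hold the matching counts.
    return dict(zip(flat[0::2], flat[1::2]))


# Placeholder response: the short names appear in the output above, but the
# counts are invented purely to show the shape.
sample = {
    "facet_counts": {
        "facet_fields": {
            "Dataset-ShortName-Full": ["NSCAT_LEVEL_2_V2", 3, "NSCAT_LEVEL_3_V2", 5]
        }
    }
}

print(facet_counts(sample))
```

With the real response you would pass the parsed JSON from the request above instead of sample.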

You can also change some of the parameters to get different output.
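
As a rough sketch of what changing the parameters might look like, the query string of the first request can be built from a dict so that individual values such as rows or fl are easy to adjust. The helper build_dataset_url and its arguments are hypothetical names, not part of any API. Only the standard library is used, and no request is sent.

```python
from urllib.parse import urlencode

BASE_URL = "http://podaac.jpl.nasa.gov/dmasSolr/solr/dataset/select/"


def build_dataset_url(rows=2147483647,
                      fields=("Dataset-PersistentId", "Dataset-ShortName-Full")):
    """Build the dataset query URL (hypothetical helper for illustration)."""
    params = {
        "q": "*:*",
        "fl": ",".join(fields),  # which fields each doc should contain
        "rows": rows,            # maximum number of docs to return
        "fq": ("DatasetPolicy-AccessType-Full:"
               "(OPEN OR PREVIEW OR SIMULATED OR REMOTE)"
               " AND DatasetPolicy-ViewOnline:Y"),
        "wt": "json",
    }
    # Leave Solr's punctuation unescaped for readability; spaces become '+'.
    return BASE_URL + "?" + urlencode(params, safe="*:(),")


print(build_dataset_url(rows=10))
```

Called with its defaults, the helper reproduces the URL of the first request, so the result can be passed to requests.get(...).json() in the same way.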

Answer 1: (score: 0)

As @Hassan Mehmood mentioned, you have to use selenium (or any other headless browser) for this, because the table is generated with JavaScript. BeautifulSoup does not evaluate JavaScript, so it cannot be used to get the desired data.

You can use this as a starting point:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

log = logging.getLogger(__name__)

logging.getLogger('selenium').setLevel(logging.WARNING)
logging.getLogger('requests').setLevel(logging.WARNING)


def test():
    url = 'http://podaac.jpl.nasa.gov/ws/search/granule/index.html'
    wait_for_element = 30

    s = webdriver.PhantomJS()
    s.set_window_size(1274, 826)
    s.set_page_load_timeout(45)
    s.get(url)

    WebDriverWait(s, wait_for_element).until(
        EC.presence_of_element_located((By.CLASS_NAME, "detailTABLE")))

    datasets = s.find_elements_by_class_name("detailTABLE")

    for item in datasets:
        print(item.text)

if __name__ == '__main__':
    test()
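
To see why a plain parser only finds the header, here is a small sketch using the standard library's html.parser. The markup is a made-up stand-in for what the server sends (the column names are assumptions, not taken from the real page): the table contains just its header row, and the script that would add the data rows is never executed by the parser.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the static markup: only the header row exists,
# and a script (which a parser never runs) would add the data rows later.
STATIC_HTML = """
<table id="tblDataset">
  <tr><th>Dataset ID</th><th>Short Name</th></tr>
</table>
<script>
  // In a browser, code here would fetch JSON and append the table rows.
</script>
"""


class RowCounter(HTMLParser):
    """Count the <tr> start tags present in the markup."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1


counter = RowCounter()
counter.feed(STATIC_HTML)
print(counter.rows)  # only the header row is in the static markup
```

A parser sees whatever text it is given, so the data rows have to come either from a browser that runs the script or from the request the script itself makes.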