BeautifulSoup recursion

Date: 2016-11-06 16:14:58

Tags: python, python-2.7, beautifulsoup

I want to retrieve the URLs of a web page recursively and put the results in a list.

This is the code I am using:

catalog_url = "http://nomads.ncep.noaa.gov:9090/dods/gfs_0p25/"

from bs4 import BeautifulSoup #  conda install -c asmeurer beautiful-soup=4.3.2 
import urllib2
from datetime import datetime

html_page = urllib2.urlopen(catalog_url)
soup = BeautifulSoup(html_page)

urls_day = []
for link in soup.findAll('a'):
    if datetime.today().strftime('%Y') in link.get('href'): # String contains today's year in name
        print link.get('href')
        urls_day.append(link.get('href'))

urls_final = []
for run in urls_day:
    html_page2 = urllib2.urlopen(run)
    soup2 = BeautifulSoup(html_page2)
    for links in soup2.findAll('a'):
        if datetime.today().strftime('%Y') in soup2.get('a'):
            print links.get('href')
            urls_final.append(links.get('href'))

In the first loop I get the URLs from catalog_url. urls_day is a list object whose URLs contain the current year as a string.

The second loop fails with the following output:

<a href="http://nomads.ncep.noaa.gov:9090/dods">GrADS Data Server</a>
Traceback (most recent call last):
  File "<stdin>", line 6, in <module>
TypeError: argument of type 'NoneType' is not iterable

urls_final should be a list object containing the URLs I am interested in.

Any ideas how to fix this? I have looked at similar Beautiful Soup posts about recursion, but I always end up with the same 'NoneType' error.

1 Answer:

Answer 0 (score: 0):

You should check whether the returned value is NoneType before making the recursive call. I wrote an example that you can improve on.

from bs4 import BeautifulSoup
from datetime import datetime
import urllib2

CATALOG_URL = "http://nomads.ncep.noaa.gov:9090/dods/gfs_0p25/"

today = datetime.today().strftime('%Y')

cache = {}  # keeps track of URLs that have already been visited


def cached(func):
    """Decorator that calls func(url) only if the URL has not been seen yet."""
    def wraps(url):
        if url not in cache:
            cache[url] = True
            return func(url)
    return wraps


@cached
def links_from_url(url):
    """Return the set of hrefs on the page that contain the current year,
    or the URL itself if no such links are found (a leaf page)."""
    html_page = urllib2.urlopen(url)
    soup = BeautifulSoup(html_page, "lxml")
    # Skip anchors without an href: 'today in None' would raise
    # TypeError: argument of type 'NoneType' is not iterable
    s = set(link.get('href') for link in soup.findAll('a')
            if link.get('href') and today in link.get('href'))
    return s if len(s) else url


def crawl(links):
    if not links:  # Checking for NoneType (and empty results)
        return
    if type(links) is str:
        return links  # A single URL: nothing left to descend into
    if len(links) > 1:
        # Recurse into every link found on the current page
        return [crawl(links_from_url(link)) for link in links]


if __name__ == '__main__':
    crawl(links_from_url(CATALOG_URL))
    print cache.keys()
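
As a side note, the TypeError in the question most likely comes from soup2.get('a'): calling .get() on the soup object looks up an HTML attribute named 'a' on the document (there is none), so it returns None, and the membership test against None then fails; anchors without an href attribute can trigger the same error. Below is a minimal, non-recursive sketch of the second loop with those two issues addressed, reusing the variable names from the question:

urls_final = []
for run in urls_day:
    soup2 = BeautifulSoup(urllib2.urlopen(run))
    for link in soup2.findAll('a'):
        href = link.get('href')  # may be None for anchors without an href
        if href and datetime.today().strftime('%Y') in href:
            print href
            urls_final.append(href)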