Python web scraping fails randomly

Time: 2017-05-19 20:20:59

Tags: python web-scraping beautifulsoup python-requests

I'm trying to build a site map for the following website:

http://aogweb.state.ak.us/WebLink/0/fol/12497/Row1.aspx

The code first works out how many pages exist at the top directory level and stores each page number with its corresponding link. It then walks through each of those pages and builds a dictionary of every 3-digit file value and that value's corresponding link. From there, the code builds another dictionary of pages and links for each 3-digit directory (this is where I'm stuck). Once that's done, the goal is to build a dictionary of every 6-digit file number and its corresponding link.
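
For reference, the dictionaries described above are meant to end up shaped roughly like this (the folder numbers and links below are invented purely for illustration):

# Illustrative only -- these entries are made up to show the intended shape.
pagesTopDic = {'1': '/WebLink/0/fol/12497/Row1.aspx'}         # top-level page -> link
dig3Dic = {'001': '0/fol/10001/Row1.aspx'}                    # 3-digit folder -> link
pages3Dic = {'001': {'1': '/WebLink/0/fol/10001/Row1.aspx'}}  # folder -> {page -> link}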

However, the code fails at random points during the scrape with the following error message:

Traceback (most recent call last):
  File "C:\Scraping_Test.py", line 76, in <module>
    totalPages = totalPages.text
AttributeError: 'NoneType' object has no attribute 'text'
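
The traceback means that soup.find('div', {"class": "PageXofY"}) returned None for that request, i.e. the expected div was missing from the response, so calling .text on the result fails. A minimal guard-and-retry sketch (the helper name, retry count, and error message here are my own assumptions, not part of the original code):

import time
import bs4 as bs
import requests


def get_total_pages(url, attempts=3):
    # Hypothetical helper: fetch url and return the 'Page X of Y' text,
    # re-requesting when the div is absent instead of calling .text on None.
    for _ in range(attempts):
        r = requests.get(url)
        soup = bs.BeautifulSoup(r.content, 'lxml')
        div = soup.find('div', {'class': 'PageXofY'})
        if div is not None:          # only dereference .text when find() succeeded
            return div.text
        time.sleep(5)                # same 5-second pause the script already uses
    raise RuntimeError('PageXofY div not found at ' + url)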

Sometimes the code doesn't even run properly and just skips straight to the end of the program without any error.

I'm currently running Python 3.6.0 in Visual Studio Community 2015 with all libraries up to date. Any help is appreciated, as I'm new to programming.

import bs4 as bs
import requests
import re
import time



def stop():
    # pause 5 seconds between requests
    print('sleep 5 sec')
    time.sleep(5)

url0 = 'http://aogweb.state.ak.us'
url1 = 'http://aogweb.state.ak.us/WebLink/'


r = requests.get('http://aogweb.state.ak.us/WebLink/0/fol/12497/Row1.aspx')
soup = bs.BeautifulSoup(r.content, 'lxml')
print('Status: ' + str(r.status_code))
stop()

pagesTopDic = {}
pagesTopDic['1'] = '/WebLink/0/fol/12497/Row1.aspx'
dig3Dic = {}
for link in soup.find_all('a'):             #find top pages
    if link.get('title') is not None:
        if 'page' in link.get('title').lower():
            page = link.get('title')
            page = page.split(' ')[1]
            #print(page)
            pagesTopDic[page] = link.get('href')

listKeys = pagesTopDic.keys()

for page in listKeys:                       # on each page, find the links for the 3-digit folders
    url = url0 + pagesTopDic[page]
    r = requests.get(url)
    soup = bs.BeautifulSoup(r.content, 'lxml')
    print('Status: ' + str(r.status_code))
    stop()

    for link in soup.find_all('a'):
        if not link.get("aria-label") is None:
            folder = link.get("aria-label")
            folder = folder.split(' ')[0]
            dig3Dic[folder] = link.get('href')

listKeys = dig3Dic.keys()
pages3Dic = {}
for entry in listKeys:                      # find the pages for each 3-digit folder
    print(entry)
    url = url1 + dig3Dic[entry]
    r = requests.get(url)
    soup = bs.BeautifulSoup(r.content, 'lxml')
    print('Status: ' + str(r.status_code))
    stop()

    tmpDic = {}
    tmpDic['1'] = '/Weblink/' + dig3Dic[entry]


    totalPages = soup.find('div',{"class": "PageXofY"})     # intermittently returns None
    print(totalPages)
    totalPages = totalPages.text                # AttributeError is raised here when find() returned None
    print(totalPages)
    totalPages = totalPages.split(' ')[3]       # 'Page X of Y' -> take Y
    print(totalPages)
    while len(tmpDic.keys()) < int(totalPages):     # keep requesting until every page link is collected

        r = requests.get(url)
        soup = bs.BeautifulSoup(r.content, 'lxml')
        print('Status: ' + str(r.status_code))
        stop()

        for link in soup.find_all('a'):             #find top pages
            if link.get('title') is not None:
                #print(link.get('title'))
                if 'Page' in link.get('title'):
                    page = link.get('title')
                    page = page.split(' ')[1]
                    tmpDic[page] = link.get('href')
        num = len(tmpDic.keys())
        url = url0 + tmpDic[str(num)]       # follow the link for the highest page collected so far

    print()
    pages3Dic[entry] = tmpDic
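
For what it's worth, with a guard like the get_total_pages sketch above, the fragile block inside the folder loop would shrink to something like this (again an assumption, not the original code):

totalPages = get_total_pages(url)        # retries instead of crashing when the div is missing
totalPages = totalPages.split(' ')[3]    # 'Page X of Y' -> take Y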

0 Answers