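Strange BeautifulSoup NoneType error

Asked: 2013-11-29 21:41:30

Tags: python memory beautifulsoup nonetype

I wrote a scraper that works well for fetching all the courses at my university (so I can filter them later), but it sometimes fails unexpectedly with a strange error such as `AttributeError: 'NoneType' object has no attribute 'findAll'`. If I go to another long page, I get a similar error.

My code: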

from bs4 import BeautifulSoup
import urllib2
import datetime
import httplib
from math import floor
from random import randrange
import cPickle as pickle
[...irrelevant code...]
urls = ["http://locus.vub.ac.be/reporting/spreadsheet?identifier=DA&submit=toon%20de%20gegevens%20-%20show%20the%20teaching%20activities&idtype=name&template=Mod%2bSS&objectclass=module%2bgroup", "http://locus.vub.ac.be/reporting/spreadsheet?identifier=AL+tot+AP&submit=toon+de+gegevens+-+show+the+teaching+activities&idtype=name&template=Mod%2BSS&objectclass=module%2Bgroup"]
for url in urls:
    url = urllib2.urlopen(url).read()
    soup = BeautifulSoup(url)
    begins = soup.findAll("span", {"class" : "label-1-0-0"})
    for begin in begins:
        table = begin.findNext("table", {"class" : "spreadsheet"})
        #if table is not None:
        gegevens = table.findAll("tr")
        for i in range (1, len(gegevens)):
            naam = gegevens[i].td
            dag = naam.find_next_sibling("td")
            beginuur = dag.find_next_sibling("td")
            einduur = beginuur.find_next_sibling("td")
            duur = einduur.find_next_sibling("td")
            weken = duur.find_next_sibling("td")
            titularis = weken.find_next_sibling("td")
            lokaal = titularis.find_next_sibling("td")
            print naam.text + " " + dag.text + " " + beginuur.text + " " + einduur.text + " " + weken.text + " " + titularis.text + " " + lokaal.text

Output for link 1:

[...]
Discrete wiskunde (HOC) ma 18:00 21:00 4, 8, 11, 13 CARA PHILIPPE F.4.111
Discrete wiskunde (WPO2) ma 13:00 15:00 3-6, 8, 10-12, 14 Deneckere Tom E.0.12
Discrete wiskunde (HOC) wo 9:00 11:00 2-3, 6, 8-9, 11-14 CARA PHILIPPE E.0.07
Traceback (most recent call last):
  File "Untitled 7.py", line 24, in <module>
    titularis = weken.find_next_sibling("td")
AttributeError: 'NoneType' object has no attribute 'find_next_sibling'

Output for link 2:

[...]
Algemeen boekhouden - WPO - TEW - groep 5 (E-M) ma 9:00 11:00 5-6 VANDENHAUTE Marie-Laure D.3.04
Algemeen boekhouden - WPO - HI - groep 1 (A-D) di 14:00 16:00 3-14 VANDENHAUTE Marie-Laure D.2.09
Algemeen boekhouden - WPO - HI - groep 3 (Q-Z) ma 9:00 11:00 3-8, 10-14 CEUSTERMANS Stefanie D.2.10
Algemeen boekhouden - WPO - HI - groep 2 (E-P) di 9:00 11:00 3-8, 10-11, 13-14 VANDENHAUTE Marie-Laure D.3.05
Approaches to language teaching & learning for multilingual education HOC- wo 10:00 12:00 2-9, 11-14 VAN DE CRAEN PIERRE E.3.05
Traceback (most recent call last):
  File "Untitled 7.py", line 16, in <module>
    gegevens = table.findAll("tr")
AttributeError: 'NoneType' object has no attribute 'findAll'

Edit: Replacing `soup = BeautifulSoup(url)` with `soup = BeautifulSoup(url, "xml")` (and importing the lxml library) solved the problem. I don't know why...
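
Both tracebacks above come from a lookup that returned None (no matching table after a label, or a row with fewer cells than the loop expects). Different parsers can build different trees from the same imperfect markup, which may be why switching parsers changed the result here; that is a guess, not something confirmed in this thread. Independent of the parser, the lookups can be guarded. A minimal sketch, assuming Python 3 and bs4 with the built-in "html.parser"; SAMPLE_HTML, EXPECTED_CELLS and extract_rows are hypothetical names made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: the second label has no table after it,
# and the last row of the first table is missing most of its cells.
SAMPLE_HTML = """
<span class="label-1-0-0">Course A</span>
<table class="spreadsheet">
  <tr><td>Name</td><td>Day</td><td>Start</td></tr>
  <tr><td>Discrete wiskunde (HOC)</td><td>ma</td><td>18:00</td></tr>
  <tr><td>Broken row</td></tr>
</table>
<span class="label-1-0-0">Course B</span>
"""

# Assumption for this sample only; the loop in the question reads 8 cells per row.
EXPECTED_CELLS = 3


def extract_rows(html, parser="html.parser"):
    """Collect cell texts per row, skipping anything that is missing."""
    soup = BeautifulSoup(html, parser)
    rows = []
    for label in soup.find_all("span", {"class": "label-1-0-0"}):
        table = label.find_next("table", {"class": "spreadsheet"})
        if table is None:
            # No table follows this label: skip it instead of crashing.
            continue
        # Skip the header row, as the original loop does with range(1, ...).
        for tr in table.find_all("tr")[1:]:
            cells = tr.find_all("td")
            if len(cells) < EXPECTED_CELLS:
                # Incomplete row: chained find_next_sibling calls would hit None here.
                continue
            rows.append([c.get_text(strip=True) for c in cells[:EXPECTED_CELLS]])
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_HTML):
        print(" ".join(row))
```

Collecting all cells of a row with find_all("td") and checking the count replaces the chain of find_next_sibling calls, so a short row is skipped rather than raising AttributeError.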

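1 answer:

Answer 0 (score: 0)

It looks like the error comes from urllib2.urlopen. You should make sure the page can actually be fetched from the server, or handle the exceptions properly.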
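
One way to handle fetch failures is sketched below. It is written for Python 3, where urllib2 was split into urllib.request and urllib.error; fetch_page and its timeout default are hypothetical choices for illustration, not part of the original script:

```python
import urllib.error
import urllib.request


def fetch_page(url, timeout=10):
    """Return the response body as bytes, or None if the fetch failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (404, 500, ...).
        print("HTTP error %s for %s" % (err.code, url))
    except urllib.error.URLError as err:
        # The server could not be reached or the resource could not be opened.
        print("Could not open %s: %s" % (url, err.reason))
    except OSError as err:
        # Other I/O failures, for example a timeout while reading the body.
        print("I/O error for %s: %s" % (url, err))
    return None
```

The caller can then skip a URL when fetch_page returns None instead of passing an empty or missing page to BeautifulSoup.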