Question

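I know KeyErrors are fairly common with BeautifulSoup and, before you yell RTFM at me, I have done extensive reading in both the Python documentation and the BeautifulSoup documentation. With that out of the way, I still have no idea what is going on with these KeyErrors.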

Here is the program I am trying to run. It consistently results in a KeyError on the last element of the URLs list.

Just so you know, I come from a C++ background, but I need to use BeautifulSoup for work. Doing this in C++ would be a nightmare!

The idea is to return a list of all URLs in a website whose pages contain a link to a certain URL.

Here is what I have so far:

import urllib
from BeautifulSoup import BeautifulSoup

URLs = []
Locations = []
URLs.append("http://www.tuftsalumni.org")

def print_links (link):
    if (link.startswith('/') or link.startswith('http://www.tuftsalumni')):
        if (link.startswith('/')):
            link = "STARTING_WEBSITE" + link
        print (link)
        htmlSource = urllib.urlopen(link).read(200000)
        soup = BeautifulSoup(htmlSource)
        for item in soup.fetch('a'):
            if (item['href'].startswith('/') or 
                "tuftsalumni" in item['href']):
                URLs.append(item['href'])
            length = len(URLs)
            if (item['href'] == "SITE_ON_PAGE"):
                if (check_list(link, Locations) == "no"):
                    Locations.append(link)



def check_list (link, array):
    for x in range (0, len(array)):
        if (link == array[x]):
            return "yes"
    return "no"

print_links(URLs[0])

for x in range (0, (len(URLs))):
    print_links(URLs[x])

The error I get is on the last element of URLs:

File "scraper.py", line 35, in <module>
    print_links(URLs[x])
  File "scraper.py", line 16, in print_links
    if (item['href'].startswith('/') or 
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/BeautifulSoup.py", line 613, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'

I know I need to use get() to handle the KeyError default case. I have absolutely no idea how to actually do that, despite a full hour of searching.

Thank you, and if I can clarify anything, please let me know.

Answer 1

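The KeyError: 'href' in your traceback means that at least one <a> tag on the page has no href attribute. If you just want to handle the error, you can catch the exception: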

    for item in soup.fetch('a'):
        try:
            if (item['href'].startswith('/') or "tuftsalumni" in item['href']):
                (...)
        except KeyError:
            pass # or some other fallback action

You can specify a default value using item.get('key','default'), but I don't think you need to do that in this case.
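For anyone who does want the get() route, here is a minimal sketch of the idea. It is not BeautifulSoup code: to stay runnable without third-party packages it uses the standard library's html.parser, where the attributes of a tag can be turned into a plain dict. The class name HrefCollector and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects matching href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs. As a dict, .get() returns
        # None instead of raising KeyError when 'href' is missing; the
        # "or ''" also covers an href attribute written without a value.
        href = dict(attrs).get("href") or ""
        if href.startswith("/") or "tuftsalumni" in href:
            self.hrefs.append(href)


# Made-up sample page: the first anchor has no href at all.
sample_html = (
    '<a name="top">Top</a>'
    '<a href="/events">Events</a>'
    '<a href="http://example.com/">Other</a>'
)

collector = HrefCollector()
collector.feed(sample_html)
print(collector.hrefs)
```

In the loop from the question, the same pattern would presumably be item.get('href', '') in place of item['href'], so an anchor without href yields an empty string and simply fails both tests.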

Edit: If all else fails, here is a bare-bones version that should be a reasonable starting point:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import urllib
from BeautifulSoup import BeautifulSoup

links = ["http://www.tuftsalumni.org"]

def print_hrefs(link):
    htmlSource = urllib.urlopen(link).read()
    soup = BeautifulSoup(htmlSource)
    for item in soup.fetch('a'):
        # item['href'] raises KeyError for an <a> tag with no href attribute
        print item['href']

for link in links:
    print_hrefs(link)

Also, check_list(item, l) can be replaced with item in l.
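A small sketch of that replacement, assuming the list only ever holds plain strings. The helper name add_unique and the sample URLs are invented for the example.

```python
def add_unique(link, locations):
    """Append link to locations only if it is not already present."""
    # "link not in locations" does the same linear search as check_list,
    # but yields a real boolean instead of the strings "yes" / "no".
    if link not in locations:
        locations.append(link)
    return locations


locations = []
add_unique("http://www.tuftsalumni.org/a", locations)
add_unique("http://www.tuftsalumni.org/a", locations)  # duplicate, ignored
add_unique("http://www.tuftsalumni.org/b", locations)
print(locations)
```

If the list grows large, a set would make the membership test faster, at the cost of losing insertion order.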
