Question

I'm writing my first "real" project, a web crawler, and I don't know how to fix this error. Here is my code:

55 AA

Here is the error:

TypeError: must be str, not NoneType

Answer 1

The first "a" link on the Wikipedia page is:

<a id="top"></a>

So link.get("href") returns None, because the tag has no href.

To fix this, check for None first:

if link.get('href') is not None:
    href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
    # do stuff here
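
To see why the check is needed, here is a minimal sketch that reproduces the error without any network access. A plain dict stands in for the tag's attributes (an assumption for illustration; `attrs` and `error_name` are hypothetical names), since `dict.get`, like `link.get`, gives None for a missing key. The exact wording of the message depends on the Python version, so the sketch only inspects the exception type.

```python
BASE = "https://en.wikipedia.org/wiki/Star_Wars"

# Hypothetical stand-in for the attributes of <a id="top"></a>
attrs = {"id": "top"}

href = attrs.get("href")  # the key is missing, so this is None
print(href is None)       # True

try:
    BASE + href           # str + None is not defined
except TypeError as exc:
    error_name = type(exc).__name__
    print(error_name)     # TypeError
```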

Answer 2

Not all anchors (<a> elements) need an href attribute (see https://www.w3schools.com/tags/tag_a.asp):

In HTML5, the <a> tag is always a hyperlink, but if it does not have an href attribute, it is only a placeholder for a hyperlink.

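You are in fact already getting an exception, and Python is very good at handling exceptions, so why not catch it? This style is called "Easier to ask for forgiveness than permission" (EAFP) and is actually encouraged: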

import requests
from bs4 import BeautifulSoup

def main_spider(max_pages):
    for page in range(1, max_pages+1):
        url = "https://en.wikipedia.org/wiki/Star_Wars" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"):
            # The following part is new:
            try:
                href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
                print(href)
            except TypeError:
                pass

main_spider(1)

Also, the page = 1 and page += 1 lines can be omitted. The for page in range(1, max_pages+1): statement is enough on its own.
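
A small sketch of why those two lines are redundant, assuming they sat before and inside the for loop (the question's code is not preserved here, so the placement is a guess, and both function names are hypothetical). The for statement rebinds `page` from `range` on every iteration, so a manual increment has no effect on which pages are visited.

```python
def visited_with_counter(max_pages):
    """Loop that still carries the redundant counter lines."""
    visited = []
    page = 1  # redundant: the for statement assigns page itself
    for page in range(1, max_pages + 1):
        visited.append(page)
        page += 1  # no effect: range supplies the next value anyway
    return visited


def visited_plain(max_pages):
    """Same loop without the counter lines."""
    visited = []
    for page in range(1, max_pages + 1):
        visited.append(page)
    return visited


print(visited_with_counter(3))  # [1, 2, 3]
print(visited_plain(3))         # [1, 2, 3]
```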

Answer 3

As @Shiping mentioned, your code was not indented correctly; I have corrected it below. Also, link.get('href') does not return a string in one of the cases.

import requests
from bs4 import BeautifulSoup

def main_spider(max_pages):
    for page in range(1, max_pages+1):
        url = "https://en.wikipedia.org/wiki/Star_Wars" + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"): 

            href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
            print(href)

main_spider(1)

To find out what was happening, I added a few lines between the existing ones and (temporarily) removed the offending line.

        soup = BeautifulSoup(plain_text, "html.parser")
        print('All anchor tags:', soup.findAll('a'))     ### ADDED
        for link in soup.findAll("a"): 
            print(type(link.get("href")), link.get("href"))  ### ADDED

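The output of the added lines is this (truncated for brevity). Note: the first anchor has no href attribute, so link.get('href') has no value to return and returns None.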

[<a id="top"></a>, <a href="#mw-head">navigation</a>, 
<a href="#p-search">search</a>, 
<a href="/wiki/Special:SiteMatrix" title="Special:SiteMatrix">sister...   
<class 'NoneType'> None
<class 'str'> #mw-head
<class 'str'> #p-search
<class 'str'> /wiki/Special:SiteMatrix
<class 'str'> /wiki/File:Wiktionary-logo-v2.svg      
...
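
The same None-then-str pattern can be reproduced offline. The sketch below swaps BeautifulSoup for the standard library's `html.parser` so that it runs without extra packages or a network connection; `SAMPLE_HTML` and `AnchorCollector` are hypothetical names, and the sample contains only the first three anchors from the output above.

```python
from html.parser import HTMLParser

# First three anchors copied from the output above.
SAMPLE_HTML = (
    '<a id="top"></a>'
    '<a href="#mw-head">navigation</a>'
    '<a href="#p-search">search</a>'
)


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> start tag, or None when it is absent."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs; dict.get gives None
            # when the anchor has no href attribute.
            self.hrefs.append(dict(attrs).get("href"))


collector = AnchorCollector()
collector.feed(SAMPLE_HTML)
for href in collector.hrefs:
    print(type(href), href)
```

The first line printed is `<class 'NoneType'> None`, followed by the two string hrefs.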

To prevent the error, one possible solution is to add either a conditional or a try/except block to the code. I will demonstrate the conditional.

        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"): 
            if link.get('href') is None:
                continue
            else:
                href = "https://en.wikipedia.org/wiki/Star_Wars" + link.get("href")
                print(href)
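
For comparison, here is a sketch of both options side by side, the conditional and the try/except. Plain dicts stand in for the parsed anchors (an assumption so the example runs without requests or BeautifulSoup), and the function names are hypothetical.

```python
BASE = "https://en.wikipedia.org/wiki/Star_Wars"

# Hypothetical stand-ins for parsed anchors; a real run would loop over
# soup.findAll("a") instead.
links = [{"id": "top"}, {"href": "#mw-head"}, {"href": "#p-search"}]


def hrefs_conditional(links):
    """Check for None before concatenating."""
    result = []
    for link in links:
        href = link.get("href")
        if href is None:
            continue
        result.append(BASE + href)
    return result


def hrefs_try_except(links):
    """Attempt the concatenation and skip anchors that raise TypeError."""
    result = []
    for link in links:
        try:
            result.append(BASE + link.get("href"))
        except TypeError:
            continue
    return result


print(hrefs_conditional(links) == hrefs_try_except(links))  # True
```

Both versions skip the anchor without an href. The conditional calls `get` once and keeps the happy path free of exception handling, while the try/except version is shorter.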

Answer 4

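I got the same error from different code. After adding a condition inside a function, I thought the return type was set incorrectly, but then I realized that when the condition was False the return statement was never reached at all. Changing my indentation fixed the problem.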

TypeError: must be str, not NoneType
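
A minimal sketch of that indentation pitfall, using hypothetical functions (the answer's actual code is not shown). When the return statement is indented under the if, a False condition makes the function fall off the end and return None, which then fails as soon as it is concatenated with a str.

```python
def describe_broken(href):
    text = "link: "
    if href.startswith("/wiki/"):
        text += "internal "
        return text + href  # indented under the if: skipped when it is False
    # falling off the end here returns None implicitly


def describe_fixed(href):
    text = "link: "
    if href.startswith("/wiki/"):
        text += "internal "
    return text + href      # dedented: runs on every call


print(describe_broken("#top"))  # None
print(describe_fixed("#top"))   # link: #top
```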
