Question

I'm trying to write a simple program that extracts the words from the paragraphs of a web page. My code looks like this:

import requests
from bs4 import BeautifulSoup
import operator

def start(url):
    word_list = []
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code)
    for post_text in soup.find_all('p'):
        cont = post_text.string
        words = cont.lower().split()
        for each_word in words:
            print(each_word)
            word_list.append(each_word)


start('https://lifehacker.com/why-finding-your-passion-isnt-enough-1826996673')

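First, I get this warning: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 17 of the file D:/Projects/Crawler/Main.py. To get rid of this warning, change code that looks like this: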

BeautifulSoup(YOUR_MARKUP})

to this:

 BeautifulSoup(YOUR_MARKUP, "html.parser")

  markup_type=markup_type))

And then, finally, this error:

Traceback (most recent call last):
  File "D:/Projects/Crawler/Main.py", line 17, in <module>
    start('https://lifehacker.com/why-finding-your-passion-isnt-enough-1826996673')
  File "D:/Projects/Crawler/Main.py", line 11, in start

    words = cont.lower().split()

AttributeError: 'NoneType' object has no attribute 'lower'

I tried searching, but I could neither fix the problem nor understand it.

Answer 1

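You are parsing the page by its paragraph tags <p>, but a <p> tag does not always have text content associated with it. For example, if you were to run: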

def start(url):
    word_list = []
    source_code = requests.get(url).text
    # Naming the parser explicitly also silences the UserWarning.
    soup = BeautifulSoup(source_code, 'html.parser')
    for post_text in soup.find_all('p'):
        print(post_text)

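you would see that you also get hits from ads, such as <p class="ad-label=bottom"></p>. For an empty tag like that, post_text.string is None, so cont.lower() fails. As others have pointed out in the comments, None has no string methods, and that is exactly what your error is telling you.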
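To make the None case concrete, here is a minimal sketch that runs on a small hand-written HTML string instead of the real page. The markup below is made up for illustration. As far as I know, .string is also None when a tag has more than one child, so ordinary paragraphs containing links or bold text are affected too, not just empty ad placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example (not taken from the real page).
html = (
    '<p>plain text</p>'
    '<p class="ad-label"></p>'
    '<p>mixed <b>content</b></p>'
)

soup = BeautifulSoup(html, "html.parser")

# .string is only set when the tag has exactly one child.
strings = [p.string for p in soup.find_all("p")]

for p, s in zip(soup.find_all("p"), strings):
    print(repr(s), "<-", p)
```

Only the first paragraph yields a string here; the other two give None, and calling .lower() on either of them would raise the same AttributeError.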

A simple way to guard against this is to wrap that part of the function in a try/except block:

for post_text in soup.find_all('p'):
    try:
        cont = post_text.string
        words = cont.lower().split()
        for each_word in words:
            print(each_word)
            word_list.append(each_word)
    except AttributeError:
        pass
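The try/except above skips every paragraph whose .string is None, which can silently drop paragraphs that do contain text mixed with other tags. As an alternative sketch (not part of the original answer), get_text() returns the concatenated text of a tag and an empty string for an empty tag, so no exception handling is needed. The function name extract_words and the sample markup are assumptions made for this example.

```python
from bs4 import BeautifulSoup


def extract_words(html):
    """Return the lowercased words found in all <p> tags of an HTML string."""
    word_list = []
    # Passing the parser explicitly avoids the "No parser was explicitly
    # specified" warning.
    soup = BeautifulSoup(html, "html.parser")
    for post_text in soup.find_all("p"):
        # get_text() never returns None; an empty tag yields "".
        cont = post_text.get_text()
        word_list.extend(cont.lower().split())
    return word_list


# Hypothetical markup: one paragraph with nested tags, one empty ad placeholder.
sample = '<p>Hello <b>World</b></p><p class="ad-label"></p>'
words = extract_words(sample)
print(words)
```

To use it on a live page, pass in the text of the response, for example extract_words(requests.get(url).text).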
