Beautiful Soup: NoneType object has no attribute 'lower'

Time: 2015-11-30 12:13:22

Tags: python beautifulsoup

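So the error I'm getting is:

'NoneType' object has no attribute 'lower'

The thing is, it worked before I created the second function, but now it's being temperamental. I just started using PyCharm, so I'm pretty new to the scene.

Here is my code: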

import requests
import sys
from bs4 import BeautifulSoup
import operator

def start(url):
    word_list = []
    source_code = requests.get(url).text
    soup = BeautifulSoup(source_code, 'html.parser')
    for post_text in soup.find_all('p'):
        content = post_text.string
        words = content.lower().split()
        for word in words:
            word_list.append(word)
    clean_up_list(word_list)

def clean_up_list(word_list):
    clean_word_list = []
    for word in word_list:
        accepted = "abcdefghijklmnopqrstuvwxyz\'"
        for c in list(word):
            if c not in list(accepted):
                word = word.replace(c, "")
        if len(word) > 0:
            print(word)
            clean_word_list.append(word)


start('http://www.nameofwebsite.com/')

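2 Answers:

Answer 0: (score: 1)

This happens because post_text.string is None.

One of the p tags has no text in it, so .string returned None for that tag.

So when you execute words = content.lower().split(), you are actually calling .lower() on None, which does not have a .lower attribute.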
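To see the failure in isolation, without any HTML involved, here is a minimal sketch. The plain None is only a stand-in for what post_text.string returned; the variable names are made up for illustration.

```python
# Stand-in for post_text.string on a <p> tag that has no single text child.
content = None

try:
    content.lower()
    error_message = None
except AttributeError as exc:
    # Calling any str method on None raises AttributeError.
    error_message = str(exc)

print(error_message)
```

The message printed is the same one from the question, which confirms the problem is the None value and not the second function.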

What you can do is add an if statement that skips those tags.

Edit:

for post_text in soup.find_all('p'):
    content = post_text.string
    if content is None:  # skip tags whose .string is None
        continue
    words = content.lower().split()
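For completeness, here is a rough sketch of the whole script with that check in place. It parses an HTML string instead of fetching a URL so it can be run offline; collect_words, ACCEPTED, and the sample markup are names I made up for illustration, and clean_up_list is changed to return its result.

```python
from bs4 import BeautifulSoup

# Characters kept by clean_up_list (same set as in the question).
ACCEPTED = "abcdefghijklmnopqrstuvwxyz'"

def collect_words(html):
    """Return lowercase words from every <p> whose .string is not None."""
    soup = BeautifulSoup(html, 'html.parser')
    word_list = []
    for post_text in soup.find_all('p'):
        content = post_text.string
        if content is None:  # empty tag, or a tag with several children
            continue
        word_list.extend(content.lower().split())
    return word_list

def clean_up_list(word_list):
    """Drop every character outside ACCEPTED and discard empty results."""
    clean_word_list = []
    for word in word_list:
        cleaned = ''.join(c for c in word if c in ACCEPTED)
        if cleaned:
            clean_word_list.append(cleaned)
    return clean_word_list

# Hypothetical sample: one plain paragraph, one empty, one with mixed children.
sample = "<p>Hello, World!</p><p></p><p>It's <b>bold</b> text</p>"
words = clean_up_list(collect_words(sample))
print(words)
```

Note that the check silently drops the third paragraph as well as the empty one, because a p tag with several children also has .string set to None. Whether that is acceptable depends on the pages being scraped.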

Answer 1: (score: 1)

Here is an example that will cause the error:

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><div>hello</div><div>world</div></p>',
    'html.parser'
)

for p in soup.find_all('p'):
    print(repr(p.string))

--output:--
None

From the BeautifulSoup docs:

    .string
    If a tag has only one child, and that child is a NavigableString, the child is made available as .string
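To make the rule concrete, here is a small sketch that checks .string on four p tags. The id attributes and the markup are invented for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p id="a">plain</p>'              # one child, a string
    '<p id="b"><b>nested</b></p>'      # one child, a tag that has a .string
    '<p id="c">two <b>children</b></p>'  # two children
    '<p id="d"></p>',                  # no children
    'html.parser'
)

# Map each paragraph's id to whatever .string gives back.
results = {p['id']: p.string for p in soup.find_all('p')}

for key, value in results.items():
    print(key, repr(value))
```

Only the first two give a string back. The second works because a tag whose only child is another tag takes that child's .string. The last two return None, and those are the cases that break .lower().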

You can use get_text() instead:

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><div>hello</div><div>world</div></p>',
    'html.parser'
)

for p in soup.find_all('p'):
    print(p.get_text())

--output:--
helloworld
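Since the goal in the question is to split text into words, the run-together helloworld is probably not what you want. A sketch of one way around it, using the separator and strip arguments of get_text() (the variable names are mine):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><div>hello</div><div>world</div></p>',
    'html.parser'
)
p = soup.find('p')

joined = p.get_text()                           # pieces concatenated directly
spaced = p.get_text(separator=' ', strip=True)  # pieces stripped, joined by a space
words = spaced.lower().split()

print(repr(joined))
print(repr(spaced))
print(words)
```

With a separator, text from adjacent tags stays as separate words, and get_text() returns a string even for an empty tag, so .lower() is always safe to call on it.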

Or use .strings:

from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><div>hello</div><div>world</div></p>',
    'html.parser'
)

for p in soup.find_all('p'):
    for string in p.strings:
        print(string)

--output:--
hello
world

But .strings will also return whitespace (spaces, tabs, newlines):

from bs4 import BeautifulSoup

# The markup has a newline after <p> and after each </div>, plus
# leading spaces (or a tab) at the start of each <div> line.
soup = BeautifulSoup(
'''
<p>
  <div>hello</div>
  <div>world</div>
</p>
''',

    'html.parser'
)

for p in soup.find_all('p'):
    for string in p.strings:
        print(string)

--output:--


hello


world

To skip the whitespace, you can use .stripped_strings:

from bs4 import BeautifulSoup

soup = BeautifulSoup(
'''
<p>
  <div>hello</div>
  <div>world</div>
</p>
''',

    'html.parser'
)

for p in soup.find_all('p'):
    for string in p.stripped_strings:
        print(string)

--output:--
hello
world
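Applied to the word-list loop from the question, that could look something like the sketch below. paragraph_words and the sample markup are hypothetical, and it takes an HTML string rather than a URL so it runs without a network connection.

```python
from bs4 import BeautifulSoup

def paragraph_words(html):
    """Collect lowercase words from all text inside every <p> tag."""
    soup = BeautifulSoup(html, 'html.parser')
    words = []
    for p in soup.find_all('p'):
        # stripped_strings yields each text piece with surrounding
        # whitespace removed and skips whitespace-only pieces.
        for text in p.stripped_strings:
            words.extend(text.lower().split())
    return words

# Hypothetical sample: one plain paragraph, one empty, one with mixed children.
sample = "<p>Hello, World!</p><p></p><p>It's <b>bold</b> text</p>"
words = paragraph_words(sample)
print(words)
```

Unlike a check on .string, this keeps the text of paragraphs that contain other tags, and an empty p simply contributes nothing.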