Error when web scraping with Python BeautifulSoup: extracting content from a GitHub profile

Asked: 2018-09-10 14:50:41

Tags: python web-scraping beautifulsoup

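This is Python code that uses the BeautifulSoup library to scrape content from a GitHub repositories page. I am getting this error:

'NoneType' object has no attribute 'text'

in this simple code. The error occurs at the two lines marked with comments.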

import requests 
from bs4 import BeautifulSoup 
import csv 

URL = "https://github.com/DURGESHBARWAL?tab=repositories"
r = requests.get(URL) 

soup = BeautifulSoup(r.text, 'html.parser') 

repos = []
table = soup.find('ul', attrs = {'data-filterable-for':'your-repos-filter'}) 

for row in table.find_all('li', attrs = {'itemprop':'owns'}): 
    repo = {}
    repo['name'] = row.find('div').find('h3').a.text
    # First error position
    repo['desc'] = row.find('div').p.text
    # Second error position
    repo['lang'] = row.find('div', attrs = {'class':'f6 text-gray mt-2'}).find('span', attrs = {'class':'mr-3'}).text
    repos.append(repo) 

filename = 'extract.csv'
with open(filename, 'w') as f: 
    w = csv.DictWriter(f,['name','desc','lang'])
    w.writeheader() 
    for repo in repos: 
        w.writerow(repo)

Output

Traceback (most recent call last):
  File "webscrapping.py", line 16, in <module>
    repo['desc'] = row.find('div').p.text
AttributeError: 'NoneType' object has no attribute 'text'

2 answers:

Answer 0 (score: 0)

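This happens because finding elements with BeautifulSoup behaves like a dict.get() call. When you find an element, it gets it from the element tree. If it can't find one, it returns None instead of raising an Exception. None doesn't have the attributes an Element has, such as text, attr, and so on. So when you call Element.text without a try/except or without checking the type, you are gambling that the element will always be there.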
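The comparison with dict.get() can be checked on a small example. This is only a sketch: the HTML snippet is made up for illustration (it is not GitHub's markup), and the second list item deliberately has no description paragraph.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <li> has no <p> description.
html = """
<ul>
  <li><h3><a>first</a></h3><p>has a description</p></li>
  <li><h3><a>second</a></h3></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")
rows = soup.find_all("li")

# dict.get() returns None for a missing key instead of raising KeyError.
missing_value = {"name": "first"}.get("desc")

# find() likewise returns None when no matching tag exists.
found = rows[0].find("p")
missing_tag = rows[1].find("p")

# Reading .text on that None reproduces the error from the question.
try:
    missing_tag.text
    error_message = None
except AttributeError as exc:
    error_message = str(exc)
```

Here found is a real tag, while missing_tag is None, so the .text access is what fails, not the find() call itself.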

I'd probably first store the elements that cause problems in a temp variable so you can type check them. Either that, or implement try/except.

Type checking: store the result of find() in a temp variable and read .text only if it is not None.
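A minimal sketch of the type-checking option, assuming a made-up HTML snippet in place of the real GitHub page. The helper name text_or_none is hypothetical and not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <li> has no <p> description.
html = """
<ul>
  <li><h3><a>first</a></h3><p>has a description</p></li>
  <li><h3><a>second</a></h3></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")


def text_or_none(tag):
    """Return the stripped text of a tag, or None if find() found nothing."""
    if tag is None:
        return None
    return tag.text.strip()


repos = []
for row in soup.find_all("li"):
    # Keep the find() results in temp variables so they can be checked.
    name_tag = row.find("a")
    desc_tag = row.find("p")
    repos.append({"name": text_or_none(name_tag), "desc": text_or_none(desc_tag)})
```

Rows without a description end up with desc set to None instead of stopping the loop.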

Try/except: wrap the .text access in try/except and catch the AttributeError.
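A minimal sketch of the try/except option, again on a made-up HTML snippet rather than the real GitHub page. The helper name safe_text is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <li> has no <p> description.
html = """
<ul>
  <li><h3><a>first</a></h3><p>has a description</p></li>
  <li><h3><a>second</a></h3></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")


def safe_text(row, tag_name):
    """Return the stripped text of the first matching tag, or None if absent."""
    try:
        return row.find(tag_name).text.strip()
    except AttributeError:
        # find() returned None, so there was no .text to read.
        return None


repos = [
    {"name": safe_text(row, "a"), "desc": safe_text(row, "p")}
    for row in soup.find_all("li")
]
```

Keeping the try block this small matters, because a wider block could also hide an AttributeError caused by an unrelated typo.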

Personally, I lean toward try/except because it's more concise, and exception handling is a good way to make a program more robust.

Answer 1 (score: 0)

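Your find calls are imprecise and chained, so when you look inside a <div> tag that has no p child, you get None. The code then goes on to access .text on that None, which crashes your program with an AttributeError.

Try the following set of .find calls, which use the itemprop attribute you are already relying on, together with a try-except block that leaves any missing field as None: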

import requests 
from bs4 import BeautifulSoup 
import csv 

URL = "https://github.com/DURGESHBARWAL?tab=repositories"
r = requests.get(URL) 

soup = BeautifulSoup(r.text, 'html.parser') 

repos = []
table = soup.find('ul', attrs = {'data-filterable-for': 'your-repos-filter'}) 

for row in table.find_all('li', {'itemprop': 'owns'}): 
    repo = {
        'name': row.find('a', {'itemprop' : 'name codeRepository'}),
        'desc': row.find('p', {'itemprop' : 'description'}),
        'lang': row.find('span', {'itemprop' : 'programmingLanguage'})
    }

    # Replace each found tag with its stripped text; a missing tag stays None.
    for k, v in repo.items():
        try: 
            repo[k] = v.text.strip()
        except AttributeError:
            pass

    repos.append(repo)

filename = 'extract.csv'
with open(filename, 'w') as f: 
    w = csv.DictWriter(f,['name','desc','lang'])
    w.writeheader() 
    for repo in repos: 
        w.writerow(repo)

Debug output (in addition to the written CSV):

[   {   'desc': 'This a Django-Python Powered a simple functionality based '
                'Bot application',
        'lang': 'Python',
        'name': 'Sandesh'},
    {'desc': None, 'lang': 'Jupyter Notebook', 'name': 'python_notes'},
    {   'desc': 'Installing DSpace using docker',
        'lang': 'Java',
        'name': 'DSpace-Docker-Installation-1'},
    {   'desc': 'This Repo Contains the DSpace Installation Steps',
        'lang': None,
        'name': 'DSpace-Installation'},
    {   'desc': '(Official) The DSpace digital asset management system that '
                'powers your Institutional Repository',
        'lang': 'Java',
        'name': 'DSpace'},
    {   'desc': 'This Repo contain the DSpace installation steps with '
                'docker.',
        'lang': None,
        'name': 'DSpace-Docker-Installation'},
    {   'desc': 'This Repository contain the Intermediate system for the '
                'Collaboration and DSpace System',
        'lang': 'Python',
        'name': 'Community-OER-Repository'},
    {   'desc': 'A class website to share the knowledge and expanding the '
                'productivity through digital communication.',
        'lang': 'PHP',
        'name': 'class-website'},
    {   'desc': 'This is a POC for the Voting System. It is a precise '
                'design and implementation of Voting System based on the '
                'features of Blockchain which has the potential to '
                'substitute the traditional e-ballet/EVM system for voting '
                'purpose.',
        'lang': 'Python',
        'name': 'Blockchain-Based-Ballot-System'},
    {   'desc': 'It is a short describtion of Modern Django',
        'lang': 'Python',
        'name': 'modern-django'},
    {   'desc': 'It is just for the sample work.',
        'lang': 'HTML',
        'name': 'Task'},
    {   'desc': 'This Repo contain the sorting algorithms in C,predefiend '
                'function of C, C++ and Java',
        'lang': 'C',
        'name': 'Sorting_Algos_Predefined_functions'},
    {   'desc': 'It is a arduino program, for monitor the temperature and '
                'humidity from sensor DHT11.',
        'lang': 'C++',
        'name': 'DHT_11_Arduino'},
    {   'desc': "This is a registration from,which collect data from user's "
                'desktop and put into database after validation.',
        'lang': 'PHP',
        'name': 'Registration_Form'},
    {   'desc': 'It is a dynamic multi-part data driven search engine in '
                'PHP & MySQL from absolutely scratch for the website.',
        'lang': 'PHP',
        'name': 'search_engine'},
    {   'desc': 'It is just for learning github.',
        'lang': None,
        'name': 'Hello_world'}]
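As for how the None values above appear in the CSV, Python's csv module writes None as an empty string. The sketch below uses two rows copied from the debug output and an in-memory buffer standing in for extract.csv, which is an assumption made only to keep the example self-contained.

```python
import csv
import io

# Two rows taken from the debug output above, each with one missing field.
rows = [
    {"name": "python_notes", "desc": None, "lang": "Jupyter Notebook"},
    {"name": "Hello_world", "desc": "It is just for learning github.", "lang": None},
]

buffer = io.StringIO()  # stands in for the extract.csv file
writer = csv.DictWriter(buffer, ["name", "desc", "lang"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

The missing description and language come out as empty cells, so no extra conversion is needed before writing.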