尝试使用BeautifulSoup解析表时出现NoneType错误

时间:2017-01-12 16:01:25

标签: python parsing web-scraping beautifulsoup

这是我的代码:

source = urllib.request.urlopen('http://nflcombineresults.com/nflcombinedata_expanded.php ?year=2015&pos=&college=').read()
soup = bs.BeautifulSoup(source, 'lxml')

table = soup.table
table = soup.find(id='datatable')
table_rows = table.find_all('tr')
#print(table_rows)

year = []
name = []
college = []
pos = []
height = []
weight = []
hand_size = []
arm_length = []
wonderlic = []
fortyyrd = []

for row in table_rows[1:]:
    col = row.find_all('td')
    #row = [i.text for i in td]
    #print(col[4])

    # Create a variable of the string inside each <td> tag pair,
    column_1 = col[0].string.strip()
    # and append it to each variable
    year.append(column_1)

    column_2 = col[1].string.strip()
    name.append(column_2)

    column_3 = col[2].string.strip()
    college.append(column_3)

    column_4 = col[3].string.strip()
    pos.append(column_4)

    #print(col[4])
    column_5 = col[4].string.strip()
    height.append(column_5)

我想要添加的表中还有几列,但每当我尝试运行最后两行时,我都会收到错误消息:

  

“AttributeError:'NoneType'对象没有属性'strip'”

当我在这行上方打印col [4]时,我得到:

<td><div align="center">69</div></td> 

我原本以为这是由于缺少数据,但网站原始表中第一个缺失数据的实例位于第一行的第9列(Wonderlic),而不是第4列。

我想要添加到我的数据帧中的代码片段中还没有包含其他几个列,尽管该单元格中有条目,但我也会收到它们的NoneType错误。

我很擅长使用BeautifulSoup从网站解析表格,所以这可能是一个愚蠢的问题,但是为什么这个对象NoneType如何解决这个问题所以我可以把这个表放到pandas数据帧中?

3 个答案:

答案 0 :(得分:1)

import requests
from bs4 import BeautifulSoup 

r = requests.get('http://nflcombineresults.com/nflcombinedata_expanded.php?year=2015&pos=&college=')
soup = BeautifulSoup(r.text, 'lxml')
for tr in soup.table.find_all('tr'):
    row = [td.text for td in tr.find_all('td')]
    print (row)

出:

['Year', 'Name', 'College', 'POS', 'Height (in)', 'Weight (lbs)', 'Hand Size (in)', 'Arm Length (in)', 'Wonderlic', '40 Yard', 'Bench Press', 'Vert Leap (in)', 'Broad Jump (in)', 'Shuttle', '3Cone', '60Yd Shuttle']
['2015', 'Ameer Abdullah', 'Nebraska', 'RB', '69', '205', '8.63', '30.00', '', '4.60', '24', '42.5', '130', '3.95', '6.79', '11.18']
['2015', 'Nelson Agholor', 'Southern California', 'WR', '73', '198', '9.25', '32.25', '', '4.42', '12', '', '', '', '', '']
['2015', 'Malcolm Agnew', 'Southern Illinois', 'RB', '70', '202', '', '', '', '*4.59', '', '', '', '', '', '']
['2015', 'Jay Ajayi', 'Boise State', 'RB', '73', '221', '10.00', '32.00', '24', '4.57', '19', '39.0', '121', '4.10', '7.10', '11.10']
['2015', 'Brandon Alexander', 'Central Florida', 'FS', '74', '195', '', '', '', '*4.59', '', '', '', '', '', '']
['2015', 'Kwon Alexander', 'Louisiana State', 'OLB', '73', '227', '9.25', '30.25', '', '4.55', '24', '36.0', '121', '4.20', '7.14', '']
['2015', 'Mario Alford', 'West Virginia', 'WR', '68', '180', '9.38', '31.25', '', '4.43', '13', '34.0', '121', '4.07', '6.64', '11.22']
['2015', 'Detric Allen', 'East Carolina', 'CB', '73', '200', '', '', '', '*4.59', '', '', '', '', '', '']
['2015', 'Javorius Allen', 'Southern California', 'RB', '73', '221', '9.38', '31.75', '12', '4.53', '11', '35.5', '121', '4.28', '6.96', '']

正如您所看到的,表中有很多空字段,更好的方法是将所有字段放在列表中,然后解压缩它们或使用namedtuple。 这将提高您的代码稳定性。

答案 1 :(得分:1)

  

AttributeError:&#39; NoneType&#39;对象没有属性&#39; strip&#39;

实际错误发生在具有单个单元格的表格的最后一行,这是它的HTML:

<tr style="background-color:#333333;"><td colspan="15">&nbsp;</td></tr>

只需切片

for row in table_rows[1:-1]:

就提高代码的整体质量而言,您可以/应该关注@宏杰李的答案。

答案 2 :(得分:1)

或者,如果您想尝试使用pandas,您可以这样做:

import pandas as pd

df = pd.read_html("http://nflcombineresults.com/nflcombinedata_expanded.php?year=2015&pos=&college=")[0]
df.head()

输出:

enter image description here

相关问题