这是我的代码:
source = urllib.request.urlopen('http://nflcombineresults.com/nflcombinedata_expanded.php ?year=2015&pos=&college=').read()
soup = bs.BeautifulSoup(source, 'lxml')
table = soup.table
table = soup.find(id='datatable')
table_rows = table.find_all('tr')
#print(table_rows)
year = []
name = []
college = []
pos = []
height = []
weight = []
hand_size = []
arm_length = []
wonderlic = []
fortyyrd = []
for row in table_rows[1:]:
col = row.find_all('td')
#row = [i.text for i in td]
#print(col[4])
# Create a variable of the string inside each <td> tag pair,
column_1 = col[0].string.strip()
# and append it to each variable
year.append(column_1)
column_2 = col[1].string.strip()
name.append(column_2)
column_3 = col[2].string.strip()
college.append(column_3)
column_4 = col[3].string.strip()
pos.append(column_4)
#print(col[4])
column_5 = col[4].string.strip()
height.append(column_5)
我想要添加的表中还有几列,但每当我尝试运行最后两行时,我都会收到错误消息:
“AttributeError:'NoneType'对象没有属性'strip'”
当我在这行上方打印col [4]时,我得到:
<td><div align="center">69</div></td>
我原本以为这是由于缺少数据,但网站原始表中第一个缺失数据的实例位于第一行的第9列(Wonderlic),而不是第4列。
我想要添加到我的数据帧中的代码片段中还没有包含其他几个列,尽管该单元格中有条目,但我也会收到它们的NoneType错误。
我很擅长使用BeautifulSoup从网站解析表格,所以这可能是一个愚蠢的问题,但是为什么这个对象NoneType如何解决这个问题所以我可以把这个表放到pandas数据帧中?
答案 0 :(得分:1)
import requests
from bs4 import BeautifulSoup
r = requests.get('http://nflcombineresults.com/nflcombinedata_expanded.php?year=2015&pos=&college=')
soup = BeautifulSoup(r.text, 'lxml')
for tr in soup.table.find_all('tr'):
row = [td.text for td in tr.find_all('td')]
print (row)
出:
['Year', 'Name', 'College', 'POS', 'Height (in)', 'Weight (lbs)', 'Hand Size (in)', 'Arm Length (in)', 'Wonderlic', '40 Yard', 'Bench Press', 'Vert Leap (in)', 'Broad Jump (in)', 'Shuttle', '3Cone', '60Yd Shuttle']
['2015', 'Ameer Abdullah', 'Nebraska', 'RB', '69', '205', '8.63', '30.00', '', '4.60', '24', '42.5', '130', '3.95', '6.79', '11.18']
['2015', 'Nelson Agholor', 'Southern California', 'WR', '73', '198', '9.25', '32.25', '', '4.42', '12', '', '', '', '', '']
['2015', 'Malcolm Agnew', 'Southern Illinois', 'RB', '70', '202', '', '', '', '*4.59', '', '', '', '', '', '']
['2015', 'Jay Ajayi', 'Boise State', 'RB', '73', '221', '10.00', '32.00', '24', '4.57', '19', '39.0', '121', '4.10', '7.10', '11.10']
['2015', 'Brandon Alexander', 'Central Florida', 'FS', '74', '195', '', '', '', '*4.59', '', '', '', '', '', '']
['2015', 'Kwon Alexander', 'Louisiana State', 'OLB', '73', '227', '9.25', '30.25', '', '4.55', '24', '36.0', '121', '4.20', '7.14', '']
['2015', 'Mario Alford', 'West Virginia', 'WR', '68', '180', '9.38', '31.25', '', '4.43', '13', '34.0', '121', '4.07', '6.64', '11.22']
['2015', 'Detric Allen', 'East Carolina', 'CB', '73', '200', '', '', '', '*4.59', '', '', '', '', '', '']
['2015', 'Javorius Allen', 'Southern California', 'RB', '73', '221', '9.38', '31.75', '12', '4.53', '11', '35.5', '121', '4.28', '6.96', '']
正如您所看到的,表中有很多空字段,更好的方法是将所有字段放在列表中,然后解压缩它们或使用namedtuple。 这将提高您的代码稳定性。
答案 1 :(得分:1)
AttributeError:&#39; NoneType&#39;对象没有属性&#39; strip&#39;
实际错误发生在具有单个单元格的表格的最后一行,这是它的HTML:
<tr style="background-color:#333333;"><td colspan="15"> </td></tr>
只需切片:
for row in table_rows[1:-1]:
就提高代码的整体质量而言,您可以/应该关注@宏杰李的答案。
答案 2 :(得分:1)