Question

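Background: I'm trying to scrape some tables from this pro-football-reference page. I'm a complete newbie to Python, so a lot of the technical jargon ends up lost on me, and even after trying to understand how to solve the problem, I can't figure it out.

Specific problem: Because there are multiple tables on the page, I can't figure out how to get Python to target the one I want. I'm trying to get the Defense & Fumbles table. The code below is what I have so far. It comes from this tutorial, which uses a page from the same site, but one that has only a single table.

Sample code: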

#url we are scraping
url = "https://www.pro-football-reference.com/teams/nwe/2017.htm"

#html from the given url
html=urlopen(url)

# make soup object of html
soup = BeautifulSoup(html)

# we see that soup is a beautifulsoup object
type(soup) 

# collect the header cell text from the defense table
column_headers = [th.getText() for th in 
                  soup.findAll('table', {"id": "defense").findAll('th')]

column_headers #our column headers

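Attempts: I realized the tutorial's method wasn't going to work for me, so I tried changing the soup.findAll portion to target the specific table. But I repeatedly get an error saying:

AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

When I change it to find, the error becomes:

AttributeError: 'NoneType' object has no attribute 'find'

I'll be absolutely honest: I have no idea what I'm doing or what any of this means. I'd appreciate any help figuring out how to target that data and then scrape it.

Thank you,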

Answer 1

You are missing a "}" in the dict after the word "defense". Try the following and see if it works.

column_headers = [th.getText() for th in soup.findAll('table', {"id": "defense"}).findAll('th')]

Answer 2

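First, you want to use soup.find('table', {"id": "defense"}).findAll('th'): find one table, then find all of its 'th' tags.

The other problem is that the table with id "defense" is commented out in the HTML of that page: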
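
To make the difference concrete, here is a minimal sketch on a made-up two-table document. SAMPLE_HTML and its column names are assumptions for illustration, not the real page markup. find_all and get_text are the current spellings of findAll and getText. find_all() always returns a ResultSet (a list), which cannot be searched further, while find() returns a single Tag, or None when nothing matches.

```python
from bs4 import BeautifulSoup

# Tiny stand-in document (hypothetical; not the real PFR markup).
SAMPLE_HTML = """
<table id="offense"><tr><th>Player</th><th>Yds</th></tr></table>
<table id="defense"><tr><th>Player</th><th>Int</th><th>FF</th></tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all() returns a ResultSet (a list of Tags), even for a single match,
# so calling .find_all() on the result raises AttributeError.
tables = soup.find_all("table", {"id": "defense"})

# find() returns the first matching Tag, or None when nothing matches.
table = soup.find("table", {"id": "defense"})

# Only a Tag can be searched further.
column_headers = [th.get_text() for th in table.find_all("th")]
print(column_headers)  # ['Player', 'Int', 'FF']
```

If find() returns None, as it does on the real page, the chained call fails with the 'NoneType' error instead, so it is worth checking for None before chaining.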

<div class="placeholder"></div>
<!--
   <div class="table_outer_container">
      <div class="overthrow table_container" id="div_defense">
  <table class="sortable stats_table" id="defense" data-cols-to-freeze=2><caption>Defense &amp; Fumbles Table</caption>
   <colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>
   <thead>

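and so on. I assume JavaScript un-hides it. BeautifulSoup doesn't parse the text of comments, so you'll need to find the text of all the comments on the page (as in this answer), look for the one containing id="defense", and then feed that comment's text into BeautifulSoup.

Like this: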

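from bs4 import BeautifulSoup, Comment

# collect every HTML comment node in the page
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
# pick the comment that contains the defense table markup
defenseComment = next(c for c in comments if 'id="defense"' in c)
# parse the comment's text as HTML in its own right
defenseSoup = BeautifulSoup(str(defenseComment))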
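
For an end-to-end picture, the following is a rough sketch of the same idea on a small inline document rather than the live page. SAMPLE_HTML, its column names, and the "Example Player" row are invented for illustration, and the snake_case variable names are my own. It shows that the commented-out table is invisible to a normal search, then recovers it from the comment and reads the header and body cells.

```python
from bs4 import BeautifulSoup, Comment

# Stand-in for the page (hypothetical data): the table exists only
# inside an HTML comment, as on the real page.
SAMPLE_HTML = """
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
  <table class="sortable stats_table" id="defense">
    <caption>Defense &amp; Fumbles Table</caption>
    <thead><tr><th>No.</th><th>Player</th><th>Int</th></tr></thead>
    <tbody><tr><th>1</th><td>Example Player</td><td>2</td></tr></tbody>
  </table>
</div>
-->
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# The commented markup is a single Comment node, not tags, so this is None.
print(soup.find("table", {"id": "defense"}))

# Gather all comment nodes, then pick the one holding the defense table.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
# Passing a default to next() avoids StopIteration when nothing matches.
defense_comment = next((c for c in comments if 'id="defense"' in c), None)

column_headers, rows = [], []
if defense_comment is not None:
    # Re-parse the comment text as ordinary HTML.
    defense_soup = BeautifulSoup(str(defense_comment), "html.parser")
    table = defense_soup.find("table", {"id": "defense"})
    header_row = table.find("thead").find("tr")
    column_headers = [th.get_text() for th in header_row.find_all("th")]
    # Body rows may mix th and td cells, so collect both.
    rows = [[cell.get_text() for cell in tr.find_all(["th", "td"])]
            for tr in table.find("tbody").find_all("tr")]

print(column_headers)
print(rows)
```

On the real page the header may span more than one row, so the header-row selection here is only a starting point and may need adjusting.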

Scraping PFR football data with Python for a beginner
