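I have gone through most of the solutions to similar questions, but I haven't found one that works. More importantly, I haven't found an explanation of why this happens, beyond "JavaScript or something else is being called" on the site being scraped.
I am trying to scrape the "Officials" table from this page: http://www.pro-football-reference.com/boxscores/201309050den.htm
My code is: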
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.pro-football-reference.com/boxscores/201309050den.htm"
html = urlopen(url)
bsObj = BeautifulSoup(html, "lxml")
# Look for the table by its id attribute
officials = bsObj.findAll("table", {"id": "officials"})
for entry in officials:
    print(str(entry))
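For now I am just printing to the console, but I get either an empty list with findAll or None with find. I also tried this with the basic html.parser, with no luck.
Could someone with a better understanding of HTML explain what is different about this particular page? Thanks in advance!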
Answer 0 (score: 1)
Try this code:
from selenium import webdriver
import time
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
url= "http://www.pro-football-reference.com/boxscores/201309050den.htm"
driver.maximize_window()
driver.get(url)
time.sleep(5)
content = driver.page_source.encode('utf-8').strip()
soup = BeautifulSoup(content,"html.parser")
officials = soup.findAll("table",{"id":"officials"})
for entry in officials:
    print(str(entry))
driver.quit()
It will print:
<table class="suppress_all sortable stats_table now_sortable" data-cols-to-freeze="0" id="officials"><thead><tr class="thead onecell"><td class=" center" colspan="2" data-stat="onecell">Officials</td></tr></thead><caption>Officials Table</caption><tbody>
<tr data-row="0"><th class=" " data-stat="ref_pos" scope="row">Referee</th><td class=" " data-stat="name"><a href="/officials/ColeWa0r.htm">Walt Coleman</a></td></tr>
<tr data-row="1"><th class=" " data-stat="ref_pos" scope="row">Umpire</th><td class=" " data-stat="name"><a href="/officials/ElliRo0r.htm">Roy Ellison</a></td></tr>
<tr data-row="2"><th class=" " data-stat="ref_pos" scope="row">Head Linesman</th><td class=" " data-stat="name"><a href="/officials/BergJe1r.htm">Jerry Bergman</a></td></tr>
<tr data-row="3"><th class=" " data-stat="ref_pos" scope="row">Field Judge</th><td class=" " data-stat="name"><a href="/officials/GautGr0r.htm">Greg Gautreaux</a></td></tr>
<tr data-row="4"><th class=" " data-stat="ref_pos" scope="row">Back Judge</th><td class=" " data-stat="name"><a href="/officials/YettGr0r.htm">Greg Yette</a></td></tr>
<tr data-row="5"><th class=" " data-stat="ref_pos" scope="row">Side Judge</th><td class=" " data-stat="name"><a href="/officials/PattRi0r.htm">Rick Patterson</a></td></tr>
<tr data-row="6"><th class=" " data-stat="ref_pos" scope="row">Line Judge</th><td class=" " data-stat="name"><a href="/officials/BaynRu0r.htm">Rusty Baynes</a></td></tr>
</tbody></table>
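If you want the names rather than the raw markup, something along these lines should work on the table printed above. This is only a minimal sketch: `parse_officials` is a hypothetical helper name, and `SAMPLE_TABLE` is a trimmed two-row copy of the output above so the snippet runs without a browser or network access.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table markup printed above (two rows only).
SAMPLE_TABLE = """
<table id="officials"><caption>Officials Table</caption><tbody>
<tr><th data-stat="ref_pos" scope="row">Referee</th><td data-stat="name"><a href="/officials/ColeWa0r.htm">Walt Coleman</a></td></tr>
<tr><th data-stat="ref_pos" scope="row">Umpire</th><td data-stat="name"><a href="/officials/ElliRo0r.htm">Roy Ellison</a></td></tr>
</tbody></table>
"""


def parse_officials(table_html):
    """Hypothetical helper: map each position to the official's name."""
    soup = BeautifulSoup(table_html, "html.parser")
    table = soup.find("table", id="officials")
    if table is None:
        return {}
    result = {}
    for row in table.find_all("tr"):
        position = row.find("th", {"data-stat": "ref_pos"})
        name = row.find("td", {"data-stat": "name"})
        # Skip rows (such as a header row) that lack either cell.
        if position is not None and name is not None:
            result[position.get_text(strip=True)] = name.get_text(strip=True)
    return result


print(parse_officials(SAMPLE_TABLE))
```

With the Selenium code above you could pass `str(entry)` to the helper instead of the sample string.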
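Answer 1 (score: 1)
The table is in the page source, it is just commented out, which is why BeautifulSoup does not see it as a tag. Stripping the comment markers with a regex is trivial: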
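from bs4 import BeautifulSoup
import requests
import re

url = "http://www.pro-football-reference.com/boxscores/201309050den.htm"
# Use .text (str) rather than .content (bytes): on Python 3 a str pattern cannot be applied to bytes
html = requests.get(url).text
# Remove the comment delimiters so the commented-out markup is parsed as normal tags
bsObj = BeautifulSoup(re.sub("<!--|-->", "", html), "lxml")
officials = bsObj.find_all("table", {"id": "officials"})
for entry in officials:
    print(entry)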
There is only one such table, so you don't need find_all and the loop is unnecessary; just use find:
In [1]: from bs4 import BeautifulSoup
...: import requests
...: import re
...: url = "http://www.pro-football-reference.com/boxscores/201309050den.htm"
...:
...: html = requests.get(url).text
...: bsObj = BeautifulSoup(re.sub("<!--|-->","", html), "lxml")
...: officials = bsObj.find(id="officials")
...: print(officials)
...:
<table class="suppress_all sortable stats_table" data-cols-to-freeze="0" id="officials"><caption>Officials Table</caption><tr class="thead onecell"><td class=" center" colspan="2" data-stat="onecell">Officials</td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Referee</th><td class=" " data-stat="name"><a href="/officials/ColeWa0r.htm">Walt Coleman</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Umpire</th><td class=" " data-stat="name"><a href="/officials/ElliRo0r.htm">Roy Ellison</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Head Linesman</th><td class=" " data-stat="name"><a href="/officials/BergJe1r.htm">Jerry Bergman</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Field Judge</th><td class=" " data-stat="name"><a href="/officials/GautGr0r.htm">Greg Gautreaux</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Back Judge</th><td class=" " data-stat="name"><a href="/officials/YettGr0r.htm">Greg Yette</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Side Judge</th><td class=" " data-stat="name"><a href="/officials/PattRi0r.htm">Rick Patterson</a></td></tr>
<tr><th class=" " data-stat="ref_pos" scope="row">Line Judge</th><td class=" " data-stat="name"><a href="/officials/BaynRu0r.htm">Rusty Baynes</a></td></tr>
</table>
In [2]:
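If you would rather not run a regex over the whole page, another option is to let BeautifulSoup hand you the comment nodes and parse only those. The following is a rough sketch, not tested against the live site: `find_in_comments` is a hypothetical helper, and `SAMPLE_PAGE` is a made-up fragment that only imitates the commented-out table so the snippet runs offline. It also shows why the original `find` returned None.

```python
from bs4 import BeautifulSoup, Comment

# Made-up fragment imitating a table that is wrapped in an HTML comment.
SAMPLE_PAGE = """
<div id="all_officials">
<!--
<table id="officials"><tr><th>Referee</th><td>Walt Coleman</td></tr></table>
-->
</div>
"""


def find_in_comments(html, element_id):
    """Hypothetical helper: search inside HTML comments for an element id."""
    soup = BeautifulSoup(html, "html.parser")
    # Comment nodes are strings, so select them with a string filter.
    comments = soup.find_all(string=lambda text: isinstance(text, Comment))
    for comment in comments:
        # Re-parse the comment body as its own small document.
        fragment = BeautifulSoup(comment, "html.parser")
        match = fragment.find(id=element_id)
        if match is not None:
            return match
    return None


# A plain search misses the table because it is comment text, not a tag.
plain = BeautifulSoup(SAMPLE_PAGE, "html.parser").find(id="officials")
print(plain)

table = find_in_comments(SAMPLE_PAGE, "officials")
print(table.td.get_text())
```

On the real page you would pass `requests.get(url).text` instead of the sample string.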
Answer 2 (score: 0)