I'm trying to get the La Liga league table from Wikipedia, but I can't even use find_all to reach the table I'm trying to grab. What's more, the exact same code I wrote scrapes the EPL data from Wikipedia just fine...
The full HTML is here: view-source:https://en.wikipedia.org/wiki/2015%E2%80%9316_La_Liga
The section in question is here:
<h2><span class="mw-headline" id="League_table">League table</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=2015%E2%80%9316_La_Liga&action=edit&section=6" title="Edit section: League table">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<h3><span class="mw-headline" id="Standings">Standings</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=2015%E2%80%9316_La_Liga&action=edit&section=7" title="Edit section: Standings">edit</a><span class="mw-editsection-bracket">]</span></span></h3>
<table class="wikitable" style="text-align:center;">
<tr>
<th scope="col" width="28"><abbr title="Position">Pos</abbr>
</th>
<th scope="col" width="190">Team
<div class="plainlinks hlist navbar mini" style="float:right">
<ul>
<li class="nv-view"><a href="/wiki/Template:2015%E2%80%9316_La_Liga_table" title="Template:2015–16 La Liga table"><span title="View this template">v</span></a>
</li>
<li class="nv-talk"><a href="/wiki/Template_talk:2015%E2%80%9316_La_Liga_table" title="Template talk:2015–16 La Liga table"><span title="Discuss this template">t</span></a>
</li>
<li class="nv-edit"><a class="external text" href="//en.wikipedia.org/w/index.php?title=Template:2015%E2%80%9316_La_Liga_table&action=edit"><span title="Edit this template">e</span></a>
</li>
</ul>
</div>
</th>
This is how I request the page, plus my only clean-up code, before trying to find all the tables:
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/2015-16_La_Liga").text, "html.parser")
# Strip footnote markers before counting the tables
for superscript in soup.find_all("sup"):
    superscript.decompose()
print len(soup.find_all("table", attrs={"class": "wikitable"}))
Yet the length I get is 2, even though, looking at the page HTML, I should be getting at least 14 tables with those attributes...
I don't know where to start; any help would be greatly appreciated.
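As a quick sanity check (not from the original post), one way to see whether the HTML parser itself is dropping part of the tree is to parse the same response with several parsers and compare the counts; this sketch assumes lxml and html5lib are installed:

import requests
from bs4 import BeautifulSoup

# Illustrative only: compare how many wikitables each parser sees
html = requests.get("https://en.wikipedia.org/wiki/2015-16_La_Liga").text
for parser in ("html.parser", "lxml", "html5lib"):
    soup = BeautifulSoup(html, parser)
    print("%s: %d" % (parser, len(soup.find_all("table", attrs={"class": "wikitable"}))))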
-- EDIT --
Answer 0 (score: 1)
Everything works fine here.
PyQuery version:
from pyquery import PyQuery
pq = PyQuery(url="https://en.wikipedia.org/wiki/2015-16_La_Liga")
all_tables = pq(".wikitable")
print len(all_tables)
BeautifulSoup version (the metadata below is from my installed bs4/__init__.py):
__author__ = "Leonard Richardson (leonardr@segfault.org)"
__version__ = "4.3.2"
__copyright__ = "Copyright (c) 2004-2013 Leonard Richardson"
__license__ = "MIT"
from bs4 import BeautifulSoup
import requests
soup = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/2015-16_La_Liga").text, "html.parser")
for superscript in soup.find_all("sup"):
    superscript.decompose()
print len(soup.find_all("table", attrs={"class": "wikitable"}))
Both versions return 13.
Maybe you should install version 4.3.2 of bs4, or use PyQuery?
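For reference (not part of the original answer), a quick way to check which BeautifulSoup 4 release is actually installed, and to upgrade it if needed:

# Print the installed bs4 version at runtime
import bs4
print(bs4.__version__)

# then, from the shell, upgrade if it is older than expected:
# pip install --upgrade beautifulsoup4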
Answer 1 (score: 0)
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://en.wikipedia.org/wiki/2015-16_La_Liga')
bsObj = BeautifulSoup(html.read(), 'lxml')
result = bsObj.find_all("table", class_="wikitable")
print(result)
But this gives only 13 tables, which is what I actually see at the URL.
python==3.4.3
beautifulsoup4==4.4.1
You can also use pip install requests.
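For example, the same lookup with requests instead of urllib (a sketch, assuming requests and lxml are installed):

import requests
from bs4 import BeautifulSoup

# Fetch the page with requests and parse it with lxml
html = requests.get("https://en.wikipedia.org/wiki/2015-16_La_Liga").text
soup = BeautifulSoup(html, "lxml")
tables = soup.find_all("table", class_="wikitable")
print(len(tables))  # number of wikitable-classed tables on the page

# e.g. peek at the first few rows of the first matched table
for row in tables[0].find_all("tr")[:5]:
    print([cell.get_text(strip=True) for cell in row.find_all(["th", "td"])])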