Beautiful Soup is missing tables from Wikipedia

Time: 2015-09-30 18:26:30

Tags: python beautifulsoup html-table

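I'm trying to get the La Liga league table from Wikipedia, but I can't even reach the table I'm trying to grab using find_all. What's more, the exact same code I wrote scrapes the EPL data from Wikipedia perfectly...

The full HTML is here: view-source:https://en.wikipedia.org/wiki/2015%E2%80%9316_La_Liga

The relevant part is here: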

<h2><span class="mw-headline" id="League_table">League table</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=2015%E2%80%9316_La_Liga&amp;action=edit&amp;section=6" title="Edit section: League table">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<h3><span class="mw-headline" id="Standings">Standings</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=2015%E2%80%9316_La_Liga&amp;action=edit&amp;section=7" title="Edit section: Standings">edit</a><span class="mw-editsection-bracket">]</span></span></h3>
<table class="wikitable" style="text-align:center;">
  <tr>
    <th scope="col" width="28"><abbr title="Position">Pos</abbr>
    </th>
    <th scope="col" width="190">Team
      <div class="plainlinks hlist navbar mini" style="float:right">
        <ul>
          <li class="nv-view"><a href="/wiki/Template:2015%E2%80%9316_La_Liga_table" title="Template:2015–16 La Liga table"><span title="View this template">v</span></a>
          </li>
          <li class="nv-talk"><a href="/wiki/Template_talk:2015%E2%80%9316_La_Liga_table" title="Template talk:2015–16 La Liga table"><span title="Discuss this template">t</span></a>
          </li>
          <li class="nv-edit"><a class="external text" href="//en.wikipedia.org/w/index.php?title=Template:2015%E2%80%9316_La_Liga_table&amp;action=edit"><span title="Edit this template">e</span></a>
          </li>
        </ul>
      </div>
    </th>

This is how I request the page, along with the only cleanup I do, before trying to find all the tables:

soup = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/2015-16_La_Liga").text, "html.parser")
for superscript in soup.find_all("sup"):
    superscript.decompose()
print len(soup.find_all("table", attrs={"class": "wikitable"}))

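However, I get a length of 2, whereas from looking at the page HTML I should be getting at least 14 tables with those attributes...

I don't know where to go from here; any help would be greatly appreciated.

- Edit -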

Input/Output: the output of soup shows that the wikitable is still there...

2 answers:

Answer 0: (score: 1)

Everything works fine for me.

PyQuery version

from pyquery import PyQuery

pq = PyQuery(url="https://en.wikipedia.org/wiki/2015-16_La_Liga")
all_tables = pq(".wikitable")
print len(all_tables)

BeautifulSoup version

__author__ = "Leonard Richardson (leonardr@segfault.org)"
__version__ = "4.3.2"
__copyright__ = "Copyright (c) 2004-2013 Leonard Richardson"
__license__ = "MIT"

from bs4 import BeautifulSoup
import requests



soup = BeautifulSoup(requests.get("https://en.wikipedia.org/wiki/2015-16_La_Liga").text, "html.parser")
for superscript in soup.find_all("sup"):
    superscript.decompose()
print len(soup.find_all("table", attrs={"class": "wikitable"}))

Both versions return 13.

Maybe you should install version 4.3.2 of bs4, or use PyQuery?
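
One way to narrow down where the tables go missing is to count the opening table tags in the raw response text and compare that with what find_all returns. The sketch below is only a rough cross-check and is not part of either answer: count_wikitable_tags is a hypothetical helper, it assumes Python 3, and it uses regular expressions rather than a real HTML parser, so its result should be treated as an approximation.

```python
import re

# Matches an opening <table ...> tag (rough: assumes no ">" inside attribute values).
TABLE_TAG = re.compile(r"<table\b[^>]*>", re.IGNORECASE)
# Captures the quoted value of a class attribute inside a single tag.
CLASS_ATTR = re.compile(r"""\bclass\s*=\s*(["'])(.*?)\1""", re.IGNORECASE | re.DOTALL)


def count_wikitable_tags(html):
    """Count opening <table> tags whose class list contains 'wikitable'.

    Hypothetical helper for debugging only; it inspects raw text and does
    not build a document tree.
    """
    count = 0
    for tag in TABLE_TAG.findall(html):
        match = CLASS_ATTR.search(tag)
        if match and "wikitable" in match.group(2).split():
            count += 1
    return count


# Small inline sample so the sketch runs without network access.
sample = (
    '<table class="wikitable" style="text-align:center;"><tr></tr></table>'
    '<table class="infobox vcard"></table>'
    "<TABLE CLASS='wikitable sortable'></TABLE>"
)
print(count_wikitable_tags(sample))
```

Running the same function on requests.get(...).text gives a number to compare against the find_all count. If the raw count is high while find_all returns 2, the parser is the likely suspect; if both are low, the HTML that was fetched probably differs from what the browser shows.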

Answer 1: (score: 0)

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://en.wikipedia.org/wiki/2015-16_La_Liga')
bsObj = BeautifulSoup(html.read(), 'lxml')
result = bsObj.find_all("table", class_="wikitable")
print (result)

But this returns only 13 tables, which is the number I actually see at that URL.

python==3.4.3
beautifulsoup4==4.4.1

You can also install requests with pip install requests
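
Since the question uses "html.parser" and this answer uses 'lxml', it may be worth running the same markup through each parser and comparing the counts. The following is a minimal sketch under a few assumptions: Python 3 with beautifulsoup4 installed, count_wikitables is a hypothetical helper name, and lxml and html5lib are optional (a parser that is not installed is reported as None rather than raising).

```python
import bs4
from bs4 import BeautifulSoup, FeatureNotFound


def count_wikitables(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: table_count}; None means that parser is not installed.

    Hypothetical helper for comparing parsers on the same markup.
    """
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # Beautiful Soup raises this when the requested parser is unavailable.
            counts[parser] = None
            continue
        counts[parser] = len(soup.find_all("table", class_="wikitable"))
    return counts


# Inline sample so the sketch runs without network access.
sample = (
    '<table class="wikitable"><tr><td>1</td></tr></table>'
    '<table class="wikitable sortable"></table>'
    '<table class="infobox"></table>'
)
print("beautifulsoup4", bs4.__version__)
print(count_wikitables(sample))
```

Passing the page text from requests.get(...).text instead of the sample shows whether the parsers disagree on the real page, and the printed version makes it easier to compare environments with the ones listed in the answers.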