Python - Scraping with BeautifulSoup does not return all rows

Asked: 2016-12-17 19:07:40

Tags: python web-scraping beautifulsoup

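I'm new to BeautifulSoup. I'm trying to scrape the "Season Stats" table from the ESPN Fantasy Basketball Standings page, but not all of the rows are returned. After some research I thought it might be a problem with html.parser, so I switched to lxml, and I got the same result. I'd appreciate it if someone could tell me how to get the names of all the teams.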

My code:

from bs4 import BeautifulSoup
from urllib.request import urlopen

soup = BeautifulSoup(urlopen("http://games.espn.com/fba/standings?leagueId=20960&seasonId=2017"),'html.parser')
tableStats = soup.find("table", {"class" : "tableBody"})
for row in tableStats.findAll('tr')[2:]:
    col = row.findAll('td')

    try:
        name = col[0].a.string.strip()
        print(name)
    except Exception as e:
        print(str(e))

Output (as you can see, only a few team names are shown):

Le Tuc Grizzlies Peyton Ravens Heaven Vultures Versailles Golden Bears Baltimore Corto's La Murette Scavengers XO Gayfishes

2 answers:

Answer 0 (score: 1)

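You seem to be getting the wrong table altogether. Rather than running findAll() on <table> tags, you can use find() to look up the one table that holds the whole standings. I also noticed that the stats table has an id attribute with the value statsTable. It's better to search by this id than by class, because an id is meant to be unique within an HTML document, whereas several tables can share the same class.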

See the comments in the code below for more guidance:

from bs4 import BeautifulSoup
import requests
# Note, I'm using requests here as it's a superior library
text = requests.get("http://games.espn.com/fba/standings?leagueId=20960&seasonId=2017").text
soup = BeautifulSoup(text,'html.parser')
# searching by id, always a better option when available
tableStats = soup.find("table", {"id" : "statsTable"})
for row in tableStats.findAll('tr')[3:]:
    col = row.findAll('td')
    try:
        # This fetches all the text in the tag stripped off all the HTML
        name = col[1].get_text()
        print(name)
    except Exception as e:
        print(str(e))
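
To see why the id lookup matters, here is a minimal offline sketch. The HTML string is made up for illustration (it is not ESPN's actual markup); it only assumes that more than one table carries the class tableBody while a single table carries the id statsTable.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two tables share a class, only one has the id.
SAMPLE_HTML = """
<table class="tableBody"><tr><td><a href="/t1">Team A</a></td></tr></table>
<table class="tableBody" id="statsTable">
  <tr><td>1</td><td><a href="/t1">Team A</a></td></tr>
  <tr><td>2</td><td><a href="/t2">Team B</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns only the first match, so a class shared by several
# tables can hand back a table other than the one you wanted.
by_class = soup.find("table", {"class": "tableBody"})
# The id identifies exactly one table.
by_id = soup.find("table", {"id": "statsTable"})

rows_by_class = len(by_class.find_all("tr"))
rows_by_id = len(by_id.find_all("tr"))
print(rows_by_class, rows_by_id)
```

In this sample the class lookup lands on the first, shorter table, which is the same symptom as in the question: some rows come back, but not all of them.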

Answer 1 (score: 0)

It's probably easier to parse the table with id="statsTable", which contains all the teams, for example:

from bs4 import BeautifulSoup
from urllib.request import urlopen
soup = BeautifulSoup(urlopen("http://games.espn.com/fba/standings?leagueId=20960&seasonId=2017"),'html.parser')
tableStats = soup.find('table', id="statsTable")
# Print the text of every link (an <a> tag with an href) inside the table
for row in tableStats.findAll('a', href=True):
    print(row.text)
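
One thing to watch with this approach: find() returns None when no matching table exists, for instance if the page layout changes, and the loop then fails with an AttributeError. A small sketch of a guarded version follows; the function name team_names and the sample HTML are hypothetical, chosen only so the example runs without a network request.

```python
from bs4 import BeautifulSoup

def team_names(html):
    """Return the text of every link inside the table with id="statsTable"."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="statsTable")
    if table is None:
        # No such table in this document: return an empty list
        # instead of raising AttributeError on None.
        return []
    # href=True keeps only <a> tags that actually have an href attribute.
    return [a.get_text(strip=True) for a in table.find_all("a", href=True)]

# Made-up markup for illustration only.
sample = (
    '<table id="statsTable"><tr>'
    '<td><a href="/t1"> Team A </a></td>'
    '<td><a name="x">skip</a></td>'
    '</tr></table>'
)
print(team_names(sample))
print(team_names("<p>no table</p>"))
```

Returning a list rather than printing inside the loop also makes it easy to check how many teams came back.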