Web scraping with Python: unable to access td elements

Time: 2018-09-12 00:18:15

Tags: python html web-scraping

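I am trying to scrape the following page: https://www.pro-football-reference.com/boxscores/

The page lists the scores of American football games. I want the date, the winner, and the loser of each game. I can get the date easily, but I can't work out how to isolate and extract the team names for the winner and the loser. Here is what I have so far: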

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup


#assigning url
my_url = 'https://www.pro-football-reference.com/boxscores/'

# opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

# html parsing
page_soup = soup(page_html,"html.parser")

games = page_soup.findAll("div",{"class":"game_summary expanded nohover"})


for game in games:
    date_block = game.findAll("tr",{"class":"date"})
    date_val = date_block[0].text
    winner_block = game.findAll("tr",{"class":"winner"})
    #here I need a line that returns the game winner, e.g. "Philadelphia Eagles"
    loser = game.findAll("tr",{"class":"loser"})

Here is the relevant HTML:

<div class="game_summary expanded nohover">
<table class="teams">
    <tbody>
        <tr class="date">
            <td colspan="3">Sep 6, 2018</td>
        </tr>
        <tr class="loser">
            <td><a href="/teams/atl/2018.htm">Atlanta Falcons</a></td>
            <td class="right">12</td>
            <td class="right gamelink">
                <a href="/boxscores/201809060phi.htm">Final</a>
            </td>
        </tr>
        <tr class="winner">
            <td><a href="/teams/phi/2018.htm">Philadelphia Eagles</a></td>
            <td class="right">18</td>
            <td class="right">
            </td>
        </tr>
    </tbody>
</table>
<table class="stats">
    <tbody>
        <tr>
            <td><strong>PassYds</strong></td>
            <td><a href="/players/R/RyanMa00.htm" title="Matt Ryan">Ryan</a>-ATL</td>
            <td class="right">251</td>
        </tr>
        <tr>
            <td><strong>RushYds</strong></td>
            <td><a href="/players/A/AjayJa00.htm" title="Jay Ajayi">Ajayi</a>-PHI</td>
            <td class="right">62</td>
        </tr>
        <tr>
            <td><strong>RecYds</strong></td>
            <td><a href="/players/J/JoneJu02.htm" title="Julio Jones">Jones</a>-ATL</td>
            <td class="right">169</td>
        </tr>
    </tbody>
</table>
</div>

I get an error message saying that the ResultSet object has no attribute "td". Any help would be appreciated.

4 answers:

Answer 0: (score: 1)

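Be careful with tied games. I think that is what causes your error: a tie has no winner, so you will not find any row with the winner class. The code below prints the date and the winner.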

for game in games:
    date_block = game.find('tr',{'class':'date'})
    date_val = date_block.text
    winner_block = game.find('tr',{'class':'winner'})
    if winner_block:
        winner = winner_block.find('a').text
        print(date_val)
        print(winner)
    loser = game.findAll('tr',{'class':'loser'})

Output:

Sep 6, 2018
Philadelphia Eagles
Sep 9, 2018
New England Patriots
Sep 9, 2018
Tampa Bay Buccaneers
Sep 9, 2018
Minnesota Vikings
Sep 9, 2018
Miami Dolphins
Sep 9, 2018
Cincinnati Bengals
Sep 9, 2018
Baltimore Ravens
Sep 9, 2018
Jacksonville Jaguars
Sep 9, 2018
Kansas City Chiefs
Sep 9, 2018
Denver Broncos
Sep 9, 2018
Washington Redskins
Sep 9, 2018
Carolina Panthers
Sep 9, 2018
Green Bay Packers
Sep 10, 2018
New York Jets
Sep 10, 2018
Los Angeles Rams
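
The `ResultSet` error from the question can be reproduced on a trimmed copy of the HTML shown there. This is a minimal sketch, assuming `bs4` is installed; `SAMPLE_HTML` and the variable names are illustrative only. `find_all` (spelled `findAll` in the question) returns a list-like `ResultSet`, so tag attributes such as `.td` are only available on its individual elements, or on the single tag returned by `find`.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (illustrative sample only).
SAMPLE_HTML = """
<div class="game_summary expanded nohover">
<table class="teams"><tbody>
<tr class="date"><td colspan="3">Sep 6, 2018</td></tr>
<tr class="loser"><td><a href="/teams/atl/2018.htm">Atlanta Falcons</a></td><td class="right">12</td></tr>
<tr class="winner"><td><a href="/teams/phi/2018.htm">Philadelphia Eagles</a></td><td class="right">18</td></tr>
</tbody></table>
</div>
"""

game = BeautifulSoup(SAMPLE_HTML, "html.parser").find("div", {"class": "game_summary"})

# find_all returns a ResultSet (a list of tags), which has no .td attribute.
winner_rows = game.find_all("tr", {"class": "winner"})
try:
    winner_rows.td
    raised = False
except AttributeError:
    raised = True

# Either index into the ResultSet, or use find() to get a single tag.
winner_via_index = winner_rows[0].td.a.text
winner_via_find = game.find("tr", {"class": "winner"}).find("a").text
print(raised, winner_via_index, winner_via_find)
```

Using `find` also makes the tie case easy to detect, because it returns `None` when no matching row exists.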

Answer 1: (score: 0)

Your code looks basically correct.

import bs4

html = ''' ... '''
soup = bs4.BeautifulSoup(html, 'lxml')  # or 'html.parser' either way
print([elem.text for elem in soup.find_all('tr', {'class': 'loser'})])
['\nAtlanta Falcons\n12\n\nFinal\n\n']

What exactly is the problem?
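
If the goal is just the team name rather than the text of the whole row, the row can be narrowed further. A minimal sketch, assuming `bs4` is installed and using a hand-copied loser row from the question as sample input (`ROW_HTML` is an illustrative name):

```python
from bs4 import BeautifulSoup

# Loser row copied from the question's HTML (illustrative sample only).
ROW_HTML = """
<table><tr class="loser">
<td><a href="/teams/atl/2018.htm">Atlanta Falcons</a></td>
<td class="right">12</td>
<td class="right gamelink"><a href="/boxscores/201809060phi.htm">Final</a></td>
</tr></table>
"""

row = BeautifulSoup(ROW_HTML, "html.parser").find("tr", {"class": "loser"})

# stripped_strings yields each text fragment with surrounding whitespace removed
# and skips fragments that are whitespace only.
cells = list(row.stripped_strings)

# The team name is the link text inside the first <td>.
team = row.find("td").find("a").get_text(strip=True)
print(cells, team)
```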

Answer 2: (score: 0)

You can anchor the search at the "game_summaries" div:

import requests, json
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://www.pro-football-reference.com/boxscores/').text, 'html.parser')
def get_data(_soup_obj, _headers):
  _d = [(lambda x:[c.text for c in x.find_all('td')] if x is not None else [])(_soup_obj.find(a, {'class':b})) for a, b in _headers]
  if all(_d):
    [date], [t1, val, _], [t2, val2, _] = _d
    return {'date':date, 'winner':{'team':t1, 'score':int(val)}, 'loser':{'team':t2, 'score':int(val2)}}
  return {}

headers = [['tr', 'date'], ['tr', 'winner'], ['tr', 'loser']]
games = [get_data(i, headers) for i in d.find('div', {'class':'game_summaries'}).find_all('div', {'class':'game_summary'})]
print(json.dumps(games, indent=4))

Output:

[
  {
    "date": "Sep 6, 2018",
    "winner": {
        "team": "Philadelphia Eagles",
        "score": 18
    },
    "loser": {
        "team": "Atlanta Falcons",
        "score": 12
    }
 },
  {
    "date": "Sep 9, 2018",
    "winner": {
        "team": "New England Patriots",
        "score": 27
    },
    "loser": {
        "team": "Houston Texans",
        "score": 20
    }
 },
 {
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Tampa Bay Buccaneers",
        "score": 48
    },
    "loser": {
        "team": "New Orleans Saints",
        "score": 40
    }
 },
 {
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Minnesota Vikings",
        "score": 24
    },
    "loser": {
        "team": "San Francisco 49ers",
        "score": 16
    }
 },
 {
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Miami Dolphins",
        "score": 27
    },
    "loser": {
        "team": "Tennessee Titans",
        "score": 20
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Cincinnati Bengals",
        "score": 34
    },
    "loser": {
        "team": "Indianapolis Colts",
        "score": 23
    }
},
{},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Baltimore Ravens",
        "score": 47
    },
    "loser": {
        "team": "Buffalo Bills",
        "score": 3
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Jacksonville Jaguars",
        "score": 20
    },
    "loser": {
        "team": "New York Giants",
        "score": 15
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Kansas City Chiefs",
        "score": 38
    },
    "loser": {
        "team": "Los Angeles Chargers",
        "score": 28
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Denver Broncos",
        "score": 27
    },
    "loser": {
        "team": "Seattle Seahawks",
        "score": 24
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Washington Redskins",
        "score": 24
    },
    "loser": {
        "team": "Arizona Cardinals",
        "score": 6
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Carolina Panthers",
        "score": 16
    },
    "loser": {
        "team": "Dallas Cowboys",
        "score": 8
    }
},
{
    "date": "Sep 9, 2018",
    "winner": {
        "team": "Green Bay Packers",
        "score": 24
    },
    "loser": {
        "team": "Chicago Bears",
        "score": 23
    }
},
{
    "date": "Sep 10, 2018",
    "winner": {
        "team": "New York Jets",
        "score": 48
    },
    "loser": {
        "team": "Detroit Lions",
        "score": 17
    }
},
{
    "date": "Sep 10, 2018",
    "winner": {
        "team": "Los Angeles Rams",
        "score": 33
    },
    "loser": {
        "team": "Oakland Raiders",
        "score": 13
     }
  }
]
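
The empty `{}` entry in the output is what `get_data` returns when the date, winner, or loser row is missing, which would be the case for a game without a winner row. A small sketch of dropping such placeholders afterwards; the `games` list here is a shortened, hand-copied stand-in for the real output:

```python
import json

# Shortened stand-in for the list printed above (illustrative sample only).
games = [
    {"date": "Sep 6, 2018",
     "winner": {"team": "Philadelphia Eagles", "score": 18},
     "loser": {"team": "Atlanta Falcons", "score": 12}},
    {},  # placeholder returned when a required row was not found
    {"date": "Sep 9, 2018",
     "winner": {"team": "Baltimore Ravens", "score": 47},
     "loser": {"team": "Buffalo Bills", "score": 3}},
]

# Empty dicts are falsy, so a truthiness filter removes them.
decided = [g for g in games if g]
print(json.dumps([g["winner"]["team"] for g in decided]))
```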

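Answer 3: (score: 0)

You are probably running into this week's tie. The Pittsburgh/Cleveland game has no winner td. Running the following prints every game, including ties: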

for game in games:
    date_block = game.findAll("tr",{"class":"date"})
    date_val = date_block[0].text
    print("Game Date: %s" % date_val)
    # Test if a winner is defined
    if game.find("tr",{"class":"winner"}) is not None:
        winner_block = game.findAll("tr",{"class":"winner"})
        # Get the winner from the first td and print text only
        winner = winner_block[0].findAll("td")
        print("Winner: %s" % winner[0].get_text())

        loser_block = game.findAll("tr",{"class":"loser"})
        # Get the loser from the first td and print text only
        loser = loser_block[0].findAll("td")
        print("Loser: %s" % loser[0].get_text())
    else:
        # If no winner is listed, it must be a tie. Get both teams and print them.
        print("It's a tie!")
        draw_block = game.findAll("tr",{"class":"draw"})
        for team in draw_block:
            print("Draw: %s" % team.findAll("td")[0].get_text())
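
The same logic can be wrapped in a function that returns data instead of printing it. This is only a sketch, assuming `bs4` is installed: `first_cell_text`, `summarize_game`, and the `SAMPLE` markup are hypothetical, and the `draw` row class is taken from the code above rather than checked against the live page.

```python
from bs4 import BeautifulSoup


def first_cell_text(row):
    """Return the stripped text of the first <td> in a table row."""
    return row.find("td").get_text(strip=True)


def summarize_game(game):
    """Build a dict for one game_summary div, handling ties (hypothetical helper)."""
    result = {"date": first_cell_text(game.find("tr", {"class": "date"}))}
    winner = game.find("tr", {"class": "winner"})
    if winner is not None:
        result["winner"] = first_cell_text(winner)
        result["loser"] = first_cell_text(game.find("tr", {"class": "loser"}))
    else:
        # No winner row: treat every "draw" row as a tied team.
        result["draw"] = [first_cell_text(r)
                          for r in game.find_all("tr", {"class": "draw"})]
    return result


# Hand-written sample markup: one decided game and one tie.
SAMPLE = """
<div class="game_summary"><table class="teams">
<tr class="date"><td colspan="3">Sep 6, 2018</td></tr>
<tr class="loser"><td><a>Atlanta Falcons</a></td><td>12</td></tr>
<tr class="winner"><td><a>Philadelphia Eagles</a></td><td>18</td></tr>
</table></div>
<div class="game_summary"><table class="teams">
<tr class="date"><td colspan="3">Sep 9, 2018</td></tr>
<tr class="draw"><td><a>Pittsburgh Steelers</a></td><td>21</td></tr>
<tr class="draw"><td><a>Cleveland Browns</a></td><td>21</td></tr>
</table></div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
results = [summarize_game(g) for g in soup.find_all("div", {"class": "game_summary"})]
print(results)
```

Returning dictionaries keeps the tie handling in one place and leaves formatting to the caller.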