Extracting Web Data with Beautiful Soup (Python 2.7)

Date: 2017-11-20 19:19:05

Tags: python python-2.7 web-scraping beautifulsoup

In the code sample below, 3 of the 5 elements I'm trying to scrape return values as expected. The other 2 (goals_scored and assists) do not return any values. I've verified that the data does exist on the web page and that I'm using the correct attributes, but I'm not sure why those results aren't coming back. Is there something obvious I'm overlooking?

import sys
from bs4 import BeautifulSoup as bs
import urllib2 
import datetime as dt
import time
import pandas as pd

proxy_support = urllib2.ProxyHandler({})
opener = urllib2.build_opener(proxy_support)

player_name = []
club = []
position = []
goals_scored = []
assists = []

for p in range(25):
    player_url = 'http://www.mlssoccer.com/stats/season?page={p}&franchise=select&year=2017&season_type=REG&group=goals'.format(
        p=p)
    page = opener.open(player_url).read()
    player_soup = bs(page, "lxml")
    print >>sys.stderr, '[{time}] Running page {n}...'.format(
        time=dt.datetime.now(), n=p)
    length = len(player_soup.find('tbody').findAll('tr'))

    for row in range(0, length):
        try:
            name = player_soup.find('tbody').findAll('td', attrs={'data-title': 'Player'})[row].find('a').contents[0]
            player_name.append(name)
            team = player_soup.find('tbody').findAll('td', attrs={'data-title': 'Club'})[row].contents[0]
            club.append(team)
            pos = player_soup.find('tbody').findAll('td', attrs={'data-title': 'POS'})[row].contents[0]
            position.append(pos)
            goals = player_soup.find('tbody').findAll('td', attrs={'data-title': 'G', 'class': 'responsive'})[row].contents[0]
            goals_scored.apppend(goals)
            a = player_soup.find('tbody').findAll('td', attrs={'data-title': 'A'})[row].contents[0]
            assists.append(a)
        except:
            pass

player_data = {'player_name': player_name,
               'club': club,
               'position': position,
               'goals_scored': goals_scored,
               'assists': assists,
               }

df = pd.DataFrame.from_dict(player_data,orient='index')

df

The only thing I can figure is that there is a subtle difference in the HTML for the variables that don't return data. Do I need to include class="responsive" in my code? If so, what might an example of that look like?

Position HTML: F

Goals HTML: 11

Any insight is appreciated.

1 Answer:

Answer 0 (score: 0)

You can try it like this to get the data you want. I've only parsed the part you need; you can do the rest for your dataframe. FYI, there are two types of classes attached to the different td tags: odd and even. Don't forget to take that into account.

from bs4 import BeautifulSoup
import requests

page_url = "https://www.mlssoccer.com/stats/season?page={0}&franchise=select&year=2017&season_type=REG&group=goals"
for url in [page_url.format(p) for p in range(5)]:
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    table = soup.select("table")[0]
    for items in table.select(".odd,.even"):
        player = items.select("td[data-title='Player']")[0].text 
        club = items.select("td[data-title='Club']")[0].text
        position = items.select("td[data-title='POS']")[0].text
        goals = items.select("td[data-title='G']")[0].text
        assist = items.select("td[data-title='A']")[0].text
        print(player,club,position,goals,assist)

Partial results look like this:

Nemanja Nikolic CHI F 24 4
Diego Valeri POR M 21 11
Ola Kamara CLB F 18 3

Since I've included both classes in my script, you should get all of the data from that site.
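For completeness, here is a minimal sketch (not part of the original answer) of how the same per-row parsing could be collected into the pandas DataFrame the question was building. The column names simply mirror the question's lists, and the page range is illustrative:

from bs4 import BeautifulSoup
import requests
import pandas as pd

page_url = "https://www.mlssoccer.com/stats/season?page={0}&franchise=select&year=2017&season_type=REG&group=goals"

records = []
for url in [page_url.format(p) for p in range(5)]:  # page range is illustrative
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    table = soup.select("table")[0]
    # Rows carry either the "odd" or the "even" class, so select both.
    for items in table.select(".odd,.even"):
        records.append({
            'player_name': items.select("td[data-title='Player']")[0].text,
            'club': items.select("td[data-title='Club']")[0].text,
            'position': items.select("td[data-title='POS']")[0].text,
            'goals_scored': items.select("td[data-title='G']")[0].text,
            'assists': items.select("td[data-title='A']")[0].text,
        })

df = pd.DataFrame(records)
print(df.head())

Building a list of dicts and constructing the DataFrame once at the end avoids the mismatched-length problem that can arise when appending to separate lists inside a try/except block.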