在下面的代码示例中,我试图按预期刮取返回值的5个元素中的3个。 2(goal_scored和assists)不返回任何值。我已经验证数据确实存在于网页上,并且我使用的是正确的属性,但不确定为什么结果不会返回。我有什么明显的东西可以忽略吗?
import sys
from bs4 import BeautifulSoup as bs
import urllib2
import datetime as dt
import time
import pandas as pd
proxy_support = urllib2.ProxyHandler({})
opener = urllib2.build_opener(proxy_support)
player_name=[]
club =[]
position = []
goals_scored = []
assists = []
for p in range(25):
player_url = 'http://www.mlssoccer.com/stats/season?page={p}&franchise=select&year=2017&season_type=REG&group=goals'.format(
p=p)
page = opener.open(player_url).read()
player_soup = bs(page,"lxml")
print >>sys.stderr, '[{time}] Running page {n}...'.format(
time=dt.datetime.now(), n=p)
length = len(player_soup.find('tbody').findAll('tr'))
for row in range(0, length):
try:
name = player_soup.find('tbody').findAll('td', attrs={'data-title': 'Player'})[row].find('a').contents[0]
player_name.append(name)
team = player_soup.find('tbody').findAll('td', attrs={'data-title': 'Club'})[row].contents[0]
club.append(team)
pos = player_soup.find('tbody').findAll('td', attrs={'data-title': 'POS'})[row].contents[0]
position.append(pos)
goals = player_soup.find('tbody').findAll('td', attrs={'data-title': 'G' ,'class': 'responsive'})[row].contents[0]
goals_scored.apppend(goals)
a = player_soup.find('tbody').findAll('td', attrs={'data-title': 'A'})[row].contents[0]
assists.append(a)
except:
pass
player_data = {'player_name':player_name,
'club':club,
'position' : position,
'goals_scored' : goals_scored,
'assists' : assists,
}
df = pd.DataFrame.from_dict(player_data,orient='index')
df
我唯一可以弄清楚的是,HTML中没有返回数据的变量存在细微差别。我需要在我的代码中包含class = responsive吗?如果是这样,那可能是什么样的例子呢?
定位HTML:F
目标HTML:11
赞赏任何见解
答案 0 :(得分:0)
您可以尝试这样来获取所需的数据。我只解析了你需要的部分。您可以为数据帧执行其余操作。仅供参考,有两种类型的类附加到不同的td标签。 odd
和even
。不要忘记考虑这一点。
from bs4 import BeautifulSoup
import requests
page_url = "https://www.mlssoccer.com/stats/season?page={0}&franchise=select&year=2017&season_type=REG&group=goals"
for url in [page_url.format(p) for p in range(5)]:
soup = BeautifulSoup(requests.get(url).text, "lxml")
table = soup.select("table")[0]
for items in table.select(".odd,.even"):
player = items.select("td[data-title='Player']")[0].text
club = items.select("td[data-title='Club']")[0].text
position = items.select("td[data-title='POS']")[0].text
goals = items.select("td[data-title='G']")[0].text
assist = items.select("td[data-title='A']")[0].text
print(player,club,position,goals,assist)
部分结果如下:
Nemanja Nikolic CHI F 24 4
Diego Valeri POR M 21 11
Ola Kamara CLB F 18 3
由于我已经在我的脚本中包含了这两个类,因此您将从该站点获取所有数据。