How to parse only the values of a specific tag

Time: 2016-07-18 20:20:53

Tags: python html-parsing bs4

I'm new to Python, and I've written some code to parse data from the ESPN site:

#importing packages/modules
from bs4 import BeautifulSoup
import pandas as pd
import urllib.request  # the request submodule must be imported explicitly

#additional data for scraping
url = 'http://espn.go.com/nhl/statistics/player/_/stat/points/sort/points/year/2015/seasontype/'

#scraping the site
page = urllib.request.urlopen(url + str(2)).read()
soup = BeautifulSoup(page, 'html.parser')  # name the parser explicitly to avoid the bs4 warning

#the header of the table
table_header = [ td.get_text() for td in soup.find_all('td')[:20] ]
table_header[-1] = 'SH A'
table_header[-2] = 'SH G'
table_header[-3] = 'PP A'
table_header[-4] = 'PP G'
table_header.remove('')
table_header.remove('PP')
table_header.remove('SH')

#the data for table


#print(player_names = [ a.get_text() for a in soup.find_all('tr') ])
player_name = [ a.get_text() for a in soup.find_all('tr')[2:12] ]

The question is: how do I get only the data located between the <a> tags? When I print() the player_name list, I get:

['1Jamie Benn, LWDAL823552871641.0625313.86101323', '2John Tavares, C NYI823848865461.0527813.78131801', '3Sidney Crosby, C PIT772856845471.0923711.83102100', '4Alex Ovechkin, LWWSH8153288110581.0039513.41125900', '\xa0Jakub Voracek, RWPHI822259811780.9922110.03112200', '6Nicklas Backstrom, C WSH821860785400.9515311.8333000', '7Tyler Seguin, C DAL71374077-1201.0828013.25131600', '8Jiri Hudler, LWCGY7831457617140.9715819.6561000', '\xa0Daniel Sedin, LWVAN822056765180.932268.9542100', '10Vladimir Tarasenko, RWSTL7737367327310.9526414.0681000']

Thanks a lot for your help!

1 answer:

Answer 0 (score: 0)

I would use a CSS selector to find only the data rows (the ones containing player data):

for tr in soup.select("#my-players-table tr[class*=player]"):
    player_name = tr('td')[1].get_text(strip=True)
    print(player_name)

class*=player means "the class attribute contains the substring player".

This prints:

Jamie Benn, LW
John Tavares, C
Sidney Crosby, C
Alex Ovechkin, LW
...
Jordan Eberle, RW
Ondrej Palat, LW
Zach Parise, LW
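
If only the text inside the <a> tag is wanted (the name without the position), one option is to look up the link inside each matched row. The following is a minimal sketch, not a tested scraper for the live page: SAMPLE_HTML is hypothetical markup loosely modeled on a stats table, and the helper names player_cells and player_links are made up for illustration. The real ESPN page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on a stats table.
# The real ESPN page may be structured differently.
SAMPLE_HTML = """
<table id="my-players-table">
  <tr class="colhead"><td>RK</td><td>PLAYER</td><td>TEAM</td></tr>
  <tr class="oddrow player-1"><td>1</td><td><a href="#">Jamie Benn</a>, LW</td><td>DAL</td></tr>
  <tr class="evenrow player-2"><td>2</td><td><a href="#">John Tavares</a>, C</td><td>NYI</td></tr>
</table>
"""


def player_cells(html):
    """Return the full text of the second cell of each player row."""
    soup = BeautifulSoup(html, "html.parser")
    # [class*=player] skips the "colhead" header row.
    rows = soup.select("#my-players-table tr[class*=player]")
    return [tr("td")[1].get_text(strip=True) for tr in rows]


def player_links(html):
    """Return only the text inside the first <a> of each player row."""
    soup = BeautifulSoup(html, "html.parser")
    names = []
    for tr in soup.select("#my-players-table tr[class*=player]"):
        link = tr.find("a")  # None when the row has no link
        if link is not None:
            names.append(link.get_text(strip=True))
    return names


print(player_cells(SAMPLE_HTML))
print(player_links(SAMPLE_HTML))
```

On the sample markup, player_cells returns the name together with the position, while player_links returns the name alone. Checking for None keeps the loop from failing on a row that happens to have no link.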