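I've been trying to scrape the NHL playoff brackets from Wikipedia for each season since 1988 using Python's Beautiful Soup 4. The inconsistent format (sometimes there is more than one team in a row, see https://en.wikipedia.org/wiki/2004_Stanley_Cup_playoffs) makes this difficult. I want to identify the teams, the round, and the number of wins for every series in a given year.
At first I converted the table to text and used regular expressions to pick out the teams and their information, but the order changes depending on whether the bracket allows more than one team per row.
Now I'm trying to walk down the rows and count things like cells and column spans, but the results are inconsistent. I can't work out how to identify the fourth-round teams.
So far, I've tried counting the number of cells before reaching a cell that holds a team...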
    import requests
    from bs4 import BeautifulSoup as soup

    hockeyteams = ['Anaheim','Arizona','Atlanta','Boston','Buffalo','Calgary','Carolina','Chicago','Colorado','Columbus','Dallas','Detroit',
                   'Edmonton','Florida','Hartford','Los Angeles','Minnesota','Montreal','Nashville','New Jersey',
                   'Ottawa','Philadelphia','Pittsburgh','Quebec','San Jose','St. Louis','Tampa Bay','Toronto','Vancouver','Vegas','Washington',
                   'Winnipeg','NY Rangers','NY Islanders']

    # full_link is not defined in this snippet; the example page above is assumed here
    full_link = 'https://en.wikipedia.org/wiki/2004_Stanley_Cup_playoffs'

    # fetch the content from the url
    page_response = requests.get(full_link, timeout=5)
    # use the html parser to parse the page
    page_content = soup(page_response.content, "html.parser")
    tables = page_content.find_all('table')

    # identify the appropriate table
    for table in tables:
        if 'Semi' in table.text and 'Stanley Cup Finals' in table.text:
            bracket = table
            break

    # walk down the rows, tracking the column position through colspan
    row_num = 0
    for row in bracket.find_all('tr'):
        row_num += 1
        print(row_num, '#')
        colcnt = 0
        for col in row.find_all('td'):
            if "colspan" in col.attrs:
                colcnt += int(col.attrs['colspan'])
            else:
                colcnt += 1
            # report cells whose text is exactly a known team name
            if col.text.strip(' \n') in hockeyteams:
                print(colcnt, col.text)
        print('col width:', colcnt)
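Counting only `colspan` ignores cells that reach down from earlier rows through `rowspan`, which may be one reason the column counts differ from row to row. A minimal sketch of one way around that, assuming the bracket uses ordinary integer `rowspan`/`colspan` attributes: copy every cell into each grid position it covers, so a column index means the same thing in every row. The names `expand_spans` and `cells_from_row` are hypothetical helpers, not part of Beautiful Soup.

```python
def expand_spans(rows):
    """Expand rows of (text, rowspan, colspan) cells into a rectangular grid.

    Each spanned cell is copied into every grid position it covers, so a
    cell's column index is comparable between rows.
    """
    grid = {}  # (row_index, col_index) -> text
    for r, cells in enumerate(rows):
        c = 0
        for text, rowspan, colspan in cells:
            # Skip positions already claimed by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions that no cell covers are filled with an empty string.
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


def cells_from_row(row):
    """Read (text, rowspan, colspan) triples from a Beautiful Soup <tr> tag."""
    return [
        (cell.get_text(strip=True),
         int(cell.get("rowspan", 1)),
         int(cell.get("colspan", 1)))
        for cell in row.find_all(["td", "th"], recursive=False)
    ]
```

With the `bracket` table from the code above, the grid could then be built with `expand_spans([cells_from_row(tr) for tr in bracket.find_all('tr')])`. This sketch does not handle malformed tables where spans overlap.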
Ultimately, I want a dataframe like this:
Round, Team A, Wins A, Team B, Wins B
1, Tampa Bay, 4, NY Islanders, 1
2, Tampa Bay, 4, Montreal, 0
etc.
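To make the target layout concrete, here is a rough sketch of how rows in that shape could be produced once the bracket is available as a rectangular grid of cell texts. It rests on several assumptions that may not hold for every season's page: each round occupies its own column, the wins cell sits immediately to the right of the team name, and the two teams of a series are consecutive entries in a column. `series_from_grid` is a hypothetical helper and the grid below is a small hand-written sample, not scraped output.

```python
def series_from_grid(grid, team_names):
    """Pair team cells column by column into
    (round, team_a, wins_a, team_b, wins_b) tuples."""
    teams = set(team_names)
    by_column = {}  # column index -> [(team, wins), ...] from top to bottom
    for row in grid:
        c = 0
        while c < len(row):
            text = row[c]
            if text in teams:
                start = c
                # Step over copies of the same cell created by a colspan.
                while c < len(row) and row[c] == text:
                    c += 1
                wins = row[c] if c < len(row) else ""
                if wins.isdigit():
                    entries = by_column.setdefault(start, [])
                    # Ignore copies of the same cell created by a rowspan.
                    if not entries or entries[-1] != (text, int(wins)):
                        entries.append((text, int(wins)))
            else:
                c += 1
    series = []
    # Assumption: columns holding teams appear in round order, left to right.
    for round_no, col in enumerate(sorted(by_column), start=1):
        entries = by_column[col]
        for a, b in zip(entries[0::2], entries[1::2]):
            series.append((round_no, a[0], a[1], b[0], b[1]))
    return series


# Hand-written sample grid for illustration only.
sample_grid = [
    ["Tampa Bay", "4", "", ""],
    ["NY Islanders", "1", "Tampa Bay", "4"],
    ["", "", "Montreal", "0"],
    ["Boston", "3", "", ""],
    ["Montreal", "4", "", ""],
]
sample_series = series_from_grid(
    sample_grid, ["Tampa Bay", "NY Islanders", "Montreal", "Boston"]
)
```

The resulting tuples can be passed to `pd.DataFrame(sample_series, columns=['Round', 'Team A', 'Wins A', 'Team B', 'Wins B'])` to get the dataframe described above.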
Answer 0 (score: 0)
That table can be scraped with pandas:
import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/2004_Stanley_Cup_playoffs#Playoff_bracket')
bracket = tables[2].dropna(axis=1, how='all').dropna(axis=0, how='all')
print(bracket)
The output is full of NaN values, but it contains what I think you're looking for, and you can reshape it with standard pandas methods.
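As one possible starting point for that cleanup, the sketch below drops the missing values row by row so that only the populated cells remain. The frame is a hand-made stand-in for what `read_html` returns, not the real output, and `compact_rows` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Hand-made stand-in for a NaN-filled bracket frame (not real read_html output).
raw = pd.DataFrame([
    [np.nan, "Tampa Bay", 4, np.nan, np.nan],
    [np.nan, "NY Islanders", 1, "Tampa Bay", 4],
    [np.nan, np.nan, np.nan, "Montreal", 0],
])

# Same cleanup as above: drop columns and rows that are entirely empty.
bracket = raw.dropna(axis=1, how='all').dropna(axis=0, how='all')


def compact_rows(frame):
    """Return each row's non-missing values as a plain list, skipping empty rows."""
    rows = []
    for _, row in frame.iterrows():
        values = row.dropna().tolist()
        if values:
            rows.append(values)
    return rows


compact = compact_rows(bracket)
```

Each resulting list alternates team names and win counts in left-to-right order. Compacting discards the column positions, so if the round has to be recovered from the layout, the column labels need to be kept alongside the values.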