Trying to scrape playoff brackets from Wikipedia with Beautiful Soup. How do I identify the correct column?

Asked: 2019-05-05 23:51:18

Tags: web-scraping html-table beautifulsoup

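I have been trying to use Python's Beautiful Soup 4 to scrape the NHL playoff brackets from Wikipedia for every season since 1988. The format is inconsistent (sometimes a row contains more than one team, see https://en.wikipedia.org/wiki/2004_Stanley_Cup_playoffs), which makes this difficult. For each series in a given year, I want to identify the teams, the round, and the number of wins.

Initially I converted the table to text and used regular expressions to identify the teams and their information, but the order of the values changes depending on whether the bracket puts more than one team on a row.

Now I am trying to walk down the rows and count things like cells and column spans, but the results are inconsistent. I cannot figure out how to identify the fourth-round teams.

So far I have tried counting the number of cells that come before a cell containing a team name: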

import requests
from bs4 import BeautifulSoup as soup
hockeyteams = ['Anaheim','Arizona','Atlanta','Boston','Buffalo','Calgary','Carolina','Chicago','Colorado','Columbus','Dallas','Detroit',
               'Edmonton','Florida','Hartford','Los Angeles','Minnesota','Montreal','Nashville','New Jersey',
               'Ottawa','Philadelphia','Pittsburgh','Quebec','San Jose','St. Louis','Tampa Bay','Toronto','Vancouver','Vegas','Washington',
               'Winnipeg','NY Rangers','NY Islanders']

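# example page from the question; substitute the season you need
full_link = 'https://en.wikipedia.org/wiki/2004_Stanley_Cup_playoffs'
# fetch the page content
page_response = requests.get(full_link, timeout=5)
# parse the HTML with the built-in parser
page_content = soup(page_response.content, "html.parser")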

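tables = page_content.find_all('table')

# identify the bracket table: the first one that mentions both rounds
bracket = None
for table in tables:
    if 'Semi' in table.text and 'Stanley Cup Finals' in table.text:
        bracket = table
        break
if bracket is None:
    raise ValueError('bracket table not found')

row_num = 0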
for row in bracket.find_all('tr'):
    row_num += 1
    print(row_num,'#')
    colcnt = 0
    for col in row.find_all('td'):
        if "colspan" in col.attrs:
            colcnt += int(col.attrs['colspan'])
        else:
            colcnt += 1
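        # exact match against the list; a substring test on str(hockeyteams)
        # would also match empty cells and partial names
        if col.text.strip() in hockeyteams: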
            print(colcnt,col.text)


    print('col width:',colcnt)
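
One likely reason the counts above drift from row to row is that they add up `colspan` but ignore `rowspan`: a cell that spans several rows also occupies a column slot in the rows below it, even though those rows contain no `td` for it. The sketch below is one way to account for that, assuming the bracket table uses `rowspan`. The function names `cells_from_table` and `expand_grid` are made up for this illustration, and the toy rows (including the "Round 1" label) are hand-written, not taken from the real page.

```python
def cells_from_table(table):
    """Turn a BeautifulSoup <table> tag into rows of (text, rowspan, colspan)."""
    rows = []
    for tr in table.find_all('tr'):
        row = []
        # only direct children, so cells of nested tables are not mixed in
        for cell in tr.find_all(['td', 'th'], recursive=False):
            row.append((cell.get_text(strip=True),
                        int(cell.get('rowspan', 1)),
                        int(cell.get('colspan', 1))))
        rows.append(row)
    return rows


def expand_grid(rows):
    """Copy every cell into each grid slot it covers, honoring both spans."""
    grid = {}  # (row index, column index) -> cell text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # skip slots already claimed by a rowspan cell from an earlier row
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), '') for c in range(n_cols)] for r in range(n_rows)]


# Toy input: the first cell spans two rows, so the second row has one <td> fewer.
toy_rows = [
    [('Round 1', 2, 1), ('Tampa Bay', 1, 1), ('4', 1, 1)],
    [('NY Islanders', 1, 1), ('1', 1, 1)],
]
toy_grid = expand_grid(toy_rows)
for line in toy_grid:
    print(line)
```

With the real page, something like `expand_grid(cells_from_table(bracket))` should give every team a column index that is comparable across rows, which could then be mapped to a round.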

Ultimately, I want a dataframe like this:

Round, Team A, Team A wins, Team B, Team B wins
1, Tampa Bay, 4, NY Islanders, 1
2, Tampa Bay, 4, Montreal, 0

1 answer:

Answer 0 (score: 0)

That table can be scraped with pandas:

import pandas as pd
tables = pd.read_html('https://en.wikipedia.org/wiki/2004_Stanley_Cup_playoffs#Playoff_bracket')

bracket = tables[2].dropna(axis=1, how='all').dropna(axis=0, how='all')
print(bracket)

The output is full of NaN values, but it contains what I think you are looking for, and you can reshape it with standard pandas methods.
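
As a rough sketch of that clean-up step, the function below drops the NaN cells from each row and keeps every team name that is immediately followed by a number. It assumes the wins sit in the cell right after the team name, which may not hold for every season's bracket. The function name `team_wins` and the toy frame are made up for illustration; the toy frame only mimics a sparse layout and is not the real `read_html` output.

```python
import pandas as pd


def team_wins(frame, teams):
    """Collect (team, wins) pairs from a sparse, NaN-filled bracket frame."""
    records = []
    for _, row in frame.iterrows():
        values = row.dropna().tolist()  # keep only the filled cells, in order
        for i, value in enumerate(values[:-1]):
            name = str(value).strip()
            if name in teams:
                # assumption: the cell after the team name holds its win count
                wins = pd.to_numeric(values[i + 1], errors='coerce')
                if pd.notna(wins):
                    records.append({'team': name, 'wins': int(wins)})
    return pd.DataFrame(records, columns=['team', 'wins'])


# Hand-written toy frame using the two series from the question.
toy = pd.DataFrame([
    [None, 'Tampa Bay', 4, None, None],
    [None, 'NY Islanders', 1, None, None],
    [None, None, None, 'Tampa Bay', 4],
    [None, None, None, 'Montreal', 0],
])
result = team_wins(toy, {'Tampa Bay', 'NY Islanders', 'Montreal'})
print(result)
```

On the real data the call would look like `team_wins(bracket, set(hockeyteams))`, provided the names in the bracket match the entries of the team list exactly. Pairing consecutive rows into series and assigning a round from the column position would still have to be added on top.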