Associating the correct team with the correct score values

Time: 2013-08-16 06:31:02

Tags: python python-3.x html-parsing beautifulsoup

I have some code that outputs the teams and all of their score values (without spaces) from the page http://sports.yahoo.com/nhl/scoreboard?d=2013-04-01.

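from bs4 import BeautifulSoup
from urllib.request import urlopen

url = urlopen("http://sports.yahoo.com/nhl/scoreboard?d=2013-04-01")

content = url.read()

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(content, 'html.parser')

# Names and scores are collected as two unrelated, unseparated strings.
listnames = ''
listscores = ''

for table in soup.find_all('table', class_='scores'):
    for row in table.find_all('tr'):
        # Score cells: keep only the cells whose text is purely digits.
        for cell in row.find_all('td', class_='yspscores'):
            if cell.text.isdigit():
                listscores += cell.text
        # Team name cells carry the extra 'team' class.
        for cell in row.find_all('td', class_='yspscores team'):
            listnames += cell.text

print(listnames)
print(listscores)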

What I can't work out is how to get Python to take the extracted information and pair each team with its own integer values, like this:

Team X: 1, 5, 11.

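The problem with the site is that all the scores share the same class, and all the tables are under the same class too. The only thing that differs is the href.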

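1 answer:

Answer 0: (score: 0)

If you want to associate values with names, you would typically use a dict. Here is a modification of your code to illustrate the principle: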

from bs4 import BeautifulSoup
from urllib.request import urlopen

url = urlopen('http://sports.yahoo.com/nhl/scoreboard?d=2013-04-01')

content = url.read()

# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(content, 'html.parser')

# Maps each team name to the list of score strings found in its row.
results = {}

for table in soup.find_all('table', class_='scores'):
    for row in table.find_all('tr'):
        scores = []
        name = None
        for cell in row.find_all('td', class_='yspscores'):
            # The team name cell is the one that contains a link.
            link = cell.find('a')
            if link:
                name = link.text
            elif cell.text.isdigit():
                scores.append(cell.text)
        # Skip rows that contain no team name.
        if name is not None:
            results[name] = scores

for name, scores in results.items():
    print('%s: %s' % (name, ', '.join(scores)))

...which gives this output when run:

$ python3 get_scores.py
St. Louis: 1, 2, 1
San Jose: 0, 3, 0
Colorado: 0, 0, 2
Dallas: 0, 0, 0
New Jersey: 0, 1, 0
NY Islanders: 2, 0, 1
Nashville: 0, 0, 2, 0
Minnesota: 0, 1, 0
Detroit: 1, 2, 0
NY Rangers: 1, 1, 2
Anaheim: 0, 3, 1
Winnipeg: 2, 0, 0
Chicago: 1, 1, 0, 0
Calgary: 0, 0, 1
Vancouver: 0, 1, 1
Edmonton: 3, 0, 1
Montreal: 1, 1, 2
Carolina: 1, 0, 0
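
Note that the scores collected this way are still strings. Since the question asks for integer values, here is a minimal sketch of one way to convert them. The `to_int_scores` helper is a hypothetical name introduced for illustration, and the sample dict is copied from three rows of the output above rather than fetched from the page:

```python
# Sample data copied from three rows of the output above (not fetched live).
results = {
    'St. Louis': ['1', '2', '1'],
    'San Jose': ['0', '3', '0'],
    'Nashville': ['0', '0', '2', '0'],
}


def to_int_scores(results):
    """Return a new dict mapping each team name to a list of int scores.

    Hypothetical helper, not part of the original answer.
    """
    return {name: [int(s) for s in scores] for name, scores in results.items()}


int_results = to_int_scores(results)

for name, scores in int_results.items():
    # With integers, arithmetic such as sum() works on the listed values.
    listed = ', '.join(str(s) for s in scores)
    print('%s: %s (sum of listed values: %d)' % (name, listed, sum(scores)))
```

Converting after collection keeps the scraping loop unchanged. The `isdigit()` check in the loop already guarantees that every stored string can be passed to `int()`.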

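Besides using a dictionary, the other important change is that we now check for the presence of an a element to get the team's name, rather than relying on the additional team class. This is really a stylistic choice, but to me the code seems more expressive this way.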
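
To see why the check for an a element is needed inside a single loop over the `yspscores` cells, here is a small sketch run against inline markup. The HTML is a simplified, hypothetical stand-in for one scoreboard row (the href in particular is invented), so treat it as an illustration of how BeautifulSoup matches classes rather than a description of the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a simplified stand-in for one row, not the real page.
html = """
<table class="scores">
  <tr>
    <td class="yspscores team"><a href="/hypothetical/team/link">St. Louis</a></td>
    <td class="yspscores">1</td>
    <td class="yspscores">2</td>
    <td class="yspscores">1</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, 'html.parser')
row = soup.find('tr')

# A single class name matches every cell that carries it,
# including the cell whose class attribute is "yspscores team".
all_cells = row.find_all('td', class_='yspscores')

# A string with two class names matches the exact class attribute value.
team_cells = row.find_all('td', class_='yspscores team')

print(len(all_cells))
print(len(team_cells))

# Checking for a link separates the name cell from the score cells.
name = None
scores = []
for cell in all_cells:
    link = cell.find('a')
    if link:
        name = link.text
    elif cell.text.isdigit():
        scores.append(int(cell.text))

print(name, scores)
```

Because the team cell also matches `class_='yspscores'`, one pass over the row sees both the name and the scores, and the link check tells them apart.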