Question

import requests
from bs4 import BeautifulSoup
import csv
from urlparse import urljoin
import urllib2
from lxml import html

base_url = 'http://www.pro-football-reference.com' # base url for concatenation
data = requests.get("http://www.pro-football-reference.com/years/2014/games.htm") #website for scraping
soup = BeautifulSoup(data.content)
list_of_cells = []

for link in soup.find_all('a'):
    if link.has_attr('href'):
        if link.get_text() == 'boxscore':
            url = base_url + link['href']
            for x in url:
                response = requests.get('x')
                html = response.content
                soup = BeautifulSoup(html)
                table = soup.find('table', attrs={'class': 'stats_table x_large_text'})
                for row in table.findAll('tr'):
                    for cell in row.findAll('td'):
                        text = cell.text.replace('&nbsp;', '')
                        list_of_cells.append(text)
                        print list_of_cells

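I'm using this code to get all of the boxscore URLs from http://www.pro-football-reference.com/years/2014/games.htm. Once I have those boxscore URLs, I want to loop through them and scrape the quarter-by-quarter data for each team, but no matter how I format the code, my syntax always seems to be off.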

If possible, I'd like to scrape more than just the scoring data by also getting the Game Info, Officials, and Expected Points for each game.

Answer 1

If you modify your loop slightly to:

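for link in soup.find_all('a'):

    # Skip anchors without an href
    if not link.has_attr('href'):
        continue

    # Skip anchors that are not boxscore links
    if link.get_text() != 'boxscore':
        continue

    url = base_url + link['href']

    # Request the boxscore page itself (the url variable, not the string 'x')
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)

    # Scores
    table = soup.find('table', attrs={'id': 'scoring'})
    for row in table.findAll('tr'):
        for cell in row.findAll('td'):
            # BeautifulSoup has already decoded &nbsp; to u'\xa0',
            # so strip that character rather than the entity text
            text = cell.text.replace(u'\xa0', '')
            list_of_cells.append(text)
            print list_of_cells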

This returns every cell of every row in the scoring table for each page linked to with the 'boxscore' text.

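The issues I found with your existing code were:

You were attempting to loop through each character in the href returned for the 'boxscore' link.
You were always requesting the string 'x'.
Not so much an issue, but I changed the table selector to identify the table by its id 'scoring' rather than by its class. Ids at least should be unique within the page (though there is no guarantee).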
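
The first two issues can be reproduced without any network access. This is only a small sketch in Python 3 syntax, not the original code: the href value is a made-up placeholder, and the helper name urls_requested_by_buggy_loop is hypothetical. It just records what the inner loop would have passed to requests.get.

```python
base_url = 'http://www.pro-football-reference.com'
href = '/boxscores/example.htm'  # placeholder path, not a real boxscore link
url = base_url + href

# Iterating over a string yields its characters one at a time, so
# `for x in url` runs once per character rather than once per URL.
characters = [x for x in url]


def urls_requested_by_buggy_loop(url):
    """Return what the original inner loop would have passed to requests.get."""
    requested = []
    for x in url:
        # The original wrote requests.get('x'): the quotes make this the
        # literal one-character string 'x', not the loop variable x.
        requested.append('x')
    return requested


requested = urls_requested_by_buggy_loop(url)
print(characters[:4])               # ['h', 't', 't', 'p']
print(len(requested) == len(url))   # True: one request per character
print(set(requested))               # {'x'}
```

Since 'x' is not a valid URL (it has no scheme), the request fails before any page is fetched; in the versions of requests I have used this surfaces as a MissingSchema exception.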

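I'd recommend that you find each table (or HTML element) containing the data you want in the main loop (e.g. score_table = soup.find('table'...), but that you move the code that parses that data (e.g.)...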

for row in table.findAll('tr'):
    for cell in row.findAll('td'):
        # &nbsp; arrives already decoded as u'\xa0'
        text = cell.text.replace(u'\xa0', '')
        list_of_cells.append(text)
        print list_of_cells

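...into a separate function that returns said data (one function per type of data you are extracting), just to keep the code more manageable. The more the code is indented to handle if tests and for loops, the harder it tends to be to follow the flow. For example: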

score_table = soup.find('table', attrs={'id': 'scoring'})
score_data = parse_score_table(score_table)

other_table = soup.find('table', attrs={'id': 'other'})
other_data = parse_other_table(other_table)
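
As a rough sketch of what one of these functions could look like: parse_score_table is only named above, never defined, so its body here is an assumption rather than a definitive implementation. The HTML fragment is made-up sample markup, not the real page structure, and the code is written in Python 3 syntax using the built-in 'html.parser'.

```python
from bs4 import BeautifulSoup


def parse_score_table(table):
    """Return the table's <td> text as a list of rows (one list per <tr>)."""
    if table is None:
        # soup.find returns None when no matching table exists on the page
        return []
    rows = []
    for row in table.find_all('tr'):
        # &nbsp; is decoded to '\xa0' by the parser, so strip that character
        cells = [cell.get_text().replace('\xa0', '').strip()
                 for cell in row.find_all('td')]
        if cells:  # rows made only of <th> cells contain no <td>
            rows.append(cells)
    return rows


# Made-up sample markup, for illustration only
sample_html = """
<table id="scoring">
  <tr><th>Quarter</th><th>Team</th><th>Detail</th></tr>
  <tr><td>1</td><td>Home</td><td>Field goal</td></tr>
  <tr><td>2</td><td>Away</td><td>&nbsp;</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, 'html.parser')
score_table = soup.find('table', attrs={'id': 'scoring'})
score_data = parse_score_table(score_table)
print(score_data)

# A page without the requested table gives an empty list, not an AttributeError
missing = parse_score_table(soup.find('table', attrs={'id': 'other'}))
print(missing)
```

Returning one list per row, rather than a single flat list, keeps each scoring play together, so the result can be handed straight to csv.writer(...).writerows(score_data) if you want a CSV file.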

Syntax issue when scraping data
