For Loop Through Pages in Python

Time: 2015-05-11 23:59:47

Tags: python python-2.7 web-scraping beautifulsoup python-requests

I'm new to Python, and I'm trying to do some simple web scraping to pull football statistics.

I've successfully pulled data for a single page at a time, but I can't figure out how to add a loop to my code to scrape multiple pages at once (or multiple positions/years/conferences, for that matter).

I've searched a fair amount on this site and others, but I can't seem to get it right.

Here's my code:

import csv
import requests
from BeautifulSoup import BeautifulSoup

url = 'http://www.nfl.com/stats/categorystats?seasonType=REG&d-447263-n=1&d-447263-o=2&d-447263-p=1&d-447263-s=PASSING_YARDS&tabSeq=0&season=2014&Submit=Go&experience=&archive=false&statisticCategory=PASSING&conference=null&qualified=false'
response = requests.get(url)
html = response.content

soup = BeautifulSoup(html)
table = soup.find('table', attrs={'class': 'data-table1'})

# Collect the text of every <td> cell, one list per table row
list_of_rows = []
for row in table.findAll('tr'):
    list_of_cells = []
    for cell in row.findAll('td'):
        # Drop HTML-escaped apostrophes left in the cell text
        text = cell.text.replace('&#39', '')
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

#for line in list_of_rows: print ', '.join(line)

outfile = open("./2014.csv", "wb")
writer = csv.writer(outfile)
writer.writerow(["Rk", "Player", "Team", "Pos", "Comp", "Att", "Pct", "Att/G", "Yds", "Avg", "Yds/G", "TD", "Int", "1st", "1st%", "Lng", "20+", "40+", "Sck", "Rate"])
writer.writerows(list_of_rows)

outfile.close()

Here's my attempt at adding a variable to the URL and building a loop:

import csv
import requests
from BeautifulSoup import BeautifulSoup

pagelist = ["1", "2", "3"]

x = 0
while (x < 500):
    url = "http://www.nfl.com/stats/categorystats?seasonType=REG&d-447263-n=1&d-447263-o=2&d-447263-p="+str(x)).read(),'html'+"&d-447263-s=RUSHING_ATTEMPTS_PER_GAME_AVG&tabSeq=0&season=2014&Submit=Go&experience=&archive=false&statisticCategory=RUSHING&conference=null&qualified=false"

    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)
    table = soup.find('table', attrs={'class': 'data-table1'})
    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace('&#39', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)

    #for line in list_of_rows: print ', '.join(line)


    outfile = open("./2014.csv", "wb")
    writer = csv.writer(outfile)
    writer.writerow(["Rk", "Player", "Team", "Pos", "Att", "Att/G", "Yds", "Avg", "Yds/G", "TD", "Long", "1st", "1st%", "20+", "40+", "FUM"])
    writer.writerows(list_of_rows)
    x = x + 1
    outfile.close()

Thanks very much in advance.

Here's my revised code; it seems to be deleting each page's data as it writes to the csv file.

import csv
import requests
from BeautifulSoup import BeautifulSoup

url_template = 'http://www.nfl.com/stats/categorystats?tabSeq=0&season=2014&seasonType=REG&experience=&Submit=Go&archive=false&d-447263-p=%s&conference=null&statisticCategory=PASSING&qualified=false'

for p in ['1','2','3']:
    url = url_template % p
    response = requests.get(url)
    html = response.content
    soup = BeautifulSoup(html)
    table = soup.find('table', attrs={'class': 'data-table1'})

    list_of_rows = []
    for row in table.findAll('tr'):
        list_of_cells = []
        for cell in row.findAll('td'):
            text = cell.text.replace('&#39', '')
            list_of_cells.append(text)
        list_of_rows.append(list_of_cells)

    #for line in list_of_rows: print ', '.join(line)

        outfile = open("./2014Passing.csv", "wb")
        writer = csv.writer(outfile)
        writer.writerow(["Rk", "Player", "Team", "Pos", "Comp", "Att", "Pct", "Att/G", "Yds", "Avg", "Yds/G", "TD", "Int", "1st", "1st%", "Lng", "20+", "40+", "Sck", "Rate"])
        writer.writerows(list_of_rows)

outfile.close()

1 Answer:

Answer 0 (score: 0)

Assuming you only want to change the page number, you can do something like the following, using string formatting:

import csv
import requests
from BeautifulSoup import BeautifulSoup

url_template = 'http://www.nfl.com/stats/categorystats?seasonType=REG&d-447263-n=1&d-447263-o=2&d-447263-p=%s&d-447263-s=PASSING_YARDS&tabSeq=0&season=2014&Submit=Go&experience=&archive=false&statisticCategory=PASSING&conference=null&qualified=false'
for page in [1, 2, 3]:
    url = url_template % page
    response = requests.get(url)
    # Rest of the processing code can go here
    outfile = open("./2014.csv", "ab")
    writer = csv.writer(outfile)
    writer.writerow(...)
    writer.writerows(list_of_rows)
    outfile.close()

Note that you should open the file in append mode ("ab") rather than write mode ("wb"), because the latter overwrites the existing contents, as you've experienced. With append mode, new content is written to the end of the file.
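For what it's worth, here's a minimal sketch of a variation on the same idea (assuming the question's Python 2 / BeautifulSoup setup): open the file and write the header once before the loop, so the header row isn't repeated for every page and a single writer appends each page's rows.

import csv
import requests
from BeautifulSoup import BeautifulSoup

url_template = 'http://www.nfl.com/stats/categorystats?seasonType=REG&d-447263-n=1&d-447263-o=2&d-447263-p=%s&d-447263-s=PASSING_YARDS&tabSeq=0&season=2014&Submit=Go&experience=&archive=false&statisticCategory=PASSING&conference=null&qualified=false'

outfile = open("./2014.csv", "wb")
writer = csv.writer(outfile)
# Header is written exactly once, before any page is fetched
writer.writerow(["Rk", "Player", "Team", "Pos", "Comp", "Att", "Pct", "Att/G", "Yds", "Avg", "Yds/G", "TD", "Int", "1st", "1st%", "Lng", "20+", "40+", "Sck", "Rate"])

for page in [1, 2, 3]:
    url = url_template % page
    soup = BeautifulSoup(requests.get(url).content)
    table = soup.find('table', attrs={'class': 'data-table1'})
    for row in table.findAll('tr'):
        cells = [cell.text.replace('&#39', '') for cell in row.findAll('td')]
        if cells:  # header rows contain only <th> cells, so skip them
            writer.writerow(cells)

outfile.close()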

This goes beyond the scope of the question and is more of a friendly code-improvement suggestion, but the script would become easier to reason about if you split it into smaller functions that each do one thing, e.g. fetching the data from the site, writing the csv, and so on; a sketch follows.
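As a rough illustration of that suggestion, a sketch only (the helper names fetch_rows and write_csv are made up for this example, using the same libraries as the question):

import csv
import requests
from BeautifulSoup import BeautifulSoup

URL_TEMPLATE = 'http://www.nfl.com/stats/categorystats?seasonType=REG&d-447263-n=1&d-447263-o=2&d-447263-p=%s&d-447263-s=PASSING_YARDS&tabSeq=0&season=2014&Submit=Go&experience=&archive=false&statisticCategory=PASSING&conference=null&qualified=false'
HEADER = ["Rk", "Player", "Team", "Pos", "Comp", "Att", "Pct", "Att/G", "Yds", "Avg", "Yds/G", "TD", "Int", "1st", "1st%", "Lng", "20+", "40+", "Sck", "Rate"]


def fetch_rows(page):
    """Fetch one page of stats and return the table cells, one list per row."""
    html = requests.get(URL_TEMPLATE % page).content
    table = BeautifulSoup(html).find('table', attrs={'class': 'data-table1'})
    rows = []
    for row in table.findAll('tr'):
        cells = [cell.text.replace('&#39', '') for cell in row.findAll('td')]
        if cells:  # skip header rows, which have no <td> cells
            rows.append(cells)
    return rows


def write_csv(path, header, rows):
    """Write a header row followed by the data rows to a csv file."""
    outfile = open(path, "wb")
    writer = csv.writer(outfile)
    writer.writerow(header)
    writer.writerows(rows)
    outfile.close()


def main():
    all_rows = []
    for page in [1, 2, 3]:
        all_rows.extend(fetch_rows(page))
    write_csv("./2014.csv", HEADER, all_rows)


if __name__ == '__main__':
    main()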