Loop stuck on the first page

Date: 2016-10-03 07:38:47

Tags: python python-2.7 scripting beautifulsoup

I'm using Beautiful Soup to loop through pages, but for whatever reason I cannot get the loop to advance beyond the first page. It seems like it should be easy, since it is just a text string, yet it keeps looping back to the same page; maybe the problem is the page structure and not my text string?

Here is what I have:

import csv
import urllib2
from bs4 import BeautifulSoup

f = open('nhlstats.csv', "w")


groups=['points', 'shooting', 'goaltending', 'defensive', 'timeonice', 'faceoffs', 'minor-penalties', 'major-penalties']

year = ["2016", "2015","2014","2013","2012"]

for yr in year:
    for gr in groups:
        url = "http://www.espn.com/nhl/statistics/player/_/stat/points/year/"+str(yr)
    #www.espn.com/nhl/statistics/player/_/stat/points/year/2014/
    page = urllib2.urlopen(url)
    soup=BeautifulSoup(page, "html.parser")
    pagecount = soup.findAll(attrs= {"class":"page-numbers"})[0].string
    pageliteral = int(pagecount[5:])
    for i in range(0,pageliteral):
        number = int(((i*40) + 1))
        URL = "http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/"+str(yr) + "/count/"+str(number)
        page = urllib2.urlopen(url)
        soup=BeautifulSoup(page, "html.parser")
        for tr in soup.select("#my-players-table tr[class*=player]"):
            row =[]
            for ob in range(1,15):
                player_info = tr('td')[ob].get_text(strip=True)
                row.append(player_info)
            f.write(str(yr) +","+",".join(row) + "\n")

f.close()

This repeatedly gets the same first 40 records.

I tried using this solution as a check, and did find that

prevLink = soup.select('a[rel="nofollow"]')[0]
newurl =  "http:" + prevLink.get('href')

works a bit better, but I am not sure how to loop with it in a way that actually advances. It may just be that I'm tired, but my loop still only moves on to the next set of records and then gets stuck on that one. Please help me fix my loop.

UPDATE

My formatting got lost in the copy and paste; my actual code is as follows:

import csv
import urllib2
from bs4 import BeautifulSoup

f = open('nhlstats.csv', "w")


groups=['points', 'shooting', 'goaltending', 'defensive', 'timeonice', 'faceoffs', 'minor-penalties', 'major-penalties']

year = ["2016", "2015","2014","2013","2012"]


for yr in year:
    for gr in groups:
        url = "http://www.espn.com/nhl/statistics/player/_/stat/points/year/"+str(yr)
    #www.espn.com/nhl/statistics/player/_/stat/points/year/2014/
        page = urllib2.urlopen(url)
        soup=BeautifulSoup(page, "html.parser")
        pagecount = soup.findAll(attrs= {"class":"page-numbers"})[0].string
        pageliteral = int(pagecount[5:])
        for i in range(0,pageliteral):
            number = int(((i*40) + 1))
            URL = "http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/"+str(yr) + "/count/"+str(number)
            page = urllib2.urlopen(url)
            soup=BeautifulSoup(page, "html.parser")
            for tr in soup.select("#my-players-table tr[class*=player]"):
                row =[]
                for ob in range(1,15):
                    player_info = tr('td')[ob].get_text(strip=True)
                    row.append(player_info)
                f.write(str(yr) +","+",".join(row) + "\n")

f.close()

2 Answers:

Answer 0 (score: 1)

Your code's indentation is mostly wrong. Also, it would be wise to actually use the csv library you import: it automatically wraps the player names in quotes, so any commas inside them do not break the CSV structure.
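For instance, a minimal sketch of that quoting behaviour (the filename and the row values here are made up purely for illustration):

import csv

# hypothetical example: the player-name field contains a comma
with open('example.csv', 'wb') as f_output:
    csv_output = csv.writer(f_output)
    csv_output.writerow(["2016", "Patrick Kane, RW", "CHI"])

# example.csv now contains:  2016,"Patrick Kane, RW",CHI
# csv.writer quoted the name automatically, so it stays a single field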

The script below works by finding the link to the next page and extracting its starting count, which is then used to build the GET request for the next page. If no next-page link is found, it moves on to the next year/group combination. Note that the count is not a page number but the number of the starting entry.

import csv
import urllib2
from bs4 import BeautifulSoup


groups= ['points', 'shooting', 'goaltending', 'defensive', 'timeonice', 'faceoffs', 'minor-penalties', 'major-penalties']
year = ["2016", "2015", "2014", "2013", "2012"]

with open('nhlstats.csv', "wb") as f_output:
    csv_output = csv.writer(f_output)

    for yr in year:
        for gr in groups:
            start_count = 1
            while True:
                #print "{}, {}, {}".format(yr, gr, start_count)     # show progress

                url = "http://www.espn.com/nhl/statistics/player/_/stat/points/sort/points/year/{}/count/{}".format(yr, start_count)
                page = urllib2.urlopen(url)
                soup = BeautifulSoup(page, "html.parser")

                for tr in soup.select("#my-players-table tr[class*=player]"):
                    row = [yr]
                    for ob in range(1, 15):
                        player_info = tr('td')[ob].get_text(strip=True)
                        row.append(player_info)

                    csv_output.writerow(row)

                try:
                    # the "page-numbers" element is followed by a link to the next page;
                    # its href ends with the starting count for that page, e.g. .../count/41
                    start_count = int(soup.find(attrs= {"class":"page-numbers"}).find_next('a')['href'].rsplit('/', 1)[1])
                except:
                    # no next-page link found: last page for this year/group, move on
                    break

Using with also closes your file automatically at the end.
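Roughly speaking, the with block is shorthand for a manual try/finally; a sketch of the equivalent, for illustration only:

f_output = open('nhlstats.csv', 'wb')
try:
    csv_output = csv.writer(f_output)
    # ... same scraping loops as above ...
finally:
    f_output.close()  # runs even if the scrape raises an exception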

This will give you a CSV file that looks like this:

2016,"Patrick Kane, RW",CHI,82,46,60,106,17,30,1.29,287,16.0,9,17,20
2016,"Jamie Benn, LW",DAL,82,41,48,89,7,64,1.09,247,16.6,5,17,13
2016,"Sidney Crosby, C",PIT,80,36,49,85,19,42,1.06,248,14.5,9,10,14
2016,"Joe Thornton, C",SJ,82,19,63,82,25,54,1.00,121,15.7,6,8,21

Answer 1 (score: 0)

Because of the indentation errors, you modify the URL several times before you ever open it the first time. Try this:

for gr in groups:
    url = "...some_url..."
    page = urllib2.urlopen(url)
    # ...everything else should be indented...
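In other words, a structural sketch of the corrected nesting, with the scraping details left out (this only illustrates the indentation; nothing else in the original script is changed):

for yr in year:
    for gr in groups:
        url = "http://www.espn.com/nhl/statistics/player/_/stat/points/year/" + str(yr)
        page = urllib2.urlopen(url)            # now runs once per year/group, not just after the last one
        soup = BeautifulSoup(page, "html.parser")
        # ... the page-count parsing and row extraction continue at this indentation level ...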