Scraping multiple pages in one Beautiful Soup script: getting the same result

Asked: 2015-09-26 17:17:16

Tags: python python-2.7 beautifulsoup

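I am trying to loop a script that parses a table with Beautiful Soup in Python 2.7.

The first table parse works and produces the expected results. The second iteration of the loop produces exactly the same results as the first. Additional details:

  • If I manually open the URL that the second iteration parses, I do get the target page I want to scrape. There is a slight delay before it refreshes.
  • I have used this approach on other websites, and the loop works as expected.

Here is the script: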

    import urllib2
    import csv
    from bs4 import BeautifulSoup # latest version bs4

    week = raw_input("Which week?")
    week = str(week)
    data = []
    first = "http://fantasy.nfl.com/research/projections#researchProjections=researchProjections%2C%2Fresearch%2Fprojections%253Foffset%253D"
    middle = "%2526position%253DO%2526sort%253DprojectedPts%2526statCategory%253DprojectedStats%2526statSeason%253D2015%2526statType%253DweekProjectedStats%2526statWeek%253D"
    last = "%2Creplace"
    page_num = 1
    for page_num in range(1,3):
        page_mult = (page_num-1) * 25 +1
        next = str(page_mult)
        url = first + next + middle + week + last
        print url #I added this in order to check my output
        html = urllib2.urlopen(url).read()
        soup = BeautifulSoup(html,"lxml")
        table = soup.find('table', attrs={'class':'tableType-player hasGroups'})
        table_body = table.find('tbody')

        rows = table_body.find_all('tr')
        for row in rows:
            cols = row.find_all('td')
            cols = [ele.text.strip() for ele in cols]
            data.append([ele for ele in cols if ele]) # Get rid of empty values
        b = open('NFLtable.csv', 'w')
        a = csv.writer(b)
        a.writerows(data)
        b.close()
        page_num =page_num+1
    print data

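1 Answer:

Answer 0 (score: 1)

On the actual page, they use AJAX to request the additional results, and the response is JSON that carries some HTML as one of its values.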
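As a rough illustration of why both iterations fetched the same page: everything after the `#` in a URL is a fragment, which HTTP clients keep to themselves and never send to the server. The sketch below reuses the URL pieces from the question; the helper names `fragment_url` and `sent_to_server` are made up for this example, and the import fallback is only there so it runs on both Python 2.7 and Python 3.

```python
try:
    from urllib.parse import urlsplit, unquote  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2.7
    from urllib import unquote

# URL pieces copied from the question.
first = "http://fantasy.nfl.com/research/projections#researchProjections=researchProjections%2C%2Fresearch%2Fprojections%253Foffset%253D"
middle = "%2526position%253DO%2526sort%253DprojectedPts%2526statCategory%253DprojectedStats%2526statSeason%253D2015%2526statType%253DweekProjectedStats%2526statWeek%253D"
last = "%2Creplace"


def fragment_url(offset, week):
    """Build the URL the same way the question's loop does (hypothetical helper)."""
    return first + str(offset) + middle + str(week) + last


def sent_to_server(url):
    """Rebuild the URL without its fragment, i.e. the part an HTTP client requests."""
    parts = urlsplit(url)
    base = parts.scheme + "://" + parts.netloc + parts.path
    return base + ("?" + parts.query if parts.query else "")


page1 = fragment_url(1, 4)
page2 = fragment_url(26, 4)

# The two URLs differ only in the fragment, so both lines print the same thing.
print(sent_to_server(page1))
print(sent_to_server(page2))

# The fragment is percent-encoded twice. Decoding it twice reveals the query
# string embedded in it (presumably what the page's JavaScript requests).
print(unquote(unquote(urlsplit(page2).fragment)))
```

The decoded fragment contains the same `offset`, `position`, `sort` and `statWeek` parameters that the rewritten script passes as an ordinary query string.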

I modified your code slightly; give it a try:

    import urllib2
    import csv
    from bs4 import BeautifulSoup  # latest version bs4
    import json

    week = raw_input("Which week?")
    week = str(week)
    data = []
    url_format = "http://fantasy.nfl.com/research/projections?offset={offset}&position=O&sort=projectedPts&statCategory=projectedStats&statSeason=2015&statType=weekProjectedStats&statWeek={week}"

    for page_num in range(1, 3):
        # 25 rows per page: offsets are 1, 26, ...
        page_mult = (page_num - 1) * 25 + 1
        url = url_format.format(week=week, offset=page_mult)
        print url  # I added this in order to check my output

        # Ask for the AJAX version of the page; the reply is JSON, not HTML
        request = urllib2.Request(url, headers={'Ajax-Request': 'researchProjections'})
        raw_json = urllib2.urlopen(request).read()
        parsed_json = json.loads(raw_json)
        html = parsed_json['content']  # the table HTML is stored under 'content'

        soup = BeautifulSoup(html, "html.parser")
        table = soup.find('table', attrs={'class': 'tableType-player hasGroups'})
        table_body = table.find('tbody')

        rows = table_body.find_all('tr')
        for row in rows:
            cols = row.find_all('td')
            cols = [ele.text.strip() for ele in cols]
            data.append([ele for ele in cols if ele])  # Get rid of empty values

    print data

I tested it with week = 4.
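For anyone who wants to check the non-network steps in isolation, here is a minimal offline sketch of the two of them: building the per-page URL from the offset arithmetic, and pulling the HTML out of the JSON envelope. The helper names `build_url` and `extract_html` and the sample payload are invented for illustration; only the `content` key and the query parameters come from the script above. It uses `urlencode` instead of a format string, which is a stylistic substitution, and runs on Python 2.7 and Python 3.

```python
import json

try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2.7

BASE = "http://fantasy.nfl.com/research/projections"
PAGE_SIZE = 25  # step between offsets, taken from the loop in the script


def build_url(page_num, week):
    """Return the projections URL for a 1-based page number (hypothetical helper)."""
    offset = (page_num - 1) * PAGE_SIZE + 1
    params = [
        ("offset", offset),
        ("position", "O"),
        ("sort", "projectedPts"),
        ("statCategory", "projectedStats"),
        ("statSeason", 2015),
        ("statType", "weekProjectedStats"),
        ("statWeek", week),
    ]
    return BASE + "?" + urlencode(params)


def extract_html(raw_json):
    """Return the HTML stored under 'content', failing loudly if it is absent."""
    parsed = json.loads(raw_json)
    if "content" not in parsed:
        raise ValueError("no 'content' key in response; keys: %s" % sorted(parsed))
    return parsed["content"]


# Made-up payload standing in for a real response, for illustration only.
sample = json.dumps({"content": "<table><tbody><tr><td>x</td></tr></tbody></table>"})

urls = [build_url(n, 4) for n in range(1, 3)]
html = extract_html(sample)

print(urls[0])
print(urls[1])
print(html)
```

The string returned by `extract_html` is what gets handed to `BeautifulSoup(html, "html.parser")` in the script.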