Python scraping request

Time: 2017-11-23 18:27:53

Tags: python csv web-scraping

When I run this script, IDLE doesn't continue; normally it would at least give an error. Other scripts run fine, so I know it isn't IDLE. I think my code is correct, but maybe I'm missing something. This isn't everything I'll scrape from the site — I just want to see this part work first, and then I can finish the rest later.

import csv
import requests
import os

##HOME TEAM

req = requests.get('http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=10%2F17%2F2017&DateTo=04%2F11%2F2018&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=Home&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=')
data = req.json()

my_data = []
pk = data['resultSets']
for item in data:
    team = item.get['rowSet']

    for item in team:
        Team_Id = item[0]
        Team_Name = item[1]
my_data.append([Team_Id, Team_Name])
headers = ["Team_Id", "Team_Name"]

with open("NBA_Home_Team.csv", "a", newline='') as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(my_data)
f.close()

##os.system("taskkill /f /im pythonw.exe")

1 Answer:

Answer 0 (score: 1)

It seems to hang because the server never responds. You can verify this by killing the process and inspecting the stack trace:

Traceback (most recent call last):
    req = requests.get('http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=10%2F17%2F2017&DateTo=04%2F11%2F2018&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=Home&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=')
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 502, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 612, in send                                                                                                                                           
    r = adapter.send(request, **kwargs)                                                                                                                                                                                           
  File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 440, in send                                                                                                                                           
    timeout=timeout                                                                                                                                   
  File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 600, in urlopen
    chunked=chunked)               
  File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 379, in _make_request
    httplib_response = conn.getresponse(buffering=True)                              
  File "/usr/lib/python2.7/httplib.py", line 1121, in getresponse
    response.begin()                                            
  File "/usr/lib/python2.7/httplib.py", line 438, in begin
    version, status, reason = self._read_status()               
  File "/usr/lib/python2.7/httplib.py", line 394, in _read_status
    line = self.fp.readline(_MAXLINE + 1)                           
  File "/usr/lib/python2.7/socket.py", line 480, in readline
    data = self._sock.recv(self._rbufsize) <-- we're stuck here
KeyboardInterrupt

I tried opening the URL in a browser and it worked fine — I got a response within a second. So I started adjusting the request in code to mimic a working browser. My first thought was to use a valid User-Agent, and with the following code I immediately received a response:

data = requests.get(
    'http://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=10%2F17%2F2017&DateTo=04%2F11%2F2018&Division=&GameScope=&GameSegment=&LastNGames=0&LeagueID=00&Location=Home&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=',
    headers={'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_0 like Mac OS X) AppleWebKit/602.1.38 (KHTML, like Gecko) Version/10.0 Mobile/14A300 Safari/602.1'},
).json()

Without a valid User-Agent, some kind of anti-bot defense mechanism presumably leaves the request unanswered.
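This kind of hang can also be contained at the source: requests blocks forever by default, so passing a timeout makes a silent server raise an exception instead of freezing IDLE. A minimal sketch (the fetch_json helper name and the 10-second value are my own choices, not from the question):

```python
import requests

# Reuse the browser-like User-Agent from the snippet above.
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (iPhone; CPU iPhone OS 10_0 like Mac OS X) '
                   'AppleWebKit/602.1.38 (KHTML, like Gecko) '
                   'Version/10.0 Mobile/14A300 Safari/602.1'),
}

def fetch_json(url, timeout=10):
    """GET a URL and return parsed JSON, raising instead of hanging."""
    # Without a timeout, a server that accepts the connection but never
    # answers blocks the recv() call indefinitely, as in the traceback.
    resp = requests.get(url, headers=HEADERS, timeout=timeout)
    resp.raise_for_status()  # turn HTTP error statuses into exceptions
    return resp.json()
```

With this in place, a non-responding server raises requests.exceptions.Timeout after ten seconds instead of leaving the script stuck.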

A few other notes on the code snippet:

for item in data:

Iterate over pk instead of data — looping over the dict yields its keys, not the result sets.

team = item.get['rowSet']

Use item['rowSet'] or item.get('rowSet'), but don't mix them: item.get is a function, so [] can't be applied to it.

my_data.append([Team_Id, Team_Name])

The indentation should match the line above, so the append happens inside the loop rather than once after it.
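Putting those three fixes together, the parsing loop might look like the sketch below. The data dict here is a stand-in for req.json() with two made-up rows, so the sketch runs without hitting the server; the real response nests its rows the same way under resultSets and rowSet:

```python
# Stand-in for req.json(): a dict whose 'resultSets' entry is a list of
# result-set dicts, each holding its rows under the 'rowSet' key.
data = {
    'resultSets': [
        {'rowSet': [[1610612737, 'Atlanta Hawks'],
                    [1610612738, 'Boston Celtics']]},
    ]
}

my_data = []
pk = data['resultSets']          # fix 1: iterate pk, not the outer dict
for item in pk:
    for row in item['rowSet']:   # fix 2: index with [], don't mix in .get
        Team_Id = row[0]
        Team_Name = row[1]
        my_data.append([Team_Id, Team_Name])  # fix 3: append inside the loop
```

After the loop, my_data holds one [id, name] pair per row and can be written out with csv.writer exactly as in the question.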