Python: looping a scraper over web pages

Posted: 2018-01-27 02:45:20

Tags: python pandas

I am scraping two columns from a table and looping the script over the HTML pages (there are 19 pages of the table). However, when I enter what should be the range of web pages to loop over, it instead acts as the range of data rows that get collected.

What am I doing wrong in my loop, such that the range sets the number of data rows collected INSTEAD of the range of HTML pages I want to scrape?

import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
empty_list = []
for i in range (1,19):
    url = requests.get("https://www.foxsports.com/nhl/stats?season=2017&category=SCORING&group=1&sort=3&time=0&pos=0&team=0&qual=1&sortOrder=0&page={}".format(i))
    if not url.ok:
        continue
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    table = soup.find('table', {'class' : 'wisbb_standardTable'})
    player = table.find('a', {'class':'wisbb_fullPlayer'}).find('span').text
    team = table.find('span',{'class':'wisbb_tableAbbrevLink'}).find('a').text
    empty_list.append((player, team))
df = pd.DataFrame(empty_list, columns=["player", "team"])
df

(sample table data)

1 Answer:

Answer 0 (score: 1)

When you use find, it returns only the first matching element. You should use find_all instead, which gives you a list of all matching elements; you can then call find on each element in that list to get the data you need. As written, your loop grabs only the first (player, team) pair from each of the range(1, n) pages.
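The difference is easiest to see on a small standalone snippet (using an inline HTML string rather than the live page, with hypothetical player names):

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td><a class="wisbb_fullPlayer"><span>Player One</span></a></td></tr>
  <tr><td><a class="wisbb_fullPlayer"><span>Player Two</span></a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find returns only the FIRST matching element
first = soup.find("a", {"class": "wisbb_fullPlayer"})
print(first.find("span").text)  # Player One

# find_all returns a list of ALL matching elements
names = [a.find("span").text
         for a in soup.find_all("a", {"class": "wisbb_fullPlayer"})]
print(names)  # ['Player One', 'Player Two']
```

So a loop that appends only `table.find(...)` results will collect exactly one row per page, which is why the page range looked like a row range.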

This code seems to give you what you want:

import requests
from bs4 import BeautifulSoup
import pandas as pd
empty_list = []
for i in range(1, 19):
    url = requests.get("https://www.foxsports.com/nhl/stats?season=2017&category=SCORING&group=1&sort=3&time=0&pos=0&team=0&qual=1&sortOrder=0&page={}".format(i))
    if not url.ok:
        continue
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    table = soup.find('table', {'class' : 'wisbb_standardTable'})
    player = table.find_all('a', {'class':'wisbb_fullPlayer'})
    team = table.find_all('span',{'class':'wisbb_tableAbbrevLink'})
    player_team_data = [{"player":p.text.split('\n')[1], "team":t.text.strip('\n')} for p,t in zip(player,team)]
    for p in player_team_data:
        empty_list.append(p)
df = pd.DataFrame(empty_list, columns=["player", "team"])

df.shape

(900, 2)

Note that range(1, 19) covers pages 1 through 18 only (18 pages × 50 rows = 900); since there are 19 pages, use range(1, 20) to include the last page.
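As a side note, because empty_list ends up holding dicts, pandas can build the frame directly from the dict keys; a minimal sketch with hypothetical rows shaped like the scraper's output:

```python
import pandas as pd

# hypothetical sample rows, same shape as the scraped dicts
rows = [{"player": "Player One", "team": "BOS"},
        {"player": "Player Two", "team": "TOR"}]

# the columns argument selects and orders the dict keys
df = pd.DataFrame(rows, columns=["player", "team"])
print(df.shape)  # (2, 2)
```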