I am scraping two columns from a table and looping the script over the HTML pages (the table spans 19 pages). However, when I enter what should be the range of pages to loop over, it gets applied as the range of rows to collect.
What am I doing wrong in my loop, such that the range sets the rows of data collected INSTEAD of setting the range of HTML pages I want to scrape?
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv

empty_list = []
for i in range(1, 19):
    url = requests.get("https://www.foxsports.com/nhl/stats?season=2017&category=SCORING&group=1&sort=3&time=0&pos=0&team=0&qual=1&sortOrder=0&page={}".format(i))
    if not url.ok:
        continue
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    table = soup.find('table', {'class': 'wisbb_standardTable'})
    player = table.find('a', {'class': 'wisbb_fullPlayer'}).find('span').text
    team = table.find('span', {'class': 'wisbb_tableAbbrevLink'}).find('a').text
    empty_list.append((player, team))

df = pd.DataFrame(empty_list, columns=["player", "team"])
df
Answer 0 (score: 1)
When you use find, it returns only the first matching element. You should use find_all instead: it gives you a list of all matching elements, and you can then call find on each element in that list to extract the data you need. As written, you are only grabbing the first player/team pair from each of the range(1, n) pages.
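The first-match-only behavior of find versus the collect-everything behavior of find_all is analogous to `re.search` versus `re.findall` in the standard library; here is a minimal stand-alone sketch of that distinction (using `re` on a toy HTML string, not BeautifulSoup on the real page):

```python
import re

# Toy markup with three anchor tags, standing in for the player links.
html = "<a>Player A</a><a>Player B</a><a>Player C</a>"

# re.search, like BeautifulSoup's find, stops at the first match.
first = re.search(r"<a>(.*?)</a>", html).group(1)

# re.findall, like find_all, collects every match.
all_matches = re.findall(r"<a>(.*?)</a>", html)

print(first)        # Player A
print(all_matches)  # ['Player A', 'Player B', 'Player C']
```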
This code appears to give you what you need:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv

empty_list = []
for i in range(1, 19):
    url = requests.get("https://www.foxsports.com/nhl/stats?season=2017&category=SCORING&group=1&sort=3&time=0&pos=0&team=0&qual=1&sortOrder=0&page={}".format(i))
    if not url.ok:
        continue
    data = url.text
    soup = BeautifulSoup(data, 'lxml')
    table = soup.find('table', {'class': 'wisbb_standardTable'})
    player = table.find_all('a', {'class': 'wisbb_fullPlayer'})
    team = table.find_all('span', {'class': 'wisbb_tableAbbrevLink'})
    player_team_data = [{"player": p.text.split('\n')[1], "team": t.text.strip('\n')}
                        for p, t in zip(player, team)]
    for p in player_team_data:
        empty_list.append(p)

df = pd.DataFrame(empty_list, columns=["player", "team"])
df.shape
(900, 2)
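The list comprehension above pairs each player cell with its team cell via `zip` and then strips the surrounding newlines from the extracted text. With stand-in strings (the real cell text comes from the scraped HTML, so the exact values below are only illustrative), the cleanup step works like this:

```python
# Stand-in cell text mimicking what .text yields for the scraped nodes;
# the real values depend on the live page markup.
players_text = ["\nConnor McDavid\nC", "\nSidney Crosby\nC"]
teams_text = ["\nEDM", "\nPIT"]

# split('\n')[1] keeps the middle segment (the name); strip('\n') trims the team code.
rows = [{"player": p.split('\n')[1], "team": t.strip('\n')}
        for p, t in zip(players_text, teams_text)]
print(rows)
# [{'player': 'Connor McDavid', 'team': 'EDM'}, {'player': 'Sidney Crosby', 'team': 'PIT'}]
```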