I have written code that extracts the data from the first page, but I run into a problem when trying to extract data from all the pages.
Here is my code for extracting data from the "a" page:
from bs4 import BeautifulSoup
import urllib
import urllib.request
import os
from string import ascii_lowercase

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

playerdatasaved = ""
soup = make_soup('https://www.basketball-reference.com/players/a/')
for record in soup.findAll("tr"):
    playerdata = ""
    for data in record.findAll(["th","td"]):
        playerdata = playerdata + "," + data.text
    playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

print(playerdatasaved)
header = "player, from, to, position, height, weight, dob, year, colleges" + "\n"
file = open(os.path.expanduser("basketballstats.csv"),"wb")
file.write(bytes(header, encoding = "ascii", errors = "ignore"))
file.write(bytes(playerdatasaved[1:], encoding = "ascii", errors = "ignore"))
Now, to loop through all the pages, my logic was this code:
from bs4 import BeautifulSoup
import urllib
import urllib.request
import os
from string import ascii_lowercase

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

playerdatasaved = ""
for letter in ascii_lowercase:
    soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")
    for record in soup.findAll("tr"):
        playerdata = ""
        for data in record.findAll(["th","td"]):
            playerdata = playerdata + "," + data.text
        playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

header = "player, from, to, position, height, weight, dob, year, colleges" + "\n"
file = open(os.path.expanduser("basketball.csv"),"wb")
file.write(bytes(header, encoding = "ascii", errors = "ignore"))
file.write(bytes(playerdatasaved[1:], encoding = "ascii", errors = "ignore"))
However, this runs into an error on the line: soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")
Answer 0 (score: 1)
I tried to run your code and ran into an SSL certificate error, CERTIFICATE_VERIFY_FAILED, which seems to be related to the website you are trying to scrape rather than to your code.
Maybe this thread can help clear things up: "SSL: certificate_verify_failed" error when scraping https://www.thenewboston.com/
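If you hit the same CERTIFICATE_VERIFY_FAILED error locally, a minimal sketch of one common workaround is to pass an unverified SSL context to urlopen. This is an assumption about your setup, and it disables certificate checks entirely, so treat it as a stopgap for a one-off scrape of a site you trust:

import ssl
import urllib.request
from bs4 import BeautifulSoup

# Disables HTTPS certificate verification -- acceptable only for a throwaway scrape.
unverified_context = ssl._create_unverified_context()

def make_soup(url):
    thepage = urllib.request.urlopen(url, context=unverified_context)
    return BeautifulSoup(thepage, 'html.parser')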
Answer 1 (score: 0)
for letter in ascii_lowercase:
    soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")
With the URLs given, I ran into a 404 error when letter = 'x'. It seems that player index does not exist, so make sure you handle that case when looping over the letters, for example as sketched below.
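One way to handle the missing index (a sketch, assuming you want to skip the letter rather than stop) is to catch the 404 as an HTTPError and continue:

import urllib.error

for letter in ascii_lowercase:
    try:
        soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")
    except urllib.error.HTTPError as err:
        # The "x" index returns 404, so report it and move on to the next letter.
        print("Skipping", letter, ":", err)
        continue
    # ... process soup here as in your original loop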
Answer 2 (score: 0)
Agreed with Eman. The page for x is not available. Just use a try-except block to ignore that page:
try:
    soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")
    for record in soup.findAll("tr"):
        playerdata = ""
        for data in record.findAll(["th","td"]):
            playerdata = playerdata + "," + data.text
        playerdatasaved = playerdatasaved + "\n" + playerdata[1:]
except Exception as e:
    print(e)
Answer 3 (score: 0)
To fix your code, the first thing to do is convert ascii_lowercase to a string, so that we can run soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/") without a major exception. Just change your first for loop to this: for letter in str(ascii_lowercase):
The next thing is to handle the exception raised when a page cannot be found. For example, "https://www.basketball-reference.com/players/x/" does not exist. For that, we can use try/except.
Last but not least, you have to skip the first row of each table, otherwise your file will contain many Player,From,To,Pos,Ht,Wt,Birth,Date,Colleges lines. So do this:
for table in soup.findAll("tbody"):
    for record in table.findAll("tr"):

instead of this:

for record in soup.findAll("tr"):
Here is the whole thing working:
from bs4 import BeautifulSoup
import urllib
import urllib.request
import os
from string import ascii_lowercase

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

playerdatasaved = ""
for letter in str(ascii_lowercase):
    print(letter)  # I added this to see the magic happening
    try:
        soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")
        # Loop over the table body so the repeated header row is skipped
        for table in soup.findAll("tbody"):
            for record in table.findAll("tr"):
                playerdata = ""
                for data in record.findAll(["th","td"]):
                    playerdata = playerdata + "," + data.text
                playerdatasaved = playerdatasaved + "\n" + playerdata[1:]
    except:
        # Pages like the "x" index do not exist, so just skip them
        pass

header = "player, from, to, position, height, weight, dob, year,colleges" + "\n"
file = open(os.path.expanduser("basketball.csv"),"wb")
file.write(bytes(header, encoding = "ascii", errors = "ignore"))
file.write(bytes(playerdatasaved[1:], encoding = "ascii", errors = "ignore"))
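As a side note (not part of the answers above, and assuming the standard csv module is acceptable): fields such as the colleges column can themselves contain commas, so joining cells with "," by hand can misalign columns. A sketch of the same write step using csv.writer, where soup is the object returned by make_soup above, would be:

import csv
import os

# Collect each table row as a list of cell strings instead of a comma-joined string.
rows = []
for table in soup.findAll("tbody"):
    for record in table.findAll("tr"):
        rows.append([cell.text for cell in record.findAll(["th","td"])])

header = ["player", "from", "to", "position", "height", "weight", "dob", "year", "colleges"]
with open(os.path.expanduser("basketball.csv"), "w", newline="", encoding="ascii", errors="ignore") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)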