Error when trying to loop through web pages to scrape data

Time: 2019-02-19 08:11:44

Tags: python python-3.x

I have written code that pulls the data from the first page, but I am running into problems when I try to pull the data from all of the pages.

Here is my code for pulling the data from the "a" page:

from bs4 import BeautifulSoup
import urllib
import urllib.request
import os
from string import ascii_lowercase


def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

playerdatasaved = ""

soup = make_soup('https://www.basketball-reference.com/players/a/')

for record in soup.findAll("tr"): 
    playerdata = "" 
    for data in record.findAll(["th","td"]): 
        playerdata = playerdata + "," + data.text 

    playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

print(playerdatasaved)

header = "player, from, to, position, height, weight, dob, year, 
colleges"+"\n"
file = open(os.path.expanduser("basketballstats.csv"),"wb")
file.write(bytes(header, encoding = "ascii", errors = "ignore"))
file.write(bytes(playerdatasaved[1:], encoding = "ascii", errors = "ignore"))

Now, to loop through the pages, my logic is this code:

from bs4 import BeautifulSoup
import urllib
import urllib.request
import os
from string import ascii_lowercase

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

playerdatasaved = ""
for letter in ascii_lowercase:
    soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")
    for record in soup.findAll("tr"):
        playerdata = "" 
        for data in record.findAll(["th","td"]): 
            playerdata = playerdata + "," + data.text 

        playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

header = "player, from, to, position, height, weight, dob, year, 
colleges"+"\n"
file = open(os.path.expanduser("basketball.csv"),"wb")
file.write(bytes(header, encoding = "ascii", errors = "ignore"))
file.write(bytes(playerdatasaved[1:], encoding = "ascii", errors = "ignore"))

However, this runs into an error related to this line: soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")

4 Answers:

Answer 0 (score: 1)

I tried to run your code and ran into an SSL certificate error, CERTIFICATE_VERIFY_FAILED, which seems to be related to the website you are trying to scrape rather than to your code.

Maybe this Stack Overflow question can help clear things up: "SSL: certificate_verify_failed" error when scraping https://www.thenewboston.com/
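If you are hitting the same CERTIFICATE_VERIFY_FAILED error, the usual workaround from that linked question is to hand urlopen an unverified SSL context. A minimal sketch of the asker's make_soup adapted that way (note that this disables certificate checking, so it is only a stopgap for a one-off scrape):

from bs4 import BeautifulSoup
import ssl
import urllib.request

def make_soup(url):
    # Build a context that skips certificate verification, then fetch and parse the page.
    context = ssl._create_unverified_context()
    thepage = urllib.request.urlopen(url, context=context)
    return BeautifulSoup(thepage, 'html.parser')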

Answer 1 (score: 0)

for letter in ascii_lowercase:
    soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")

With the URLs as given, a 404 error comes up when letter = 'x'. It seems that player index does not exist, so make sure you handle that case when looping over the letters.
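For instance, one way to handle that case is to catch the urllib.error.HTTPError that urlopen raises for the missing index page and skip that letter. A rough sketch, reusing the asker's make_soup helper and loop:

import urllib.error
from string import ascii_lowercase

for letter in ascii_lowercase:
    url = "https://www.basketball-reference.com/players/" + letter + "/"
    try:
        soup = make_soup(url)
    except urllib.error.HTTPError as err:
        # The index for 'x' returns 404, so urlopen raises HTTPError; skip that page.
        print("Skipping", url, "(HTTP", str(err.code) + ")")
        continue
    # ... extract the rows from soup as before ...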

Answer 2 (score: 0)

Agreed with Eman. The page for 'x' is not available. Just use a try-except block to ignore that page.

    try:
        soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")
        for record in soup.findAll("tr"):
            playerdata = "" 
            for data in record.findAll(["th","td"]): 
                playerdata = playerdata + "," + data.text 

            playerdatasaved = playerdatasaved + "\n" + playerdata[1:]
    except Exception as e:
        print(e)

Answer 3 (score: 0)

The first thing we want to do to fix your code is to convert ascii_lowercase to a string so that we can run soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/") without a major exception. Just change your first for to this: for letter in str(ascii_lowercase):

The next thing is to handle the exception that comes up when a page cannot be found. For example, "https://www.basketball-reference.com/players/x/" does not exist. For that we can use try/except.

Last but not least, you have to skip the first row of each table, otherwise you will end up with a lot of Player,From,To,Pos,Ht,Wt,Birth,Date,Colleges lines in your file. So do this:

for table in soup.findAll("tbody"):
    for record in table.findAll("tr"):

Instead of this:

for record in soup.findAll("tr"):

Here is the whole thing working:

from bs4 import BeautifulSoup
import urllib
import urllib.request
import os
from string import ascii_lowercase

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

playerdatasaved = ""
for letter in str(ascii_lowercase):
    print(letter) # I added this to see the magic happening
    try:
        soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")
        for table in soup.findAll("tbody"):
            for record in table.findAll("tr"):
                playerdata = ""
                for data in record.findAll(["th","td"]):
                    playerdata = playerdata + "," + data.text

                playerdatasaved = playerdatasaved + "\n" + playerdata[1:]
    except:
        pass

header = "player, from, to, position, height, weight, dob, year,colleges"+"\n"
file = open(os.path.expanduser("basketball.csv"),"wb")
file.write(bytes(header, encoding = "ascii", errors = "ignore"))
file.write(bytes(playerdatasaved[1:], encoding = "ascii", errors = "ignore"))
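
A small follow-up on the writing step: none of the snippets above ever close the file, so the CSV may not be fully flushed until the interpreter exits. A with block (shown here with the same header and playerdatasaved variables) closes the file automatically:

header = "player, from, to, position, height, weight, dob, year, colleges" + "\n"

# Same "wb" + explicit-encoding approach as above; the with block guarantees
# the file is flushed and closed even if one of the writes fails.
with open(os.path.expanduser("basketball.csv"), "wb") as f:
    f.write(bytes(header, encoding="ascii", errors="ignore"))
    f.write(bytes(playerdatasaved[1:], encoding="ascii", errors="ignore"))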