Duplicate header row when parsing a web page with Python

Time: 2018-04-09 19:05:57

Tags: python parsing header

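I am following a video tutorial on web parsing. The website has changed since the video was made, so I had to add a couple of lines, and now the CSV file that the script creates has two header rows. Can someone help me figure out what I need to do to correct this? Thanks! Here is my code: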

import urllib
import urllib.request
from bs4 import BeautifulSoup
import os

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage,"html.parser")
    return soupdata

playerdatasaved = ""
soup = make_soup("https://www.basketball-reference.com/players/a/")

for record in soup.findAll('tr'):
    playerdata = ""

    for data in record.findAll('th'):                # added this line
        playerdata = playerdata + "," + data.text    # added this line
    for data in record.findAll('td'):
        playerdata = playerdata + "," + data.text

    if len(playerdata) != 0:
        playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

header = "Player, From, To, Pos, Ht, Wt, Birth Date, Colleges"

file = open(os.path.expanduser("basketball.csv"),"wb")
file.write(bytes(header, encoding="ascii", errors='ignore'))
file.write(bytes(playerdatasaved,"ascii", errors='ignore'))

The header of the CSV file shows the following:

Player From To Pos Ht Wt Birth Date Colleges Player From To Pos Ht Wt Birth Date Colleges

I have tried removing the header variable and the header from the file write commands, but to no avail. Thanks!

1 answer:

Answer 0: (score: 0)

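As I said in the comments, you need to remove one of the two sets of headers. The table's own header row is made of 'th' cells, so the 'th' loop you added already scrapes it, and the hard-coded header then duplicates it. It is best to drop the one in your code and keep the one from the web page. Just delete the following lines: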

header = "Player, From, To, Pos, Ht, Wt, Birth Date, Colleges"
file.write(bytes(header, encoding="ascii", errors='ignore'))
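Once those two lines are gone, the output still starts with an empty line, because every row is prefixed with "\n", and any cell that contains a comma gets split across columns. A minimal sketch of one way around both, assuming the rows are first collected as lists of cell text; the helper name write_rows and the sample data are hypothetical, not taken from the site:

```python
import csv
import io


def write_rows(rows, out):
    """Write each non-empty row of cell texts to `out` as one CSV line."""
    writer = csv.writer(out, lineterminator="\n")
    for cells in rows:
        if cells:  # skip rows that had no th/td cells
            writer.writerow(cells)


# Hypothetical sample standing in for the scraped table. With BeautifulSoup,
# each row could be collected inside the loop over 'tr' elements as:
#   [cell.get_text() for cell in record.find_all(["th", "td"])]
sample_rows = [
    # The header row comes from the page itself, so no hard-coded header.
    ["Player", "From", "To", "Pos", "Ht", "Wt", "Birth Date", "Colleges"],
    ["Example Player", "1991", "1995", "F-C", "6-10", "240",
     "June 24, 1968", "Example College"],
    [],  # a row without cells, which is skipped
]

buffer = io.StringIO()
write_rows(sample_rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of an in-memory buffer, the same function could be called with a file opened as open("basketball.csv", "w", newline="", encoding="utf-8"). The csv module quotes the birth date because it contains a comma, so it stays in a single column.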