How do I save BeautifulSoup output as a CSV?

Time: 2019-12-09 22:06:48

Tags: python beautifulsoup

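I'm a Python beginner, and I'm trying to use it to scrape data from https://www.spotrac.com/nfl/arizona-cardinals/sam-bradford-6510/cash-earnings/ (and other pages like it).

All I really need is the player's name (here, Sam Bradford) and the total cash value at the end of each year. So basically a table with the year first and then the dollar amount.

I've used BeautifulSoup to get the output and, after adapting some code, I get terminal output that looks roughly like a table. But my end goal is to save it as a CSV or XLSX file so that I can move it into a program like Stata. Ideally, I'd like to automate this for every such page on the site.

My code so far: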

from bs4 import BeautifulSoup
from bs4 import SoupStrainer
import urllib.request
import csv
import pandas as pd
import requests
from tabulate import tabulate

url = "https://www.spotrac.com/nfl/arizona-cardinals/sam-bradford-6510/cash-earnings/"

markup = urllib.request.urlopen(url).read()

soup = BeautifulSoup(markup, "lxml")

name_box = soup.title.text.strip()
#print(name_box)

earnings_table = soup.find('table', class_ = "earningstable")
#print(earnings_table.get_text())
rows = earnings_table.find_all('tr')
for row in rows:
    cols=row.find_all('td')
    cols=[x.text.strip() for x in cols]
    print(cols)

with open('test.csv', 'a') as csv_file:
    #writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer = csv.writer(open("/path/SamBradford.csv", 'w'))
    writer.writerow([name_box, cols])

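This gets me a CSV, but the salary data all ends up in a single column, which isn't helpful.

Any help with saving this file and then automating the process for the other pages on the site would be much appreciated.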

2 Answers:

Answer 0 (score: 0)

Here is one possible solution:

from io import StringIO
import urllib.request

from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.spotrac.com/nfl/arizona-cardinals/sam-bradford-6510/cash-earnings/"
markup = urllib.request.urlopen(url).read()
soup = BeautifulSoup(markup, "lxml")
name_box = soup.title.text.strip()

# Use pandas to parse the HTML table and write it to CSV.
# The table on this page has two <tbody> elements, so rebuild a table
# that contains only the <thead> and the first <tbody>.
# find() returns only the first matching tag.
earnings_table = soup.find('table', class_="earningstable")
tbody = earnings_table.find('tbody')
thead = earnings_table.find('thead')
table = '<table>' + str(thead) + str(tbody) + '</table>'
# Wrap the markup in StringIO: recent pandas versions deprecate
# passing a literal HTML string to read_html().
df = pd.read_html(StringIO(table), flavor="bs4")[0]
df.to_csv('test.csv', index=False)
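
If you would rather stay with the csv module from the question, the single-column result most likely comes from two things in the original code: writer.writerow() is called once, after the loop, so only the last `cols` is written, and `cols` is passed as one list element, so the whole list lands in one cell. Below is a minimal sketch that writes one CSV line per table row instead. The SAMPLE_HTML markup, its dollar values, and the helper names extract_rows and write_rows are made up for illustration so the example runs without network access; the real page's markup may differ.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page (placeholder values, not real data).
SAMPLE_HTML = """
<html><head><title>Sam Bradford</title></head><body>
<table class="earningstable">
  <thead><tr><th>Year</th><th>Cash</th></tr></thead>
  <tbody>
    <tr><td>2010</td><td>$100</td></tr>
    <tr><td>2011</td><td>$250</td></tr>
  </tbody>
</table>
</body></html>
"""


def extract_rows(html):
    """Return (player, rows), where each row is a list of cell strings."""
    soup = BeautifulSoup(html, "html.parser")
    player = soup.title.text.strip()
    table = soup.find("table", class_="earningstable")
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.text.strip() for td in tr.find_all("td")]
        if cells:  # header rows hold <th> cells, so they come back empty
            rows.append(cells)
    return player, rows


def write_rows(player, rows, csv_file):
    """Write one CSV line per table row, each cell in its own column."""
    writer = csv.writer(csv_file)
    for cells in rows:
        # Unpack the cells so they become separate columns.
        writer.writerow([player, *cells])


player, rows = extract_rows(SAMPLE_HTML)
buffer = io.StringIO()
write_rows(player, rows, buffer)
csv_text = buffer.getvalue()
```

To write to disk, pass a real file object instead of the StringIO buffer, for example one opened with open("SamBradford.csv", "w", newline=""). The newline="" argument is what the csv module documentation recommends to avoid blank lines between rows on Windows.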

Answer 1 (score: 0)

Try pandas, an excellent Python package for building CSV files and tables.

Here is a starting point:

import pandas as pd

data = [
    {
        'column a': "row 1 column a",
        'column b': "row 1 column b",
        'column c': "row 1 column c",
    },
    {
        'column a': "row 2 column a",
        'column b': "row 2 column b",
        'column c': "row 2 column c",
    },
]

df = pd.DataFrame(data)
df.to_csv("output.csv")

Contents of output.csv:

,column a,column b,column c
0,row 1 column a,row 1 column b,row 1 column c
1,row 2 column a,row 2 column b,row 2 column c
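
To connect this to the scraping question, here is a rough sketch of how rows collected for several players could be combined into one DataFrame and saved. The helper name rows_to_frame, the column names, and the placeholder rows in `scraped` are assumptions for illustration; in practice the rows would come from looping over the player pages and parsing each table.

```python
import pandas as pd


def rows_to_frame(player, rows, columns):
    """Build a DataFrame from scraped rows and tag each row with the player name."""
    frame = pd.DataFrame(rows, columns=columns)
    frame.insert(0, "player", player)
    return frame


# Made-up placeholder rows standing in for scraper output, keyed by player.
scraped = {
    "Player A": [["2010", "$100"], ["2011", "$250"]],
    "Player B": [["2012", "$300"]],
}

frames = [
    rows_to_frame(name, rows, ["year", "cash"])
    for name, rows in scraped.items()
]
combined = pd.concat(frames, ignore_index=True)

# Strip "$" and "," so the cash column is numeric rather than text.
combined["cash"] = (
    combined["cash"].str.replace(r"[$,]", "", regex=True).astype(int)
)

# index=False leaves out the 0, 1, 2, ... index column shown in the output above.
csv_text = combined.to_csv(index=False)
```

Calling combined.to_csv("earnings.csv", index=False) with a file name (hypothetical here) writes the same content to disk. Converting the cash column to integers up front should make the import into Stata simpler, since the values arrive as numbers instead of strings.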