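I'm a Python beginner, and I'm trying to use it to scrape data from https://www.spotrac.com/nfl/arizona-cardinals/sam-bradford-6510/cash-earnings/ (and other pages like it).
All I really need is the player's name (here, Sam Bradford) and the total cash value at the end of each year. So essentially a table with the year first and then the dollar amount.
I've used BeautifulSoup to get output, and after adapting some code I get terminal output that looks like a table. My end goal, though, is to save it as a csv or xlsx so I can move it into a program like Stata. Ideally, I'd like to automate this for every such page on the site.
My code so far: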
from bs4 import BeautifulSoup
from bs4 import SoupStrainer
import urllib.request
import csv
import pandas as pd
import requests
from tabulate import tabulate
url = "https://www.spotrac.com/nfl/arizona-cardinals/sam-bradford-6510/cash-earnings/"
markup = urllib.request.urlopen(url).read()
soup = BeautifulSoup(markup, "lxml")
name_box = soup.title.text.strip()
#print(name_box)
earnings_table = soup.find('table', class_ = "earningstable")
#print(earnings_table.get_text())
rows = earnings_table.find_all('tr')
for row in rows:
    cols=row.find_all('td')
    cols=[x.text.strip() for x in cols]
    print(cols)
with open('test.csv', 'a') as csv_file:
    #writer = csv.writer(csv_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer = csv.writer(open("/path/SamBradford.csv", 'w'))
    writer.writerow([name_box, cols])
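This gives me a csv, but the salary data all ends up in a single column, which isn't helpful.
Any help with saving this file and then automating it for the other pages on the site would be much appreciated.
Answer 0 (score: 0)
Here is one possible solution: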
import io
import urllib.request

import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.spotrac.com/nfl/arizona-cardinals/sam-bradford-6510/cash-earnings/"
markup = urllib.request.urlopen(url).read()
soup = BeautifulSoup(markup, "lxml")
name_box = soup.title.text.strip()

# pandas can parse an HTML table straight into a DataFrame and save it as csv.
# The catch is that this table has two <tbody> elements, so I rebuild a table
# from the <thead> and the first <tbody> only ('find' returns the first match).
earnings_table = soup.find('table', class_="earningstable")
tbody = earnings_table.find('tbody')
thead = earnings_table.find('thead')
table = '<table>' + str(thead) + str(tbody) + '</table>'
# flavor="bs4" needs html5lib installed; newer pandas versions expect a file-like object
df = pd.read_html(io.StringIO(table), flavor="bs4")[0]
df.to_csv('test.csv', index=False)
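If you would rather stay with the csv module, the single-column result in the question appears to come from calling writerow once, after the loop, with the whole `cols` list as a single cell. Below is a minimal sketch of writing one line per row instead. It assumes each scraped row is a list of cell strings with the year first and the cash total last; the `write_earnings` helper and the sample figures are hypothetical, not taken from the site.

```python
import csv
import io


def write_earnings(out_file, player_rows):
    """Write one CSV line per (player, year) pair.

    player_rows maps a player name to a list of scraped rows, where each
    row is a list of cell strings (assumed: year first, cash total last).
    """
    writer = csv.writer(out_file)
    writer.writerow(["player", "year", "cash"])
    for player, rows in player_rows.items():
        for cols in rows:
            if not cols:  # header rows hold <th> cells, so find_all('td') is empty
                continue
            # Strip "$" and thousands separators so Stata reads a number.
            cash = cols[-1].replace("$", "").replace(",", "")
            writer.writerow([player, cols[0], cash])


# Made-up sample rows standing in for the output of the scraping loop.
sample = {
    "Sam Bradford": [[], ["2010", "$1,000"], ["2011", "$2,500"]],
}

buffer = io.StringIO()
write_earnings(buffer, sample)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk, pass a file opened with `open("earnings.csv", "w", newline="")` in place of the buffer. For several pages, scrape each URL in a loop and add one entry per player to the dictionary before writing.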
Answer 1 (score: 0)
Try pandas, an excellent Python package for building csv files and tables.
Here is a starting point:
import pandas as pd
data = [
    {
        'column a': "row 1 column a",
        'column b': "row 1 column b",
        'column c': "row 1 column c",
    },
    {
        'column a': "row 2 column a",
        'column b': "row 2 column b",
        'column c': "row 2 column c",
    },
]
df = pd.DataFrame(data)
df.to_csv("output.csv")
Contents of output.csv:
,column a,column b,column c
0,row 1 column a,row 1 column b,row 1 column c
1,row 2 column a,row 2 column b,row 2 column c
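The unnamed first column in output.csv is the DataFrame index. Here is a small sketch of the same approach with `index=False`, so that only the named columns are written. The player/year/cash layout follows what the question asks for, and the values are invented for illustration.

```python
import io

import pandas as pd

# Made-up rows in the shape the question asks for: one row per year.
data = [
    {"player": "Sam Bradford", "year": 2010, "cash": 1000},
    {"player": "Sam Bradford", "year": 2011, "cash": 2500},
]

df = pd.DataFrame(data)

# index=False drops the 0, 1, ... row labels from the file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a file name such as "output.csv" instead of the buffer writes the same text to disk. pandas also has `df.to_excel` (which needs an Excel writer package such as openpyxl) and `df.to_stata`, which may save a conversion step for Stata.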