I'm trying to pull the data from this URL and get it into a format that works in Excel, but I'm stuck. With the code below I managed to split the data into rows, but for some reason they don't line up with the row numbers. Can anyone help?
import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv
#--------------------------------------------------------------------------------------------------------------------------------------------------#
url = 'http://rotoguru1.com/cgi-bin/hoopstat-daterange.pl?startdate=20181021&date=20181021&saldate=20181021&g=0&ha=&min=&tmptmin=0&tmptmax=999&opptmin=0&opptmax=999&gmptmin=0&gmptmax=999&gameday=&sd=0'
#--------------------------------------------------------------------------------------------------------------------------------------------------#
page_request = requests.get(url)
soup = BeautifulSoup(page_request.text,'lxml')
data = []
for br in soup.findAll('br')[3:][:-1]:
    data.append(br.nextSibling)
data_df = pd.DataFrame(data)
print(data_df)
Printed output:
                                                   0
0  4943;Abrines, Alex;0;Abrines, Alex;okc;1;0;5....
1  5709;Adams, Jaylen;0;Adams, Jaylen;atl;1;0;0....
2  4574;Adams, Steven;2991235;Adams, Steven;okc;...
3  5696;Akoon-Purcell, DeVaughn;0;Akoon-Purcell,...
4  4860;Anderson, Justin;0;Anderson, Justin;atl;...
5  3510;Anthony, Carmelo;1975;Anthony, Carmelo;h...
Answer 0 (score: 1)
I believe the last row of the DataFrame is empty because of your parser: at the last position in the list it still checks the next sibling after the <br> and appends the blank line to the DataFrame. This should fix it:
for br in soup.findAll('br')[3:][:-1]:
    contents = br.nextSibling
    # skip the bare newline that follows the last data row
    if contents != "\n":
        data.append(contents)
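As a follow-up, to get this into something Excel can open, one option is to split each row on the semicolons and write the result out as a CSV. This is only a minimal sketch assuming the semicolon-delimited layout shown in the printed output; the 'stats.csv' filename is a placeholder, not something from the original question:

# split each semicolon-delimited string into a list of column values
rows = [str(line).strip().split(';') for line in data]
data_df = pd.DataFrame(rows)

# Excel opens CSV files directly; header=False because the real
# column names are not known here
data_df.to_csv('stats.csv', index=False, header=False)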