I am trying to make a DataFrame from this website: http://mcubed.net/ncaab/seeds.shtml
I want to get these lists into a DataFrame so I can look at the historical record of every seed in the NCAA tournament. I am not familiar with web scraping, and typing the data in by hand would take a while, so I am wondering whether there is an easier way than building this DataFrame manually.
I tested the idea by building my own data frame, planning to enter the data from the website by hand, but that is a slow process:
import pandas as pd
data= {"History of 1 Seed":["1 seed versus 1 seed"],
"History of 2 Seed":["2 seed versus 1 seed"],
"History of 3 Seed":["3 seed versus 1 seed"],
"History of 4 Seed":["4 seed versus 1 seed"],
"History of 5 Seed":["5 seed versus 1 seed"],
"History of 6 Seed":["6 seed versus 1 seed"],
"History of 7 Seed":["7 seed versus 1 seed"],
"History of 8 Seed":["8 seed versus 1 seed"],
"History of 9 Seed":["9 seed versus 1 seed"],
"History of 10 Seed":["10 seed versus 1 seed"],
"History of 11 Seed":["11 seed versus 1 seed"],
"History of 12 Seed":["12 seed versus 1 seed"],
"History of 13 Seed":["13 seed versus 1 seed"],
"History of 14 Seed":["14 seed versus 1 seed"],
"History of 15 Seed":["16 seed versus 1 seed"],
"History of 16 Seed":["16 seed versus 1 seed"]
}
df1= pd.DataFrame(data)
df1
I created the data frame, but I am not sure how to enter the values into it, and I am hoping there is an easier way to do this. Thanks.
Answer 0 (score: 0)
Parsing the website
The first step is to parse the website and put the information into a DataFrame, or a list of DataFrames. Here we use requests to fetch the page text and BeautifulSoup to parse the html. The difficulty with your particular website is that the tables are plain text rather than actual html table elements, so we have to do things a little differently than usual.
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from io import StringIO
url = 'http://mcubed.net/ncaab/seeds.shtml'
#Getting the website text
data = requests.get(url).text
#Parsing the website
soup = BeautifulSoup(data, "html5lib")
#Create an empty list
dflist = []
#If we look at the html, we don't want the tag b, but what's next to it
#StringIO(b.next.next) takes that text and makes it readable to pandas
for b in soup.findAll("b")[2:-1]:
    dflist.append(pd.read_csv(StringIO(b.next.next), sep=r'\s+', header=None))
dflist[0]
     0   1        2      3
0  vs.  #1  (23-23)  50.0%
1  vs.  #2  (40-35)  53.3%
2  vs.  #3  (25-15)  62.5%
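If you want to check which blocks of text the [2:-1] slice actually picks up (the exact offsets depend on the page layout, so treat this as an optional sanity check rather than part of the answer), you can print the first line of the text node sitting next to each <b> tag before parsing it. This sketch reuses the soup object created above:
#Optional sanity check: print the first line of the text that follows
#each <b> tag so you can confirm the [2:-1] slice lines up with the
#sixteen per-seed tables (offsets assume the current page layout)
for i, b in enumerate(soup.findAll("b")[2:-1], start=1):
    lines = str(b.next.next).strip().splitlines()
    print(i, lines[0] if lines else "(empty)")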
Cleaning and merging the DataFrames
Next we need to format all of the DataFrames in the list. I also decided to merge them all, with the team name in one column and the opposing seed (VS) in another. That makes it easy to filter for whatever information we need.
#We need a new list, because after the melt we can't simply replace
#the dataframes inside dflist
meltedDF = []
#The second item in the loop is the team number, starting from 1
for df, teamnumber in zip(dflist, np.arange(len(dflist)) + 1):
    #Creating the team name
    name = "Team " + str(teamnumber)
    #Making the team name a column, built from the values in df[0] and df[1]
    df[name] = df[0] + df[1]
    #Melting the dataframe so the team name becomes its own column
    meltedDF.append(df.melt(id_vars=[0, 1, 2, 3]))
# Concat all the melted DataFrames
allTeamStats = pd.concat(meltedDF)
# Final cleaning of our new single DataFrame
allTeamStats = allTeamStats.rename(columns={0: name, 2: 'Record', 3: 'Win Percent',
                                            'variable': 'Team', 'value': 'VS'})\
                           .reindex(['Team', 'VS', 'Record', 'Win Percent'], axis=1)
allTeamStats.head()
     Team     VS   Record Win Percent
0  Team 1  vs.#1  (23-23)       50.0%
1  Team 1  vs.#2  (40-35)       53.3%
2  Team 1  vs.#3  (25-15)       62.5%
3  Team 1  vs.#4  (53-22)       70.7%
4  Team 1  vs.#5   (45-9)       83.3%
Querying the new DF
Now that all of the information is in one DataFrame, we can filter it to pull out whatever we need!
allTeamStats[allTeamStats['VS'] == 'vs.#1'].head()
     Team     VS   Record Win Percent
0  Team 1  vs.#1  (23-23)       50.0%
0  Team 2  vs.#1  (35-40)       46.7%
0  Team 3  vs.#1  (15-25)       37.5%
0  Team 4  vs.#1  (22-53)       29.3%
0  Team 5  vs.#1   (9-45)       16.7%
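Notice that the index value 0 repeats in the output above: pd.concat kept each melted frame's own 0-based index. If you prefer a unique index on the combined frame, you can reset it (purely optional, it does not change any values):
#Optional: give the concatenated frame a clean 0..n-1 index
allTeamStats = allTeamStats.reset_index(drop=True)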
If you want an easier way to look at a team's wins and losses, we can go one step further and create two new columns, splitting the wins and losses out of Record.
allTeamStats['Win'] = allTeamStats['Record'].str.extract(r'\((\d+)')
allTeamStats['Lose'] = allTeamStats['Record'].str.extract(r'\(\d+-(\d+)')
allTeamStats.head()
     Team     VS   Record Win Percent Win Lose
0  Team 1  vs.#1  (23-23)       50.0%  23   23
1  Team 1  vs.#2  (40-35)       53.3%  40   35
2  Team 1  vs.#3  (25-15)       62.5%  25   15
3  Team 1  vs.#4  (53-22)       70.7%  53   22
4  Team 1  vs.#5   (45-9)       83.3%  45    9
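As a possible next step (just a sketch, which assumes the Record pattern matched every row), you can make the new columns numeric and aggregate them, for example summing each seed's head-to-head wins and losses:
#Convert the extracted strings to numbers; pd.to_numeric leaves any
#non-matching rows as NaN instead of raising an error
allTeamStats['Win'] = pd.to_numeric(allTeamStats['Win'])
allTeamStats['Lose'] = pd.to_numeric(allTeamStats['Lose'])
#Example: total wins and losses recorded for each seed across all matchups
print(allTeamStats.groupby('Team')[['Win', 'Lose']].sum())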