How do I get statistics from a website and put them into a Python DataFrame?

Time: 2019-05-17 18:50:56

Tags: python pandas dataframe web-scraping

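I'm trying to build a DataFrame from the following website: http://mcubed.net/ncaab/seeds.shtml

I want to put these lists into a DataFrame and look at the history of each seed in the NCAA tournament. I'm not familiar with web scraping, and entering the data by hand would take a while. Is there an easier way than creating this DataFrame manually?

As a test, I tried building my own DataFrame with the intention of typing in the data from the website by hand, but that is a long process: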

import pandas as pd
data= {"History of 1 Seed":["1 seed versus 1 seed"],
       "History of 2 Seed":["2 seed versus 1 seed"],
       "History of 3 Seed":["3 seed versus 1 seed"],
       "History of 4 Seed":["4 seed versus 1 seed"],
       "History of 5 Seed":["5 seed versus 1 seed"],
       "History of 6 Seed":["6 seed versus 1 seed"],
       "History of 7 Seed":["7 seed versus 1 seed"],
       "History of 8 Seed":["8 seed versus 1 seed"],
       "History of 9 Seed":["9 seed versus 1 seed"],
       "History of 10 Seed":["10 seed versus 1 seed"],
       "History of 11 Seed":["11 seed versus 1 seed"],
       "History of 12 Seed":["12 seed versus 1 seed"],
       "History of 13 Seed":["13 seed versus 1 seed"],
       "History of 14 Seed":["14 seed versus 1 seed"],
       "History of 15 Seed":["15 seed versus 1 seed"],
       "History of 16 Seed":["16 seed versus 1 seed"]

      }
df1= pd.DataFrame(data)
df1

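I created the DataFrame, but I'm not sure how to get the values into it, and I'm hoping there is an easier way to do this. Thanks.

1 Answer:

Answer 0: (score: 0)

Parsing the website

The first step is to parse the website and put the information into a DataFrame or a series of DataFrames. Here we use requests to fetch the text and BeautifulSoup to parse the HTML. The difficulty with your particular site is that the tables are plain text rather than dedicated HTML elements such as <table>, so we have to do things a little differently than usual.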

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from io import StringIO

url = 'http://mcubed.net/ncaab/seeds.shtml'

#Getting the website text
data = requests.get(url).text

#Parsing the website
soup = BeautifulSoup(data, "html5lib")

#Create an empty list
dflist = []

#In the html, the rows we want are not inside the <b> tags but in the text right after each one
#b.next_element.next_element is that text; StringIO makes it readable by pandas
for b in soup.find_all("b")[2:-1]:
    dflist.append(pd.read_csv(StringIO(b.next_element.next_element), sep = r'\s+', header = None))

dflist[0]

     0   1        2      3
0  vs.  #1  (23-23)  50.0%
1  vs.  #2  (40-35)  53.3%
2  vs.  #3  (25-15)  62.5%
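
If you want to try the parsing step without hitting the site, here is a minimal offline sketch. The html string is a hypothetical stand-in that I am assuming resembles the page (bold headings followed by plain-text rows); the real markup may differ. It uses the built-in html.parser and next_sibling, which in this simplified layout points at the same text node.

```python
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the page: bold headings followed by plain-text rows.
html = """<pre>
<b>#1 seeds</b>
vs. #1 (23-23) 50.0%
vs. #2 (40-35) 53.3%
<b>#2 seeds</b>
vs. #1 (35-40) 46.7%
</pre>"""

soup = BeautifulSoup(html, "html.parser")

frames = []
for b in soup.find_all("b"):
    # The rows are the text node that directly follows each <b> tag.
    block = str(b.next_sibling)
    # Any run of whitespace separates fields; blank lines are skipped by default.
    frames.append(pd.read_csv(StringIO(block), sep=r"\s+", header=None))

print(frames[0])
```

Each block becomes one four-column DataFrame, which is the shape the rest of the answer works with.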

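Cleaning and merging the DataFrames

Next, we need to format all the DataFrames in the list. I also decided to merge them into a single DataFrame, with the team name in one column and the opponent (VS) in another. This makes it easy to filter for whatever information we need.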

#We collect the melted dataframes in a new list, because melt returns a new dataframe
#rather than modifying the ones in dflist in place
meltedDF = []

#enumerate yields the team number, starting from 1, alongside each dataframe
for teamnumber, df in enumerate(dflist, start = 1):

    #Creating the team name
    name = "Team " + str(teamnumber)

    #Adding a column named after the team that holds the opponent label, built from df[0] and df[1]
    df[name] = df[0] + df[1]

    #Melting the dataframe so the team name becomes a value in its own column
    meltedDF.append(df.melt(id_vars = [0, 1, 2, 3]))

# Concat all the melted DataFrames
allTeamStats = pd.concat(meltedDF)

# Final cleaning of our new single DataFrame: rename the columns we keep and drop the rest
allTeamStats = allTeamStats.rename(columns = {2:'Record', 3:'Win Percent', 'variable':'Team' , 'value': 'VS'})\
                           .reindex(['Team', 'VS', 'Record', 'Win Percent'], axis = 1)

allTeamStats.head()

     Team     VS   Record Win Percent
0  Team 1  vs.#1  (23-23)       50.0%
1  Team 1  vs.#2  (40-35)       53.3%
2  Team 1  vs.#3  (25-15)       62.5%
3  Team 1  vs.#4  (53-22)       70.7%
4  Team 1  vs.#5   (45-9)       83.3%
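
The same long-format table can probably be built without melt by assembling the four columns directly. This is only a sketch of an alternative, not what the code above does; the dflist here is a hypothetical miniature with values taken from the output shown in this answer.

```python
import pandas as pd

# Hypothetical miniature of dflist: one 4-column frame per seed, as read_csv returns them.
dflist = [
    pd.DataFrame([["vs.", "#1", "(23-23)", "50.0%"], ["vs.", "#2", "(40-35)", "53.3%"]]),
    pd.DataFrame([["vs.", "#1", "(35-40)", "46.7%"]]),
]

tidy = []
for teamnumber, df in enumerate(dflist, start=1):
    tidy.append(pd.DataFrame({
        "Team": "Team " + str(teamnumber),  # scalar, repeated for every row
        "VS": df[0] + df[1],                # "vs." + "#1" -> "vs.#1"
        "Record": df[2],
        "Win Percent": df[3],
    }))

allTeamStats = pd.concat(tidy)
print(allTeamStats)
```

This skips the rename and reindex step, at the cost of spelling out each column by hand.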

Querying the new DataFrame

Now that all the information is in a single DataFrame, we can filter it to pull out whatever we need!

allTeamStats[allTeamStats['VS'] == 'vs.#1'].head()

     Team     VS   Record Win Percent
0  Team 1  vs.#1  (23-23)       50.0%
0  Team 2  vs.#1  (35-40)       46.7%
0  Team 3  vs.#1  (15-25)       37.5%
0  Team 4  vs.#1  (22-53)       29.3%
0  Team 5  vs.#1   (9-45)       16.7%
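
Every row in that output carries the index label 0, because pd.concat keeps each DataFrame's original index. If the repeated labels get in the way, one option is to reset the index after filtering. A small sketch on a hypothetical slice of the data (the stats frame below is made up for illustration, using values from the output above):

```python
import pandas as pd

# Hypothetical slice of allTeamStats; the repeated index labels mimic what pd.concat leaves behind.
stats = pd.DataFrame(
    {
        "Team": ["Team 1", "Team 1", "Team 2", "Team 3"],
        "VS": ["vs.#1", "vs.#2", "vs.#1", "vs.#1"],
        "Record": ["(23-23)", "(40-35)", "(35-40)", "(15-25)"],
    },
    index=[0, 1, 0, 0],
)

# Same kind of boolean filter as above, followed by a fresh 0..n-1 index.
vs_one = stats[stats["VS"] == "vs.#1"].reset_index(drop=True)
print(vs_one)
```

Passing ignore_index = True to pd.concat should have a similar effect on the combined DataFrame itself.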

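If you want an easier way to examine a team's wins and losses, we can go a step further and create two new columns that split the wins and losses out of Record.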

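#expand = False returns a Series for the single capture group; the extracted values are strings
allTeamStats['Win'] = allTeamStats['Record'].str.extract(r'\((\d+)', expand = False)
allTeamStats['Lose'] = allTeamStats['Record'].str.extract(r'\(\d+-(\d+)', expand = False)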

allTeamStats.head()

     Team     VS   Record Win Percent Win Lose
0  Team 1  vs.#1  (23-23)       50.0%  23   23
1  Team 1  vs.#2  (40-35)       53.3%  40   35
2  Team 1  vs.#3  (25-15)       62.5%  25   15
3  Team 1  vs.#4  (53-22)       70.7%  53   22
4  Team 1  vs.#5   (45-9)       83.3%  45    9
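
The Win and Lose values come out of the regex as strings, so they cannot be summed or compared numerically as they are. A minimal sketch of one way to get numbers, assuming every Record has the "(wins-losses)" form; the stats frame and the "Win Rate" column name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample with the same Record format as allTeamStats.
stats = pd.DataFrame({
    "Team": ["Team 1", "Team 1", "Team 2"],
    "VS": ["vs.#1", "vs.#2", "vs.#1"],
    "Record": ["(23-23)", "(40-35)", "(35-40)"],
})

# Pull both numbers out in one pass and convert them to integers.
wins_losses = stats["Record"].str.extract(r"\((\d+)-(\d+)\)").astype(int)
stats["Win"] = wins_losses[0]
stats["Lose"] = wins_losses[1]

# Recompute the win rate as a float so it can be sorted or averaged.
stats["Win Rate"] = stats["Win"] / (stats["Win"] + stats["Lose"])
print(stats)
```

If any Record fails to match the pattern, astype(int) will raise, so check the extracted values for missing entries first when running this on the full data.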