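I am trying to download data from a website. When I do, some rows come through that are not part of the actual data, which is obvious because their first column is not a number.
So I get something like this: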
GM_Num Date Tm
1 Monday, Apr 3 LAA
2 Tuesday, Apr 4 LAA
... ... ...
Gm# May Tm
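The last row is the kind of row I want to remove. In the actual table, several such rows are scattered randomly throughout.
Here is the code I have tried so far to remove these rows: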
import requests
import pandas as pd
url = 'https://www.baseball-reference.com/teams/LAA/2017-schedule-scores.shtml'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
df.rename(columns={"Gm#": "GM_Num"}, inplace = True)
#Attempts that didn't work:
df[df['GM_Num'].str.isdigit().isnull()]
#df[df.GM_Num.apply(lambda x: x.isnumeric())].set_index('GM_Num', inplace = True)
#df.set_index('GM_Num', inplace = True)
df
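Thanks in advance for your help!
Answer 0 (score: 0)
Let's convert the "Gm#" column to numeric and then drop the rows that failed to convert, in two steps: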
df['Gm#'] = pd.to_numeric(df['Gm#'], errors='coerce')
df = df.dropna(subset=['Gm#'])
df
Output:
Gm# Date Unnamed: 2 Tm Unnamed: 4 Opp W/L R RA \
0 1.0 Monday, Apr 3 boxscore LAA @ OAK L 2 4
1 2.0 Tuesday, Apr 4 boxscore LAA @ OAK W 7 6
2 3.0 Wednesday, Apr 5 boxscore LAA @ OAK W 5 0
3 4.0 Thursday, Apr 6 boxscore LAA @ OAK L 1 5
4 5.0 Friday, Apr 7 boxscore LAA NaN SEA W 5 1
.. ... ... ... ... ... ... ... .. ..
162 158.0 Wednesday, Sep 27 boxscore LAA @ CHW L-wo 4 6
163 159.0 Thursday, Sep 28 boxscore LAA @ CHW L 4 5
164 160.0 Friday, Sep 29 boxscore LAA NaN SEA W 6 5
165 161.0 Saturday, Sep 30 boxscore LAA NaN SEA L 4 6
167 162.0 Sunday, Oct 1 boxscore LAA NaN SEA W 6 2
Inn ... Rank GB Win Loss Save Time D/N \
0 NaN ... 3 1.0 Graveman Nolasco Casilla 2:56 N
1 NaN ... 2 1.0 Bailey Dull Bedrosian 3:17 N
2 NaN ... 2 1.0 Ramirez Cotton NaN 3:15 N
3 NaN ... 2 1.0 Triggs Skaggs NaN 2:44 D
4 NaN ... 1 Tied Chavez Gallardo NaN 2:56 N
.. ... ... ... ... ... ... ... ... ..
162 10 ... 2 20.0 Farquhar Parker NaN 3:58 N
163 NaN ... 2 21.0 Infante Chavez Minaya 3:04 N
164 NaN ... 2 21.0 Wood Rzepczynski Parker 3:01 N
165 NaN ... 2 21.0 Lawrence Bedrosian Diaz 3:32 N
167 NaN ... 2 21.0 Bridwell Simmons NaN 2:38 D
Attendance Streak Orig. Scheduled
0 36067 - NaN
1 11225 + NaN
2 13405 ++ NaN
3 13292 - NaN
4 43911 + NaN
.. ... ... ...
162 17012 - NaN
163 19596 -- NaN
164 35106 + NaN
165 38075 - NaN
167 34940 + NaN
[162 rows x 21 columns]
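The same idea can be tried without hitting the website. The sketch below uses a small hand-made frame (an assumption standing in for the scraped table, with one repeated header row) and keeps the original "Gm#" column name; if the column has already been renamed to GM_Num as in the question, the name would need to be swapped accordingly. It also shows a boolean-mask variant, which is probably what the attempt in the question was aiming for: `.str.isdigit()` already returns True/False for string values, so chaining `.isnull()` onto it yields a mask that is False everywhere and selects nothing.

```python
import pandas as pd

# Hypothetical miniature of the scraped table: the third row is a
# repeated header ("Gm#", "May", "Tm") mixed in with real game rows.
raw = pd.DataFrame(
    {
        "Gm#": ["1", "2", "Gm#", "3"],
        "Date": ["Monday, Apr 3", "Tuesday, Apr 4", "May", "Wednesday, Apr 5"],
        "Tm": ["LAA", "LAA", "Tm", "LAA"],
    }
)

# Approach from the answer: non-numeric entries become NaN, then get dropped.
df = raw.copy()
df["Gm#"] = pd.to_numeric(df["Gm#"], errors="coerce")

# The NaN values forced the column to float (1.0, 2.0, ...);
# once they are gone it can be cast back to int.
clean = df.dropna(subset=["Gm#"]).astype({"Gm#": int})

# Alternative: keep only rows whose "Gm#" string consists of digits.
mask = raw["Gm#"].astype(str).str.isdigit()
masked = raw[mask]

print(clean)
print(masked)
```

Note that the original row labels are kept (here 0, 1, 3), which is why the output above jumps from 165 to 167; calling `reset_index(drop=True)` on the result would renumber them. Also, `df[...].set_index('GM_Num', inplace=True)` as in the question modifies only the temporary filtered copy and returns None, so assigning the result of `set_index` without `inplace` is likely the safer pattern.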