Remove rows with non-numeric entries in a column (Python)

Asked: 2019-11-09 17:04:34

Tags: python pandas dataframe

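I'm trying to download data from a website. When I do, some of the rows don't belong to the actual data, which is obvious because their first column isn't a number.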

So I get something like this:

GM_Num     Date               Tm
1          Monday, Apr 3      LAA
2          Tuesday, Apr 4     LAA
...        ...                ...
Gm#        May                Tm

The last row is the kind of row I want to remove. In the actual table, several such rows are scattered throughout.

Here is the code I have tried so far to remove these rows:

import requests
import pandas as pd

url = 'https://www.baseball-reference.com/teams/LAA/2017-schedule-scores.shtml'
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
df.rename(columns={"Gm#": "GM_Num"}, inplace = True)

#Attempts that didn't work:
df[df['GM_Num'].str.isdigit().isnull()]
#df[df.GM_Num.apply(lambda x: x.isnumeric())].set_index('GM_Num', inplace = True)

#df.set_index('GM_Num', inplace = True)
df

Thanks in advance for your help!

1 Answer:

Answer 0 (score: 0)

Let's convert the "Gm#" column to numeric, so that non-numeric entries become NaN, and then drop those rows:

df['Gm#'] = pd.to_numeric(df['Gm#'], errors='coerce')
df = df.dropna(subset=['Gm#'])

df

Output:

       Gm#               Date Unnamed: 2   Tm Unnamed: 4  Opp   W/L  R RA  \
0      1.0      Monday, Apr 3   boxscore  LAA          @  OAK     L  2  4   
1      2.0     Tuesday, Apr 4   boxscore  LAA          @  OAK     W  7  6   
2      3.0   Wednesday, Apr 5   boxscore  LAA          @  OAK     W  5  0   
3      4.0    Thursday, Apr 6   boxscore  LAA          @  OAK     L  1  5   
4      5.0      Friday, Apr 7   boxscore  LAA        NaN  SEA     W  5  1   
..     ...                ...        ...  ...        ...  ...   ... .. ..   
162  158.0  Wednesday, Sep 27   boxscore  LAA          @  CHW  L-wo  4  6   
163  159.0   Thursday, Sep 28   boxscore  LAA          @  CHW     L  4  5   
164  160.0     Friday, Sep 29   boxscore  LAA        NaN  SEA     W  6  5   
165  161.0   Saturday, Sep 30   boxscore  LAA        NaN  SEA     L  4  6   
167  162.0      Sunday, Oct 1   boxscore  LAA        NaN  SEA     W  6  2   

     Inn  ... Rank    GB       Win         Loss       Save  Time D/N  \
0    NaN  ...    3   1.0  Graveman      Nolasco    Casilla  2:56   N   
1    NaN  ...    2   1.0    Bailey         Dull  Bedrosian  3:17   N   
2    NaN  ...    2   1.0   Ramirez       Cotton        NaN  3:15   N   
3    NaN  ...    2   1.0    Triggs       Skaggs        NaN  2:44   D   
4    NaN  ...    1  Tied    Chavez     Gallardo        NaN  2:56   N   
..   ...  ...  ...   ...       ...          ...        ...   ...  ..   
162   10  ...    2  20.0  Farquhar       Parker        NaN  3:58   N   
163  NaN  ...    2  21.0   Infante       Chavez     Minaya  3:04   N   
164  NaN  ...    2  21.0      Wood  Rzepczynski     Parker  3:01   N   
165  NaN  ...    2  21.0  Lawrence    Bedrosian       Diaz  3:32   N   
167  NaN  ...    2  21.0  Bridwell      Simmons        NaN  2:38   D   

    Attendance Streak Orig. Scheduled  
0        36067      -             NaN  
1        11225      +             NaN  
2        13405     ++             NaN  
3        13292      -             NaN  
4        43911      +             NaN  
..         ...    ...             ...  
162      17012      -             NaN  
163      19596     --             NaN  
164      35106      +             NaN  
165      38075      -             NaN  
167      34940      +             NaN  

[162 rows x 21 columns]
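
Note that in the output above "Gm#" is shown as floats (1.0, 2.0, ...) and the index has gaps (165 is followed by 167) where rows were dropped. Below is a minimal, self-contained sketch of the same technique that also tidies up both points. The small `raw` frame is a hypothetical stand-in for the scraped table (the "June" header row is made up), `drop_non_numeric_rows` is an illustrative helper name rather than a pandas function, and the cast to `int` assumes the surviving values are whole numbers.

```python
import pandas as pd

# Hypothetical miniature of the scraped table: repeated header rows
# ("Gm#", month name, "Tm") are mixed in with the real game rows.
raw = pd.DataFrame({
    "Gm#": ["1", "2", "Gm#", "3", "Gm#", "4"],
    "Date": ["Monday, Apr 3", "Tuesday, Apr 4", "May",
             "Wednesday, Apr 5", "June", "Thursday, Apr 6"],
    "Tm": ["LAA", "LAA", "Tm", "LAA", "Tm", "LAA"],
})


def drop_non_numeric_rows(df, column):
    """Return a copy of df keeping only rows whose `column` parses as a number."""
    # errors="coerce" turns anything that is not a number into NaN.
    numeric = pd.to_numeric(df[column], errors="coerce")
    keep = numeric.notna()
    cleaned = df.loc[keep].copy()
    # Assumption: the remaining values are whole numbers, so int is safe.
    cleaned[column] = numeric[keep].astype(int)
    # Renumber the index so the dropped rows leave no gaps.
    return cleaned.reset_index(drop=True)


clean = drop_non_numeric_rows(raw, "Gm#")
print(clean)

# A mask-based variant that leaves the column as strings:
# astype(str) guards against NaN or non-string entries before .str.isdigit().
mask = raw["Gm#"].astype(str).str.isdigit()
clean_str = raw[mask]
```

With the real data, the call would be `df = drop_non_numeric_rows(df, "Gm#")` (or `"GM_Num"` if the column has already been renamed as in the question).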