How do I get certain data from a pandas column?

Date: 2019-12-02 00:13:28

Tags: python pandas

I'm building a table and grouping it by a variable called 'passer_player_name':

data.loc[(data['play_type'] == 'pass') & (data['down'] <= 4)].groupby(by='passer_player_name')[['epa']].mean()
passer_index = data.loc[(data['play_type'] == 'pass') & (data['down'] <= 4)].groupby(by='passer_player_name')[['epa', 'success','yards_gained']].mean()
passer_index['attempts'] = data.loc[(data['play_type'] == 'pass') & (data['down'] <= 4)].groupby(by='passer_player_name')['epa'].count()

This gives the following output (a few sample rows):

                      epa  success  yards_gained  attempts
passer_player_name      
L.Jackson           0.336     0.48           6.9       335
K.Cousins           0.295     0.50           7.1       363
P.Mahomes           0.285     0.50           7.4       368

The next thing I need to do requires me to use the 'passer_player_name' column to grab/sort the table, but technically it isn't part of the table. I tried the following:

passer_index['team_names'] = data.loc[(data['play_type'] == 'pass') & (data['down'] <= 4)].groupby(by='passer_player_name').posteam

Unfortunately, that produces the following in the added 'team_names' column (this is one sample row):

(L.Jackson, [BAL, BAL, BAL, BAL, BAL, BAL, BAL...

How would I get a column that states each team name only once, i.e. a column that just shows 'BAL' (each player's team is obviously different)?

To complicate matters, since I obviously can't show the entire dataset or where the data comes from, my question essentially boils down to:

how do I go from a row that shows:

(L.Jackson, [BAL, BAL, BAL, BAL, BAL, BAL, BAL...

to a row that shows just 'BAL'? How do I pull the data out of that series/sequence/whatever it is?
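For context on what pulling a single value out of each group could look like: since 'posteam' is constant within each passer's group, a groupby aggregation such as .first() collapses the repeats to one value per passer. A minimal sketch, with a made-up mini frame standing in for `data`:

```python
import pandas as pd

# Hypothetical mini play-by-play frame standing in for `data`
data = pd.DataFrame({
    'passer_player_name': ['L.Jackson'] * 3 + ['K.Cousins'] * 2,
    'posteam': ['BAL'] * 3 + ['MIN'] * 2,
    'epa': [0.2, 0.5, 0.3, 0.1, 0.4],
})

# .first() keeps one team value per passer instead of the whole list
teams = data.groupby('passer_player_name')['posteam'].first()
print(teams)
```

The resulting Series is indexed by passer_player_name, so it can be assigned straight into passer_index as a new column.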

1 answer:

Answer 0: (score: 0)

Create a map for the team names, like this:

r = {'K.Murray': 'ARI',
 'M.Ryan': 'ATL',
 'L.Jackson': 'BAL',
 'J.Allen': 'BUF',
 'K.Allen': 'CAR',
 'M.Trubisky': 'CHI',
 'A.Dalton': 'CIN',
 'B.Mayfield': 'CLE',
 'D.Prescott': 'DAL',
 'D.Lock': 'DEN',
 'D.Blough': 'DET',
 'A.Rodgers': 'GRE',
 'D.Watson': 'HOU',
 'J.Brissett': 'IND',
 'N.Foles': 'JAC',
 'P.Mahomes': 'KAN',
 'P.Rivers': 'LOS',
 'J.Goff': 'LOS',
 'R.Fitzpatrick': 'MIA',
 'K.Cousins': 'MIN',
 'T.Brady': 'NEP',
 'D.Brees': 'NOS',
 'D.Jones': 'NYG',
 'S.Darnold': 'NYJ',
 'D.Carr': 'OAK',
 'C.Wentz': 'PHI',
 'D.Hodges': 'PIT',
 'J.Garoppolo': 'SAN',
 'R.Wilson': 'SEA',
 'J.Winston': 'TAM',
 'R.Tannehill': 'TEN',
 'D.Haskins': 'WAS'}

Then you can merge it in like this:

passer_index['team_names'] = passer_index.index.map(r)

Output:

                      epa  success  yards_gained  attempts team_names
passer_player_name                                                        
L.Jackson           0.336     0.48           6.9       335        BAL
K.Cousins           0.295     0.50           7.1       363        MIN
P.Mahomes           0.285     0.50           7.4       368        KAN
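One caveat worth noting: Index.map fills in NaN for any passer missing from the dict, so checking for NaN afterwards is a quick way to verify coverage. A minimal sketch with made-up rows standing in for passer_index:

```python
import pandas as pd

# Toy stand-in for the grouped table; the real passer_index has more columns
passer_index = pd.DataFrame(
    {'epa': [0.336, 0.295, 0.285]},
    index=pd.Index(['L.Jackson', 'K.Cousins', 'P.Mahomes'],
                   name='passer_player_name'))

r = {'L.Jackson': 'BAL', 'K.Cousins': 'MIN'}  # P.Mahomes deliberately omitted

# Missing keys come back as NaN rather than raising an error
passer_index['team_names'] = passer_index.index.map(r)
print(passer_index)
```

Any NaN in team_names points at a passer that still needs an entry in the map.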

I wrote an HTML scraper and figured I could adapt it to help you; it grabs all the pertinent info from https://fantasyfootballers.org/rb-running-back-nfl-stats/. It should be able to scrape any table on a site as long as the #Look for table section has the right 'table' index; there are usually a few tables before the data you actually want, so feel free to try it on other sites. I used it to grab the QBs from Wikipedia for you, and for that the line only needs to be table = soup.find_all('table')[0]

import requests
import csv, re
import pandas as pd
from bs4 import BeautifulSoup

#Main function
def getNFLContent(link, filename):
    #Request content
    result1 = requests.get(link)

    #Save source in var
    src1 = result1.content

    #Activate soup
    soup = BeautifulSoup(src1,'lxml')

    #Look for table
    table = soup.find_all('table')[1]

    #Save in csv
    with open(filename,'w',newline='') as f:
        writer = csv.writer(f)
        for tr in table('tr'):
            #print(tr)
            row = [t.get_text(strip=True) for t in tr(['td', 'th'])]
            writer.writerow(row)


def abrvname(x):
    #Abbreviate "First Last" to "F.Last"
    initial = x[0].capitalize()
    lnamepat = r'(\w*?$)'
    lname = re.search(lnamepat, x).groups()[0]
    return initial + '.' + lname

link = 'https://fantasyfootballers.org/rb-running-back-nfl-stats/'
filename='rbs.csv'
getNFLContent(link, filename)
df = pd.read_csv('rbs.csv')
df.insert(loc=1, column='abr_name', value=df.Name.apply(abrvname))
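The abrvname helper can be sanity-checked on its own before running the full scrape; a minimal, self-contained sketch:

```python
import re

def abrvname(x):
    # "First Last" -> "F.Last": first initial plus the final word of the name
    initial = x[0].capitalize()
    lname = re.search(r'(\w*?$)', x).groups()[0]
    return initial + '.' + lname

print(abrvname('Lamar Jackson'))    # L.Jackson
print(abrvname('Patrick Mahomes'))  # P.Mahomes
```

The non-greedy `\w*?$` pattern anchors at the end of the string, so it picks up the last word, which makes the abbreviations line up with the 'passer_player_name' format used above.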