Trouble merging scraped data using Pandas and numpy in Python

Date: 2017-08-15 17:00:14

Tags: python pandas numpy web-scraping beautifulsoup

I am trying to gather information from many different URLs and merge the data based on year and the golfer's name. As of right now, I am trying to write the information to CSV and then match it up with pd.merge(), but I have to use a unique name for every dataframe in order to merge. I tried to use a numpy array, but I am still stuck on the process of getting all of the separate data merged at the end.

import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import datetime
import socket
import urllib.error
import pandas as pd
import urllib
import sqlalchemy
import numpy as np

base = 'http://www.pgatour.com/'
inn = 'stats/stat'
end = '.html'
years = ['2017','2016','2015','2014','2013']

alpha = []
#all pages with links to tables
urls = ['http://www.pgatour.com/stats.html','http://www.pgatour.com/stats/categories.ROTT_INQ.html','http://www.pgatour.com/stats/categories.RAPP_INQ.html','http://www.pgatour.com/stats/categories.RARG_INQ.html','http://www.pgatour.com/stats/categories.RPUT_INQ.html','http://www.pgatour.com/stats/categories.RSCR_INQ.html','http://www.pgatour.com/stats/categories.RSTR_INQ.html','http://www.pgatour.com/stats/categories.RMNY_INQ.html','http://www.pgatour.com/stats/categories.RPTS_INQ.html']
for i in urls:
    data = urlopen(i)
    soup = BeautifulSoup(data, "html.parser")
    for link in soup.find_all('a'):
        if link.has_attr('href'):
            alpha.append(base + link['href'][17:]) #may need adjusting
#data links
beta = []
for i in alpha:
    if inn in i:
        beta.append(i)
#no repeats
gamma= []
for i in beta:
    if i not in gamma:
        gamma.append(i)

#making list of urls with Statistic labels
jan = []
for i in gamma:
    try:
        data = urlopen(i)
        soup = BeautifulSoup(data, "html.parser")
        for table in soup.find_all('section',{'class':'module-statistics-off-the-tee-details'}):
            for j in table.find_all('h3'):
                y=j.get_text().replace(" ","").replace("-","").replace(":","").replace(">","").replace("<","").replace(">","").replace(")","").replace("(","").replace("=","").replace("+","")
                jan.append([i,str(y+'.csv')])
                print([i,str(y+'.csv')])
    except Exception as e:
            print(e)
            pass

# practice url
#jan = [['http://www.pgatour.com/stats/stat.02356.html', 'Last15EventsScoring.csv']]
#grabbing data
#write to csv
row_sp = []
rows_sp =[]
title1 = [] 
title = []  
for i in jan:
    try:
        with open(i[1], 'w+') as fp:
            writer = csv.writer(fp)
            for y in years:
                data = urlopen(i[0][:-4] +y+ end)
                soup = BeautifulSoup(data, "html.parser")
                data1 = urlopen(i[0])
                soup1 = BeautifulSoup(data1, "html.parser")
                for table in soup1.find_all('table',{'id':'statsTable'}):
                    title.append('year')
                    for k in table.find_all('tr'):
                        for n in k.find_all('th'):
                            title1.append(n.get_text())
                            for l in title1:
                                if l not in title:
                                    title.append(l)
                    rows_sp.append(title)
                for table in soup.find_all('table',{'id':'statsTable'}):
                    for h in table.find_all('tr'):
                        row_sp = [y]
                        for j in h.find_all('td'):
                            row_sp.append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
                        rows_sp.append(row_sp)
                        print(row_sp)
                        writer.writerows([row_sp])
    except Exception as e:
        print(e)
        pass

from functools import reduce  # needed for reduce()

dfs = [df1, df2, df3] # store dataframes in one list
df_merge = reduce(lambda left, right: pd.merge(left, right, on=['v1'], how='outer'), dfs)

URL, stat type, desired format ... and everything in between; just trying to get the data onto one row. URLs for the data below:
['http://www.pgatour.com/stats/stat.02356.html', 'http://www.pgatour.com/stats/stat.02568.html', ..., 'http://www.pgatour.com/stats/stat.111.html']

Statistic titles

LAST 15 EVENTS - SCORING, SG: APPROACH-THE-GREEN, ..., SAND SAVE PERCENTAGE
year rankthisweek  ranklastweek   name         events   rating    rounds avg
2017 2             3             Rickie Fowler  10      8.8       62    .614    
TOTAL SG:APP   MEASURED ROUNDS   .... %     # SAVES    # BUNKERS    TOTAL O/U PAR
26.386         43                ....70.37    76           108          +7.00

1 Answer:

Answer 0 (score: 3)

UPDATE (per comments)
This question is partly about a technical method (Pandas merge()), but it also seems like an opportunity to discuss a useful workflow for data collection and cleaning. As a result, I've added a bit more detail and explanation than a coding solution strictly requires.

You can basically use the same approach as my original answer to get data from your different URL categories. I'd recommend keeping a dict of {url: data} dicts as you iterate over your list of URLs, and then building the cleaned dataframes from that dict.

Setting up the cleaning portion involves a bit of legwork, since you need to adjust for the different columns in each URL category. I've demonstrated a manual approach, using only a few test URLs. But if you have, say, thousands of different URL categories, then you may need to think about how to collect and organize the column names programmatically. That feels out of scope for this question.

As long as you're sure there is a year and PLAYER NAME field in each URL, the merge below will work. As before, let's assume that you don't need to write to CSV, and for now let's hold off on making any optimizations to your scraping code:

First, define the URL categories in urls. By url category I'm referring to the fact that http://www.pgatour.com/stats/stat.02356.html will actually be used multiple times by inserting a series of years into the URL itself, e.g.: http://www.pgatour.com/stats/stat.02356.2017.html, http://www.pgatour.com/stats/stat.02356.2016.html. In this example, stat.02356.html is the URL category, containing information about multiple years of player data.

import pandas as pd

# test urls given by OP
# note: each url contains >= 1 data fields not shared by the others
urls = ['http://www.pgatour.com/stats/stat.02356.html',
        'http://www.pgatour.com/stats/stat.02568.html',
        'http://www.pgatour.com/stats/stat.111.html']

# we'll store data from each url category in this dict.
url_data = {}

Now iterate over urls. Inside the urls loop, this code is all the same as my original answer, which in turn came from the OP; only a few variable names have been adjusted to reflect our new capture method.

for url in urls:
    print("url: ", url)
    url_data[url] = {"row_sp": [],
                     "rows_sp": [],
                     "title1": [],
                     "title": []}
    try:
        #with open(i[1], 'w+') as fp:
            #writer = csv.writer(fp)
        for y in years:
            current_url = url[:-4] +y+ end
            print("current url is: ", current_url)
            data = urlopen(current_url)
            soup = BeautifulSoup(data, "html.parser")
            data1 = urlopen(url)
            soup1 = BeautifulSoup(data1, "html.parser")
            for table in soup1.find_all('table',{'id':'statsTable'}):
                url_data[url]["title"].append('year')
                for k in table.find_all('tr'):
                    for n in k.find_all('th'):
                        url_data[url]["title1"].append(n.get_text())
                        for l in url_data[url]["title1"]:
                            if l not in url_data[url]["title"]:
                                url_data[url]["title"].append(l)
                url_data[url]["rows_sp"].append(url_data[url]["title"])
            for table in soup.find_all('table',{'id':'statsTable'}):
                for h in table.find_all('tr'):
                    url_data[url]["row_sp"] = [y]
                    for j in h.find_all('td'):
                        url_data[url]["row_sp"].append(j.get_text().replace(" ","").replace("\n","").replace("\xa0"," ").replace("d",""))
                    url_data[url]["rows_sp"].append(url_data[url]["row_sp"])
                    #print(row_sp)
                    #writer.writerows([row_sp])
    except Exception as e:
        print(e)
        pass

Now, for each key url in url_data, rows_sp contains the data you're interested in for that particular URL category.
Note that rows_sp will actually be url_data[url]["rows_sp"] as we iterate over url_data, but the next few code blocks come from my original answer and therefore use the old rows_sp variable name.

# example rows_sp
[['year', 'RANK THIS WEEK', 'RANK LAST WEEK', 'PLAYER NAME', 'EVENTS', 'RATING',
  'year', 'year', 'year', 'year'],
 ['2017'],
 ['2017', '1', '1', 'Sam Burns', '1', '9.2'],
 ['2017', '2', '3', 'Rickie Fowler', '10', '8.8'],
 ['2017', '2', '2', 'Dustin Johnson', '10', '8.8'],
 ['2017', '2', '3', 'Whee Kim', '2', '8.8'],
 ['2017', '2', '3', 'Thomas Pieters', '3', '8.8'],
 ...
]

Writing rows_sp directly to a dataframe shows that the data aren't in quite the right format:

pd.DataFrame(rows_sp).head()
      0               1               2               3       4       5     6  \
0  year  RANK THIS WEEK  RANK LAST WEEK     PLAYER NAME  EVENTS  RATING  year
1  2017            None            None            None    None    None  None
2  2017               1               1       Sam Burns       1     9.2  None
3  2017               2               3   Rickie Fowler      10     8.8  None
4  2017               2               2  Dustin Johnson      10     8.8  None

      7     8     9
0  year  year  year
1  None  None  None
2  None  None  None
3  None  None  None
4  None  None  None

pd.DataFrame(rows_sp).dtypes
0    object
1    object
2    object
3    object
4    object
5    object
6    object
7    object
8    object
9    object
dtype: object

With a little cleanup, we can get rows_sp into a dataframe with appropriate numeric data types:

df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
df.columns = ["year","RANK THIS WEEK","RANK LAST WEEK",
              "PLAYER NAME","EVENTS","RATING",
              "year1","year2","year3","year4"]
df.drop(["year1","year2","year3","year4"], 1, inplace=True)
df = df.loc[df["PLAYER NAME"].notnull()]
df = df.loc[df.year != "year"]
num_cols = ["RANK THIS WEEK","RANK LAST WEEK","EVENTS","RATING"]
df[num_cols] = df[num_cols].apply(pd.to_numeric)

df.head()
   year  RANK THIS WEEK  RANK LAST WEEK     PLAYER NAME  EVENTS  RATING
2  2017               1             1.0       Sam Burns       1     9.2
3  2017               2             3.0   Rickie Fowler      10     8.8
4  2017               2             2.0  Dustin Johnson      10     8.8
5  2017               2             3.0        Whee Kim       2     8.8
6  2017               2             3.0  Thomas Pieters       3     8.8

UPDATED CLEANING
Now that we have a series of URL categories to contend with, each with a different set of fields to clean, the section above gets a little more complicated. If you only have a few pages, it may be feasible to visually review the fields for each category and store them, like so:

cols = {'stat.02568.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
                                      'PLAYER NAME', 'ROUNDS', 'AVERAGE',
                                      'TOTAL SG:APP', 'MEASURED ROUNDS',
                                      'year1', 'year2', 'year3', 'year4'],
                           'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS',
                                      'AVERAGE', 'TOTAL SG:APP', 'MEASURED ROUNDS',]
                          },
        'stat.111.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
                                    'PLAYER NAME', 'ROUNDS', '%', '# SAVES',
                                    '# BUNKERS', 'TOTAL O/U PAR',
                                    'year1', 'year2', 'year3', 'year4'],
                         'numeric':['RANK THIS WEEK', 'RANK LAST WEEK', 'ROUNDS',
                                    '%', '# SAVES', '# BUNKERS', 'TOTAL O/U PAR']
                        },
        'stat.02356.html':{'columns':['year', 'RANK THIS WEEK', 'RANK LAST WEEK',
                                      'PLAYER NAME', 'EVENTS', 'RATING',
                                      'year1', 'year2', 'year3', 'year4'],
                           'numeric':['RANK THIS WEEK', 'RANK LAST WEEK',
                                      'EVENTS', 'RATING']
                          }
       }
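As mentioned above, if the number of URL categories grows beyond what you can catalogue by hand, one option is to derive the same mapping from the scraped header rows instead. Here is a rough, untested sketch of that idea; auto_cols is a hypothetical name, the code assumes url_data has been populated as above, and it treats every field other than PLAYER NAME and the year columns as numeric, which you would want to verify per category:

auto_cols = {}
for url, data in url_data.items():
    page = url.split("/")[-1]
    names, extra = [], 0
    for name in data["rows_sp"][0]:          # the scraped header row
        if name == "year" and names:         # rename trailing duplicate 'year' entries
            extra += 1
            names.append("year%d" % extra)
        else:
            names.append(name)
    # assume everything except the player name and the year columns is numeric
    numeric = [n for n in names if n != "PLAYER NAME" and not n.startswith("year")]
    auto_cols[page] = {"columns": names, "numeric": numeric}

If that holds up for your pages, auto_cols could stand in for the hand-built cols in the loop below.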

Then you can loop through url_data again and store the cleaned dataframes in a dfs collection:

dfs = {}

for url in url_data:
    page = url.split("/")[-1]
    colnames = cols[page]["columns"]
    num_cols = cols[page]["numeric"]
    rows_sp = url_data[url]["rows_sp"]
    df = pd.DataFrame(rows_sp, columns=rows_sp[0]).drop(0)
    df.columns = colnames
    df.drop(["year1","year2","year3","year4"], 1, inplace=True)
    df = df.loc[df["PLAYER NAME"].notnull()]
    df = df.loc[df.year != "year"]
    # tied ranks (e.g. "T9") mess up to_numeric; remove the tie indicators.
    df["RANK THIS WEEK"] = df["RANK THIS WEEK"].str.replace("T","")
    df["RANK LAST WEEK"] = df["RANK LAST WEEK"].str.replace("T","")
    df[num_cols] = df[num_cols].apply(pd.to_numeric)
    dfs[url] = df

At this point, we're ready to merge all the different data categories on year and PLAYER NAME. (You could actually merge iteratively inside the cleaning loop, but I've separated the steps here for illustrative purposes.)

master = pd.DataFrame()
for url in dfs:
    if master.empty:
        master = dfs[url]
    else:
        master = master.merge(dfs[url], on=['year','PLAYER NAME'])

Now master contains the merged data for each player-year. Here's a view into the data, using groupby():
master.groupby(["PLAYER NAME", "year"]).first().head(4)

                  RANK THIS WEEK_x  RANK LAST WEEK_x  EVENTS  RATING  \
PLAYER NAME year
Aam Hawin   2015                66              66.0       7     8.2
            2016                80              80.0      12     8.1
            2017                72              45.0       8     8.2
Aam Scott   2013                45              45.0      10     8.2

                  RANK THIS WEEK_y  RANK LAST WEEK_y  ROUNDS_x  AVERAGE  \
PLAYER NAME year
Aam Hawin   2015               136               136        95   -0.183
            2016               122               122        93   -0.061
            2017                56                52        84    0.296
Aam Scott   2013                16                16        61    0.548

                  TOTAL SG:APP  MEASURED ROUNDS  RANK THIS WEEK  \
PLAYER NAME year
Aam Hawin   2015       -14.805               81              86
            2016        -5.285               87              39
            2017        18.067               61               8
Aam Scott   2013        24.125               44              57

                  RANK LAST WEEK  ROUNDS_y      %  # SAVES  # BUNKERS  \
PLAYER NAME year
Aam Hawin   2015              86        95  50.96       80        157
            2016              39        93  54.78       86        157
            2017               6        84  61.90       91        147
Aam Scott   2013              57        61  53.85       49         91

                  TOTAL O/U PAR
PLAYER NAME year
Aam Hawin   2015           47.0
            2016           43.0
            2017           27.0
Aam Scott   2013           11.0

You may want to do some additional cleaning of the merged columns, as some are duplicated across the data categories (e.g. ROUNDS_x and ROUNDS_y). From what I can tell, the duplicated field names appear to contain exactly the same information, so you might just drop the _y version of each.
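For example, a minimal (unverified) way to drop those duplicates, assuming the _y columns really do repeat the _x information, might be:

# drop the '_y' copies of columns that were duplicated by the merge
dupes = [c for c in master.columns if c.endswith("_y")]
master = master.drop(dupes, axis=1)

Renaming the remaining _x columns is left as a follow-up, since some fields (e.g. RANK THIS WEEK) also appear without a suffix and would collide.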