PANDAS Web Scraping Multiple Pages

Time: 2017-11-09 18:40:48

Tags: python pandas web web-scraping beautifulsoup

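I am scraping data from multiple pages of the website given below using Beautiful Soup, and that works. Can I scrape data from multiple pages using pandas? Below is the code that scrapes a single page; the other pages are reached through URLs of the form http://www.example.org/whats-on/calendar?page=3.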

import pandas as pd
url = 'http://www.example.org/whats-on/calendar?page=3'
dframe = pd.read_html(url,header=0)
dframe[0]
dframe[0].to_csv('out.csv')

2 Answers:

Answer 0 (score: 1)

Simply loop over the range of page numbers and append each page's table to a list of data frames. Then concatenate them into one data frame and write it to a single file. One issue with your current code is that header=0 treats the first row as the column headers, but the tables on these pages have no header row. So use header=None and then rename the columns.

The code below scrapes pages 0 - 3. Extend the loop limit for other pages.

import pandas as pd

dfs = []

# PAGES 0 - 3 SCRAPE
url = 'http://www.lapl.org/whats-on/calendar?page={}'
for i in range(4):
    # First table on the page; no header row, so name the columns by position
    dframe = pd.read_html(url.format(i), header=None)[0]\
               .rename(columns={0:'Date', 1:'Topic', 2:'Location', 3:'People', 4:'Category'})
    dfs.append(dframe)

# Stack all pages into one data frame and save it
finaldf = pd.concat(dfs)
finaldf.to_csv('Output.csv')

Output

print(finaldf.head())
#                                    Date                                              Topic                                         Location                             People                     Category
#  0  Thu, Nov 09, 201710:00am to 12:30pm  California Healthier Living : A Chronic Diseas...                West Los Angeles Regional Library                            Seniors                       Health
#  1  Thu, Nov 09, 201710:00am to 11:30am  Introduction to Microsoft WordLearn the basics...  North Hollywood Amelia Earhart Regional Library       Adults, Job Seekers, Seniors               Computer Class
#  2             Thu, Nov 09, 201711:00am                     Board of Library Commissioners                                  Central Library                             Adults                      Meeting
#  3   Thu, Nov 09, 201712:00pm to 1:00pm  Tech TryOutCentral Library LobbyDid you know t...                                  Central Library                      Adults, Teens               Computer Class
#  4   Thu, Nov 09, 201712:00pm to 1:30pm  Taller de Tejido/ Crochet WorkshopLearn how to...                 Benjamin Franklin Branch Library  Adults, Seniors, Spanish Speakers  Arts and Crafts, En Español
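
The loop in this answer can also be wrapped in a small function so that the page range becomes a parameter instead of a hard-coded limit. What follows is only a minimal sketch and not part of the original answer: the names `scrape_pages`, `COLUMNS` and `fake_reader` are hypothetical, and the `reader` parameter (defaulting to `pd.read_html`) is an assumption added so the logic can be tried without network access.

```python
import pandas as pd

# Column names from the answer, keyed by column position
COLUMNS = {0: 'Date', 1: 'Topic', 2: 'Location', 3: 'People', 4: 'Category'}


def scrape_pages(url_template, pages, reader=pd.read_html):
    """Read the first table from each page and stack them into one frame.

    `reader` is a hypothetical hook: any callable that takes a URL and a
    `header` keyword and returns a list of DataFrames, as pd.read_html does.
    """
    dfs = []
    for i in pages:
        tables = reader(url_template.format(i), header=None)
        dfs.append(tables[0].rename(columns=COLUMNS))
    # ignore_index=True renumbers the rows instead of repeating each page's labels
    return pd.concat(dfs, ignore_index=True)


def fake_reader(url, header=None):
    # Stand-in for pd.read_html: returns one single-row table per "page"
    page = url.rsplit('=', 1)[-1]
    return [pd.DataFrame([[f'date {page}', 'topic', 'place', 'people', 'cat']])]


demo = scrape_pages('http://www.lapl.org/whats-on/calendar?page={}',
                    range(4), reader=fake_reader)
print(demo)
```

Against the real site the call would simply omit `reader`, for example `scrape_pages(url, range(4))`. Note that `pd.read_html` needs an HTML parser such as lxml to be installed.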

Answer 1 (score: 0)

The code below loops over the pages in the given range and appends each page's table to a data frame with the selected fields.

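import pandas as pd

def get_from_website():
    # Collect the first table from each page, then concatenate once
    # (DataFrame.append was removed in pandas 2.0)
    frames = []
    for num in range(1, 6):  # pages 1 to 5
        website = 'https://weburl/?page=' + str(num)
        datalist = pd.read_html(website)
        frames.append(datalist[0])
    Sample = pd.concat(frames)
    Sample.columns = ['Field1', 'Field2', 'Field3', 'Field4', 'Field5', 'Field6', 'Time', 'Field7', 'Field8']
    return Sample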
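
One detail worth checking when stacking pages this way: by default each table returned by `pd.read_html` carries its own row index starting at 0, so the combined frame repeats row labels unless they are renumbered. The sketch below illustrates this with two small hand-made frames; `page1` and `page2` are made-up stand-ins for scraped tables, not data from any real site.

```python
import pandas as pd

# Hypothetical stand-ins for the first table of two scraped pages
page1 = pd.DataFrame({'Field1': ['a', 'b'], 'Time': ['10:00', '11:00']})
page2 = pd.DataFrame({'Field1': ['c', 'd'], 'Time': ['12:00', '13:00']})

# Plain concat keeps each page's own row labels
stacked = pd.concat([page1, page2])
# ignore_index=True assigns fresh labels 0..n-1
renumbered = pd.concat([page1, page2], ignore_index=True)

print(list(stacked.index))     # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```

Whether repeated labels matter depends on what happens next: writing to CSV works either way, but label-based lookups such as `stacked.loc[0]` return one row per page rather than a single row.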