How to use Python and BeautifulSoup4

Date: 2015-06-25 23:24:37

Tags: python loops csv web-scraping beautifulsoup

I am trying to scrape data from the PGA.com website to build a table of every golf course in the United States. In my CSV table I want to include the course name, address, ownership, website, and phone number. With this data I plan to geocode it, put it on a map, and keep a local copy on my computer.

I am using Python and Beautiful Soup 4 to extract my data. I have gotten as far as extracting the data and importing it into a CSV, but I am now stuck on scraping data from the multiple pages on the PGA website. I want to pull every golf course, but my script is limited to one page; I want it to loop so that it captures all the golf course data from every page of the PGA site. There are roughly 18,000 golf courses and about 900 pages to capture.

I need help writing code that will capture all of the data from the PGA website, not just a single page, so that it gives me data on every golf course in the United States.

Here is my script:

import csv
import requests 
from bs4 import BeautifulSoup
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"

r = requests.get(url)

soup = BeautifulSoup(r.content)

g_data1=soup.find_all("div",{"class":"views-field-nothing-1"})
g_data2=soup.find_all("div",{"class":"views-field-nothing"})

courses_list=[]

for item in g_data2:
     try:
          name=item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
     except:
          name=''
     try:
          address1=item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
     except:
          address1=''
     try:
          address2=item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
     except:
          address2=''
     try:
          website=item.contents[1].find_all("div",{"class":"views-field-website"})[0].text
     except:
          website=''   
     try:
          Phonenumber=item.contents[1].find_all("div",{"class":"views-field-work-phone"})[0].text
     except:
          Phonenumber=''      

     course=[name,address1,address2,website,Phonenumber]
     courses_list.append(course)

     with open ('filename5.csv','wb') as file:
          writer=csv.writer(file)
          for row in courses_list:
               writer.writerow(row)    

#for item in g_data1:
     #try:
          #print item.contents[1].find_all("div",{"class":"views-field-counter"})[0].text
     #except:
          #pass  
     #try:
          #print item.contents[1].find_all("div",{"class":"views-field-course-type"})[0].text
     #except:
          #pass

#for item in g_data2:
   #try:
      #print item.contents[1].find_all("div",{"class":"views-field-title"})[0].text
   #except:
      #pass
   #try:
      #print item.contents[1].find_all("div",{"class":"views-field-address"})[0].text
   #except:
      #pass
   #try:
      #print item.contents[1].find_all("div",{"class":"views-field-city-state-zip"})[0].text
   #except:
      #pass

This script only captures 20 courses at a time, and I want a single script that captures all 18,000 golf courses across the 900 pages.

6 Answers:

Answer 0 (score: 5)

The PGA website's search has multiple pages, and the URLs follow this pattern:

http://www.pga.com/golf-courses/search?page=1 # Additional info after page parameter here

This means you can read the contents of a page, then change the page value by 1 and read the next page, and so on.

import csv
import requests 
from bs4 import BeautifulSoup
for i in range(907):      # Number of pages plus one 
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content)

    # Your code for each individual page here 
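
For completeness, here is a minimal sketch of what that per-page body might look like. It assumes the markup classes from the question are the same on every page, and it writes the CSV once, after the whole loop, instead of rewriting it for every item:

import csv
import requests
from bs4 import BeautifulSoup

def text_or_blank(item, css_class):
    # Return the text of the first matching div, or '' if it is missing
    tag = item.find("div", {"class": css_class})
    return tag.text.strip() if tag else ''

courses_list = []
for i in range(907):      # page=0 through page=906
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    for item in soup.find_all("div", {"class": "views-field-nothing"}):
        courses_list.append([
            text_or_blank(item, "views-field-title"),
            text_or_blank(item, "views-field-address"),
            text_or_blank(item, "views-field-city-state-zip"),
            text_or_blank(item, "views-field-website"),
            text_or_blank(item, "views-field-work-phone"),
        ])

# Write everything once, after all pages have been scraped
with open('filename5.csv', 'w', newline='') as f:
    csv.writer(f).writerows(courses_list)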

Answer 1 (score: 3)

If you are still reading this post, you can try this code too....

from urllib.request import urlopen
from bs4 import BeautifulSoup

file = "Details.csv"
f = open(file, "w")
Headers = "Name,Address,City,Phone,Website\n"
f.write(Headers)
for page in range(1,5):
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(page)
    html = urlopen(url)
    soup = BeautifulSoup(html,"html.parser")
    Title = soup.find_all("div", {"class":"views-field-nothing"})
    for i in Title:
        try:
            name = i.find("div", {"class":"views-field-title"}).get_text()
            address = i.find("div", {"class":"views-field-address"}).get_text()
            city = i.find("div", {"class":"views-field-city-state-zip"}).get_text()
            phone = i.find("div", {"class":"views-field-work-phone"}).get_text()
            website = i.find("div", {"class":"views-field-website"}).get_text()
            print(name, address, city, phone, website)
            f.write("{}".format(name).replace(",","|")+ ",{}".format(address)+ ",{}".format(city).replace(",", " ")+ ",{}".format(phone) + ",{}".format(website) + "\n")
        except AttributeError:
            pass
f.close()

I wrote range(1, 5) only as an example; change it to run from 0 through the last page and you will get all the details in the CSV. I tried very hard to get your data into the right format, but it is tricky :).

Answer 2 (score: 2)

You are linking to a single page; the script is not going to iterate through each page on its own.

Page 1:

url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"

Page 2:

http://www.pga.com/golf-courses/search?page=1&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0

Page 907: http://www.pga.com/golf-courses/search?page=906&searchbox=Course%20Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0

Since you are only running it for page 1, you only get 20 results. You need to create a loop that runs through every page.

You can start by writing a function that handles one page and then iterating that function, as sketched below.

After search? in the URL, starting with page 2, a page=1 parameter appears and keeps incrementing until page=906 for page 907.
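
A minimal sketch of that structure, assuming the same markup classes as in the question, might look like this:

import requests
from bs4 import BeautifulSoup

def scrape_page(page):
    # Fetch one results page and return its course containers
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(page)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    return soup.find_all("div", {"class": "views-field-nothing"})

all_items = []
for page in range(907):   # page=0 through page=906
    all_items.extend(scrape_page(page))
    # then parse each entry in all_items the same way as in the question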

Answer 3 (score: 0)

I noticed that the first solution repeats the first batch of results; that is because page 0 and page 1 are the same page. This is solved by specifying the starting page in the range function. Example below...

import requests
from bs4 import BeautifulSoup

for i in range(1, 907):   # Start at 1 because page 0 and page 1 are the same page
    url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(i)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html5lib")   # Use whichever parser you prefer

    # Your code for each individual page here

Answer 4 (score: 0)

I ran into the same problem and the solutions above did not work for me. I solved it by taking cookies into account. A requests session helps: create one session, and reuse it (and the cookies it picks up) to pull every numbered page you need, as shown below.

import csv
import requests 
from bs4 import BeautifulSoup
url = "http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0"

s = requests.Session()
r = s.get(url)
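
The snippet above only sets up the session; a minimal sketch of reusing it for the numbered pages (my assumption, with the per-page parsing the same as in the question) might look like this:

import requests
from bs4 import BeautifulSoup

s = requests.Session()
# Prime the session so it picks up the site's cookies from the first search page
s.get("http://www.pga.com/golf-courses/search?searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0")

for page in range(907):
    page_url = "http://www.pga.com/golf-courses/search?page={}&searchbox=Course+Name&searchbox_zip=ZIP&distance=50&price_range=0&course_type=both&has_events=0".format(page)
    soup = BeautifulSoup(s.get(page_url).content, "html.parser")
    # ... parse soup exactly as in the question's per-item loop ...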

Answer 5 (score: 0)

The PGA website has changed since this question was asked.

It looks like they now organize all of the courses by: State > City > Course.

Given that change, and how popular this question is, here is how I would solve the problem today.

Step 1 - Import everything we need:

import time
import random
from gazpacho import Soup   # https://github.com/maxhumber/gazpacho
from tqdm import tqdm       # to keep track of progress

Step 2 - Scrape all of the state URL endpoints:

URL = "https://www.pga.com"

def get_state_urls():
    soup = Soup.get(URL + "/play")
    a_tags = soup.find("ul", {"data-cy": "states"}, mode="first").find("a")
    state_urls = [URL + a.attrs['href'] for a in a_tags]
    return state_urls

state_urls = get_state_urls()

Step 3 - Write a function to scrape all of the city links:

def get_state_cities(state_url):
    soup = Soup.get(state_url)
    a_tags = soup.find("ul", {"data-cy": "city-list"}).find("a")
    state_cities = [URL + a.attrs['href'] for a in a_tags]
    return state_cities

state_url = state_urls[0]
city_links = get_state_cities(state_url)

Step 4 - Write a function to scrape all of the courses:

def get_courses(city_link):
    soup = Soup.get(city_link)
    courses = soup.find("div", {"class": "MuiGrid-root MuiGrid-item MuiGrid-grid-xs-12 MuiGrid-grid-md-6"}, mode="all")
    return courses

city_link = city_links[0]
courses = get_courses(city_link)

Step 5 - Write a function to parse all of the useful information about a course:


def parse_course(course):
    return {
        "name": course.find("h5", mode="first").text,
        "address": course.find("div", {'class': "jss332"}, mode="first").strip(),
        "url": course.find("a", mode="first").attrs["href"]
    }

course = courses[0]
parse_course(course)

Step 6 - Loop over everything and save it:

all_courses = []
for state_url in tqdm(state_urls):
    city_links = get_state_cities(state_url)
    time.sleep(random.uniform(1, 10) / 10)
    for city_link in city_links:
        courses = get_courses(city_link)
        time.sleep(random.uniform(1, 10) / 10)
        for course in courses:
            info = parse_course(course)
            all_courses.append(info)
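
Step 6 only collects everything into all_courses; as a small addition of my own (not part of the original answer), the collected dicts could be written out with the standard library, for example to an assumed courses.csv:

import csv

# Dump the collected course dicts to a CSV file (filename is an assumption)
with open("courses.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "address", "url"])
    writer.writeheader()           # header row: name, address, url
    writer.writerows(all_courses)  # one row per parsed course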