Web Scraping with Python - selecting div, h2 and h3 classes

Date: 2016-06-23 14:17:22

Tags: python html beautifulsoup

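This is my first time using Python and doing web scraping. I have been looking around and still cannot work out how to do what I need.

Below is a screenshot of the elements I am working with, as inspected in Chrome.

What I am trying to do is get the apartment names and addresses for each city in the city list.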

List of Apartments in a selected city

import requests
from bs4 import BeautifulSoup

#url = 'http://www.homestead.ca/apartments-for-rent/'                           
rootURL = 'http://www.homestead.ca'
response = requests.get(rootURL)                                                   
html = response.content
soup = BeautifulSoup(html,'lxml')

dropdown_list = soup.select(".primary .child-pages a")

#city_names=[dropdown_list_value.text for dropdown_list_value in dropdown_list]
#print (city_names)

cityLinks=[rootURL + dropdown_list_value['href'] for dropdown_list_value in dropdown_list]

for cityLinks_select in dropdown_list:                                       #Looping each city from the Apartment drop down list
    print ('Selecting city:',cityLinks_select.text)
    cityResponse = requests.get(cityLinks)
    cityHtml = cityResponse.content
    citySoup = BeautifulSoup(cityHtml,'lxml')

    community_list = soup.select(".extended-search .property-container a[h2 h3]")
    # get and print the apartment link
    # get and print the apartment name
    # get and print the address of the apartment

1 Answer:

Answer 0 (score: 3)

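As I commented, some of the data is created dynamically. If we look at the page source itself, we see client-side template placeholders rather than real values: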

                        <div class="content">
                                    <div class="title-container">
                                        <h2 class="building-name"><%= building.get('name') %></h2>
                                        <h3 class="address"><%= building.get('address').address %></h3>
                                    </div>

                                    <div class="rent">
                                        <h4 class="sub-title">Rent from</h4>
                                        <% if (building.get('statistics').suites.rates.min !== 'undefined') { %>
                                            <% $min_rate = commaSeparateNumber(parseInt(building.get('statistics').suites.rates.min)); %>
                                            <span class="rent-value">$<%= $min_rate %></span>
                                        <% } %>
                                    </div>
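
A quick way to confirm that a page is rendered client-side is to look for unrendered template tags in the raw HTML that requests returns. The following is a minimal sketch, assuming the placeholders use the `<% ... %>` and `<%= ... %>` delimiters shown above; the helper name `find_template_tags` is hypothetical.

```python
import re

# Matches ERB/Underscore-style delimiters such as <%= ... %> and <% ... %>.
TEMPLATE_TAG = re.compile(r"<%[=-]?.*?%>", re.DOTALL)


def find_template_tags(html):
    """Return the unrendered template tags found in a raw HTML string."""
    return TEMPLATE_TAG.findall(html)


# One line taken from the page source above.
sample = '<h2 class="building-name"><%= building.get(\'name\') %></h2>'
tags = find_template_tags(sample)
print(tags)
```

If the list is non-empty for the element you want, the value is filled in by JavaScript after the page loads, so BeautifulSoup alone will not see it.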

What we can get from the static source is the building name, address and phone number:

cityLinks = [rootURL + dropdown_list_value['href'] for dropdown_list_value in dropdown_list]

# you need to iterate over the joined urls
for city in cityLinks:  # Looping each city from the Apartment drop down list
    cityResponse = requests.get(city)
    cityHtml = cityResponse.content
    citySoup = BeautifulSoup(cityHtml, 'lxml')
    # all the info we can parse is inside the div class="building-info"
    for div in citySoup.select("div.building-info"):
        print(div.select_one("h1.building-name").text.strip())
        print(div.select_one("h2.location").text.strip())
        print(div.select_one("div.contact-container div.phone").text.strip())
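
Concatenating `rootURL + href` works as long as every `href` is a root-relative path. A slightly more defensive sketch uses `urllib.parse.urljoin` from the standard library (Python 3), which also copes with absolute links. The `hrefs` values below are hypothetical and only illustrate the shapes a link can take.

```python
from urllib.parse import urljoin

root_url = "http://www.homestead.ca"

# Hypothetical href values, only to illustrate three common shapes.
hrefs = [
    "/apartments-for-rent/brantford",                      # root-relative
    "apartments-for-rent/kingston",                        # relative
    "http://www.homestead.ca/apartments-for-rent/ottawa",  # already absolute
]

# The trailing slash makes relative hrefs resolve against the site root.
city_links = [urljoin(root_url + "/", href) for href in hrefs]

for link in city_links:
    print(link)
```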

If we mimic the ajax request that the page makes, we can get all of the data in JSON format:

import requests
from bs4 import BeautifulSoup
from pprint import pprint as pp

rootURL = 'http://www.homestead.ca'
response = requests.get(rootURL)
html = response.content
soup = BeautifulSoup(html, 'lxml')

dropdown_list = soup.select(".primary .child-pages a")


cityLinks = (rootURL + dropdown_list_value['href'] for dropdown_list_value in dropdown_list)

# params for our request
params = {"show_promotions": "true",
        "show_custom_fields": "true",
        "client_id": "6",
        "auth_token": "sswpREkUtyeYjeoahA2i",
        "min_bed": "-1",
        "max_bed": "100",
        "min_bath": "0",
        "max_bath": "10",
        "min_rate": "0",
        "max_rate": "4000",
        "keyword": "false",
        "property_types": "low-rise-apartment,mid-rise-apartment,high-rise-apartment,luxury-apartment,townhouse,house,multi-unit-house,single-family-home,duplex,tripex,semi",
        "order": "max_rate ASC, min_rate ASC, min_bed ASC, max_bath ASC",
        "limit": "50",
        "offset": "0",
        "count": "false"}

for city in cityLinks:  # Looping each city from the Apartment drop down list
    with requests.Session() as s:
        r= s.get(city)
        # we need to parse the city_id for our next request to work
        soup = BeautifulSoup(r.content, 'lxml')
        city_id = soup.select_one("div.hidden.search-data")["data-city-id"]
        # update params with the city id
        params["city_id"] = city_id
        js = s.get("http://api.theliftsystem.com/v2/search", params=params).json()
        pp(js)
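
The params above ask for at most 50 results starting at offset 0. If a city ever has more listings than that, paging would be needed. The sketch below assumes the API honors `limit` and `offset` in the usual way, which I have not verified; `iter_listings` and `fake_fetch` are hypothetical names, and the fake fetcher stands in for the real request so the example runs offline.

```python
def iter_listings(fetch_page, base_params, page_size=50):
    """Yield listings page by page until a short or empty page comes back.

    fetch_page is any callable that takes a params dict and returns a list.
    """
    offset = 0
    while True:
        # Copy the base params and set the paging window for this request.
        params = dict(base_params, limit=str(page_size), offset=str(offset))
        page = fetch_page(params)
        for item in page:
            yield item
        if len(page) < page_size:
            break
        offset += page_size


# Stand-in data and fetcher so the sketch runs without network access.
fake_data = [{"id": n} for n in range(120)]


def fake_fetch(params):
    start = int(params["offset"])
    return fake_data[start:start + int(params["limit"])]


results = list(iter_listings(fake_fetch, {"city_id": "332"}))
print(len(results))
```

With the real session, `fetch_page` could be something like `lambda p: s.get("http://api.theliftsystem.com/v2/search", params=p).json()`.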

That gives us data like the following:

[{u'address': {u'address': u'325 North Park Street',
               u'city': u'Brantford',
               u'city_id': 332,
               u'country': u'Canada',
               u'country_code': u'CAN',
               u'intersection': u'',
               u'neighbourhood': u'',
               u'postal_code': u'N3R 2X4',
               u'province': u'Ontario',
               u'province_code': u'ON'},
  u'availability_count': 6,
  u'availability_status': 1,
  u'availability_status_label': u'Available Now',
  u'building_header': u'',
  u'client': {u'email': u'bcadieux@homestead.ca',
              u'id': 6,
              u'name': u'Homestead Land Holdings',
              u'phone': u'613-546-3146',
              u'website': u'www.homestead.ca'},
  u'contact': {u'alt_extension': u'',
               u'alt_phone': u'',
               u'email': u'rentals@homestead.ca',
               u'extension': u'',
               u'fax': u'(519) 752-6855',
               u'name': u'',
               u'phone': u'519-752-3596'},
  u'details': {u'features': u'',
               u'location': u'',
               u'overview': u"Located on North Park Street and Memorial Avenue,this quiet building is within walking distance of the following: - Zehrs Plaza, North Park Plaza, Shoppers Drug Mart, Zehrs Grocery Store, Zellers, Pet Store, Party Supply Store, furniture store, variety store, Black's Photography, paint shop and veterinary clinic\xa0  - Restaurants and coffee shops\xa0  - Wayne Gretzky Recreational Arena\xa0  - Medical Clinic,Shoppers Home Health Care Clinic and Pharmacy\xa0  - Catholic Elementary School\xa0  - On bus route ",
               u'suite': u''},
  u'geocode': {u'distance': None,
               u'latitude': u'43.1703624',
               u'longitude': u'-80.2605725'},
  u'id': 309,
  u'matched_beds': [u'0', u'1', u'2'],
  u'matched_suite_names': [u'Bachelor', u'One Bedroom', u'Two Bedroom'],
  u'min_availability_date': u'',
  u'name': u'North Park Tower',
  u'office_hours': u'',
  u'parking': {u'additional': u'', u'indoor': u'', u'outdoor': u''},
  u'permalink': u'http://www.homestead.ca/apartments/325-north-park-street-brantford',
  u'pet_friendly': True,
  u'photo': u'1443018148_2.jpg',
  u'photo_path': u'http://s3.amazonaws.com/lws_lift/homestead/images/gallery/full/1443018148_2.jpg',
  u'promotion': {u'featured': 0},
  u'property_type': u'High-rise-apartment',
  u'statistics': {u'suites': {u'bathrooms': {u'average': 1.0,
                                             u'max': 1.0,
                                             u'min': 1.0},
                              u'bedrooms': {u'average': u'1.0',
                                            u'max': 2,
                                            u'min': 0},
                              u'rates': {u'average': 950.0,
                                         u'max': 1275.0,
                                         u'min': 625.0},
                              u'square_feet': {u'average': 0.0,
                                               u'max': u'0.0',
                                               u'min': u'0.0'}}},
  u'thumbnail_path': u'http://s3.amazonaws.com/lws_lift/homestead/images/gallery/256/1443018148_2.jpg',
  u'website': {u'description': u'', u'title': u'', u'url': u''}},
 {u'address': {u'address': u'661 West Street',
               u'city': u'Brantford',
               u'city_id': 332,
               u'country': u'Canada',
               u'country_code': u'CAN',
               u'intersection': u'',
               u'neighbourhood': u'',
               u'postal_code': u'N3R 6W9',
               u'province': u'Ontario',
               u'province_code': u'ON'},
  u'availability_count': 6,
  u'availability_status': 1,
  u'availability_status_label': u'Available Now',
  u'building_header': u'',
  u'client': {u'email': u'bcadieux@homestead.ca',
              u'id': 6,
              u'name': u'Homestead Land Holdings',
              u'phone': u'613-546-3146',
              u'website': u'www.homestead.ca'},
  u'contact': {u'alt_extension': u'',
               u'alt_phone': u'',
               u'email': u'rentals@homestead.ca',
               u'extension': u'',
               u'fax': u'(519) 751-0379',
               u'name': u'',
               u'phone': u'519-751-3867'},
  u'details': {u'features': u'',
               u'location': u'',
               u'overview': u'Located in the North end of Brantford, Westgate Tower is in an area that resembles a city within a city. There are a variety of banks, grocery stores, drug stores, malls, a wide selection of fast food, fine dining restaurants and an after hours medical centre, within waking distance.',
               u'suite': u''},
  u'geocode': {u'distance': None,
               u'latitude': u'43.1733242',
               u'longitude': u'-80.2482991'},
  u'id': 310,
  u'matched_beds': [u'0', u'1', u'2'],
  u'matched_suite_names': [u'Bachelor', u'One Bedroom', u'Two Bedroom'],
  u'min_availability_date': u'',
  u'name': u'Westgate Apartments',
  u'office_hours': u'',
  u'parking': {u'additional': u'', u'indoor': u'', u'outdoor': u''},
  u'permalink': u'http://www.homestead.ca/apartments/661-west-street-brantford',
  u'pet_friendly': True,
  u'photo': u'1443017488_1.jpg',
  u'photo_path': u'http://s3.amazonaws.com/lws_lift/homestead/images/gallery/full/1443017488_1.jpg',
  u'promotion': {u'featured': 0},
  u'property_type': u'High-rise-apartment',
  u'statistics': {u'suites': {u'bathrooms': {u'average': 1.0,
                                             u'max': 1.0,
                                             u'min': 1.0},
                              u'bedrooms': {u'average': u'1.0',
                                            u'max': 2,
                                            u'min': 0},
                              u'rates': {u'average': 975.0,
                                         u'max': 1300.0,
                                         u'min': 650.0},
                              u'square_feet': {u'average': 0.0,
                                               u'max': u'0.0',
                                               u'min': u'0.0'}}},
  u'thumbnail_path': u'http://s3.amazonaws.com/lws_lift/homestead/images/gallery/256/1443017488_1.jpg',
  u'website': {u'description': u'', u'title': u'', u'url': u''}},
 {u'address': {u'address': u'321 Fairview Drive',
               u'city': u'Brantford',
               u'city_id': 332,
               u'country': u'Canada',
               u'country_code': u'CAN',
               u'intersection': u'',
               u'neighbourhood': u'',
               u'postal_code': u'N3R 2X6',
               u'province': u'Ontario',
               u'province_code': u'ON'},
  u'availability_count': 8,
  u'availability_status': 1,
  u'availability_status_label': u'Available Now',
  u'building_header': u'',
  u'client': {u'email': u'bcadieux@homestead.ca',
              u'id': 6,
              u'name': u'Homestead Land Holdings',
              u'phone': u'613-546-3146',
              u'website': u'www.homestead.ca'},
  u'contact': {u'alt_extension': u'',
               u'alt_phone': u'',
               u'email': u'rentals@homestead.ca',
               u'extension': u'',
               u'fax': u'(519) 752-6855',
               u'name': u'',
               u'phone': u'519-752-3596'},
  u'details': {u'features': u'',
               u'location': u'',
               u'overview': u'Dornia Manor is a quiet, ninety-two unit apartment building located in the North end of Brantford. We offer one, two and three bedroom units and one penthouse suite. The building is located in close proximity to many major services such as banking, shopping, health services, recreational facilities, beauty shops, dry cleaners, schools and churches. There is a bus stop at the front door and highway 403 is within minutes.',
               u'suite': u''},
  u'geocode': {u'distance': None,
               u'latitude': u'43.1706331',
               u'longitude': u'-80.2584034'},
  u'id': 308,
  u'matched_beds': [u'1', u'2', u'3'],
  u'matched_suite_names': [u'One Bedroom', u'Two Bedroom', u'Three Bedroom'],
  u'min_availability_date': u'',
  u'name': u'Dornia Manor',
  u'office_hours': u'',
  u'parking': {u'additional': u'', u'indoor': u'', u'outdoor': u''},
  u'permalink': u'http://www.homestead.ca/apartments/321-fairview-drive-brantford',
  u'pet_friendly': True,
  u'photo': u'1443017947_1.jpg',
  u'photo_path': u'http://s3.amazonaws.com/lws_lift/homestead/images/gallery/full/1443017947_1.jpg',
  u'promotion': {u'featured': 0},
  u'property_type': u'High-rise-apartment',
  u'statistics': {u'suites': {u'bathrooms': {u'average': 1.375,
                                             u'max': 2.0,
                                             u'min': 1.0},
                              u'bedrooms': {u'average': u'2.25',
                                            u'max': 3,
                                            u'min': 1},
                              u'rates': {u'average': 1124.5,
                                         u'max': 1350.0,
                                         u'min': 899.0},
                              u'square_feet': {u'average': 0.0,
                                               u'max': u'0.0',
                                               u'min': u'0.0'}}},
  u'thumbnail_path': u'http://s3.amazonaws.com/lws_lift/homestead/images/gallery/256/1443017947_1.jpg',
  u'website': {u'description': u'', u'title': u'', u'url': u''}}]

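That gives you the urls, the bedrooms and everything else you could want. Each dict in the list is one listing; you just need to use the keys to access the data you want, for example: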

for dct in js:
    add = dct["address"]
    print(add["city"])
    print(add["postal_code"])
    print(add["province"])
    print(dct["permalink"])

Which gives you:

Brantford
N3R 2X4
Ontario
http://www.homestead.ca/apartments/325-north-park-street-brantford
Brantford
N3R 6W9
Ontario
http://www.homestead.ca/apartments/661-west-street-brantford
Brantford
N3R 2X6
Ontario
http://www.homestead.ca/apartments/321-fairview-drive-brantford
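
Since the original goal was the apartment name and address, it may be handy to flatten each listing into one row and write it out as CSV. This is a minimal sketch using only the standard library (Python 3); `listing_to_row` and the column names are my own choices, and the `js` list here is a trimmed copy of the first listing above so the example runs on its own.

```python
import csv
import io

FIELDS = ["name", "street", "city", "postal_code", "permalink"]


def listing_to_row(dct):
    """Flatten one listing dict into a flat row, tolerating missing keys."""
    address = dct.get("address", {})
    return {
        "name": dct.get("name", ""),
        "street": address.get("address", ""),
        "city": address.get("city", ""),
        "postal_code": address.get("postal_code", ""),
        "permalink": dct.get("permalink", ""),
    }


# Trimmed copy of the first listing from the output above.
js = [{
    "name": "North Park Tower",
    "address": {"address": "325 North Park Street",
                "city": "Brantford",
                "postal_code": "N3R 2X4"},
    "permalink": "http://www.homestead.ca/apartments/325-north-park-street-brantford",
}]

# Write to an in-memory buffer; swap in open("listings.csv", "w", newline="")
# to write a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
for dct in js:
    writer.writerow(listing_to_row(dct))

csv_text = buffer.getvalue()
print(csv_text)
```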

The contact info is under dct["contact"] and the statistics are under dct["statistics"]:

for dct in js:
        contact = dct["contact"]
        print(contact)
        stats = dct["statistics"]
        print(stats["suites"])

Which gives you:

{u'alt_phone': u'', u'fax': u'(519) 752-6855', u'name': u'', u'alt_extension': u'', u'phone': u'519-752-3596', u'extension': u'', u'email': u'rentals@homestead.ca'}
{u'rates': {u'max': 1275.0, u'average': 950.0, u'min': 625.0}, u'bedrooms': {u'max': 2, u'average': u'1.0', u'min': 0}, u'bathrooms': {u'max': 1.0, u'average': 1.0, u'min': 1.0}, u'square_feet': {u'max': u'0.0', u'average': 0.0, u'min': u'0.0'}}
{u'alt_phone': u'', u'fax': u'(519) 751-0379', u'name': u'', u'alt_extension': u'', u'phone': u'519-751-3867', u'extension': u'', u'email': u'rentals@homestead.ca'}
{u'rates': {u'max': 1300.0, u'average': 975.0, u'min': 650.0}, u'bedrooms': {u'max': 2, u'average': u'1.0', u'min': 0}, u'bathrooms': {u'max': 1.0, u'average': 1.0, u'min': 1.0}, u'square_feet': {u'max': u'0.0', u'average': 0.0, u'min': u'0.0'}}
{u'alt_phone': u'', u'fax': u'(519) 752-6855', u'name': u'', u'alt_extension': u'', u'phone': u'519-752-3596', u'extension': u'', u'email': u'rentals@homestead.ca'}
{u'rates': {u'max': 1350.0, u'average': 1124.5, u'min': 899.0}, u'bedrooms': {u'max': 3, u'average': u'2.25', u'min': 1}, u'bathrooms': {u'max': 2.0, u'average': 1.375, u'min': 1.0}, u'square_feet': {u'max': u'0.0', u'average': 0.0, u'min': u'0.0'}}
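
Note that the statistics mix types: some values are floats (`950.0`) and some are strings (`u'1.0'`, `u'0.0'`). If you want to compare or sort on them, it is safer to coerce first. A minimal sketch, where `to_float` and `rent_range` are hypothetical helper names and `sample` is a trimmed copy of the first listing's statistics:

```python
def to_float(value, default=None):
    """Coerce a number or numeric string to float, else return default."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return default


def rent_range(dct):
    """Return (min_rate, max_rate) for one listing, or None where missing."""
    rates = dct.get("statistics", {}).get("suites", {}).get("rates", {})
    return to_float(rates.get("min")), to_float(rates.get("max"))


# Trimmed copy of the first listing's statistics from the output above.
sample = {"statistics": {"suites": {
    "rates": {"average": 950.0, "max": 1275.0, "min": 625.0},
    "bedrooms": {"average": "1.0", "max": 2, "min": 0},
}}}

low, high = rent_range(sample)
print(low, high)
print(to_float(sample["statistics"]["suites"]["bedrooms"]["average"]))
```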

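You can put all of this together to get whatever you need. You can also tweak the params; if you inspect the request with Chrome's developer tools or Firebug, you will see there are actually more of them.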
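
As a rough sketch of tweaking the params: copy the base dict, override a few keys, and print the resulting URL to compare it with what the browser sends. I am assuming here that `min_bed` and `max_rate` filter the results the way their names suggest, which I have not confirmed; `with_overrides` is a hypothetical helper, and `base_params` is a shortened subset of the dict above.

```python
from urllib.parse import urlencode

# Shortened subset of the params dict shown earlier, for brevity.
base_params = {
    "client_id": "6",
    "min_bed": "-1",
    "max_bed": "100",
    "min_rate": "0",
    "max_rate": "4000",
    "limit": "50",
    "offset": "0",
}


def with_overrides(params, **overrides):
    """Return a copy of params with the given keys replaced (as strings)."""
    merged = dict(params)
    merged.update({key: str(value) for key, value in overrides.items()})
    return merged


two_bed = with_overrides(base_params, min_bed=2, max_rate=1500)
query = urlencode(two_bed)
print("http://api.theliftsystem.com/v2/search?" + query)
```

Working on a copy keeps the shared `params` dict unchanged between cities.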