CSS selectors in Python BeautifulSoup

Date: 2017-03-28 13:35:46

Tags: python web-scraping beautifulsoup

I have built a very simple scraper that looks at Airbnb listings. The goal is to go through a given page (namely this one).

first_page = BeautifulSoup(requests.get("https://www.airbnb.com/s/Copenhagen--Denmark/homes?allow_override%5B%5D=&s_tag=kHqeQTpz&section_offset=1").text, 'html.parser')
listings = first_page.find_all('div', 'listing-card-wrapper')
for listing in listings:
    print(listing.select("#listing-15616363 > div.infoContainer_v72lrv > a > div.ellipsized_1iurgbx > div > span:nth-child(1) > span:nth-child(1)"))

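The code correctly iterates over the 18 elements on the page. However, it prints 18 empty lists, which indicates that the listing.select call does not match anything. I got the CSS selector from the "Copy selector" feature in Chrome DevTools.

2 answers:

Answer 0 (score: 3)

This is because listing-15616363 is specific to a single listing (note the format listing-{listing_id}), so the listings in your loop do not contain an element with id = 'listing-15616363'.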

For example, if you want to get the URL, you can do this:

listing.find('a', class_ = "linkContainer_55zci1")['href']
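
To make the cause concrete, here is a minimal sketch on hypothetical inline markup (the HTML, the ids listing-111 and listing-222, and the titles are assumptions made for illustration, not Airbnb's real page). It compares a selector pinned to one id with a selector written relative to each card:

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only mimics the structure described in the question.
SAMPLE_HTML = """
<div class="listing-card-wrapper">
  <div id="listing-111"><a class="linkContainer_55zci1" href="/rooms/111">
    <div class="ellipsized_1iurgbx"><span>First flat</span></div></a></div>
</div>
<div class="listing-card-wrapper">
  <div id="listing-222"><a class="linkContainer_55zci1" href="/rooms/222">
    <div class="ellipsized_1iurgbx"><span>Second flat</span></div></a></div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
listings = soup.find_all("div", "listing-card-wrapper")

# A selector pinned to one id can match inside at most one card.
pinned = [len(card.select("#listing-111 span")) for card in listings]

# A selector written relative to the card, without the id, matches in every card.
titles = [
    card.select_one("div.ellipsized_1iurgbx > span").get_text(strip=True)
    for card in listings
]

print(pinned)  # [1, 0]
print(titles)  # ['First flat', 'Second flat']
```

Generated class names such as ellipsized_1iurgbx may change when the site is redeployed, so a selector like this probably needs to be rechecked against the live page from time to time.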

Alternatively, you can use Python's lxml, which (if used properly) can be an order of magnitude faster than BeautifulSoup, as follows:

import requests
from lxml import html

url = "https://www.airbnb.com/s/Copenhagen--Denmark/homes?allow_override%5B%5D=&s_tag=kHqeQTpz&section_offset=1"

response = requests.get(url)
root = html.fromstring(response.content)
result_list = []

def remove_non_ascii(text) :
    return ''.join([i if ord(i) < 128 else '' for i in text])

currency = root.xpath('//div[@itemprop="offers"]/meta[@itemprop="priceCurrency"]/@content')[0].strip()

for row in root.xpath('//div[contains(@class, "listing-card-wrapper")]') : 
    if row :
        url = row.xpath('.//a[@class="linkContainer_55zci1"]/@href')[0].strip()
        title = row.xpath('.//div[@class="ellipsized_1iurgbx"]/span/text()')[0].strip()
        price = remove_non_ascii(row.xpath('.//div[@class="inline_g86r3e"]/span//text()')[0].strip())

        result_list.append({'url' : "https://www.airbnb.com" + url, 
            'title' : title, 'price' : price, 'currency' : currency})

print(result_list)

This results in:

[{'url': 'https://www.airbnb.com/rooms/5316912', 'currency': 'INR', 'price': u' 3,823', 'title': 'Small City  apt. next to the Metro'}, {'url': 'https://www.airbnb.com/rooms/16989400', 'currency': 'INR', 'price': u' 2,347', 'title': 'Cozy room close to city center'}, {'url': 'https://www.airbnb.com/rooms/17628374', 'currency': 'INR', 'price': u' 6,774', 'title': 'Cosy, quiet apartment in downtown Copenhagen'}, {'url': 'https://www.airbnb.com/rooms/1206721', 'currency': 'INR', 'price': u' 4,426', 'title': 'Apt.close to Metro, Airport and CHP'}, {'url': 'https://www.airbnb.com/rooms/13813273', 'currency': 'INR', 'price': u' 3,622', 'title': 'Large room in Vesterbro'}, {'url': 'https://www.airbnb.com/rooms/14083881', 'currency': 'INR', 'price': u' 9,322', 'title': 'City Room'}, {'url': 'https://www.airbnb.com/rooms/6221130', 'currency': 'INR', 'price': u' 5,365', 'title': 'cosy flat 2 min from Central Statio'}, {'url': 'https://www.airbnb.com/rooms/15804159', 'currency': 'INR', 'price': u' 3,823', 'title': 'Cozy, central near waterfront. Quality breakfast!'}, {'url': 'https://www.airbnb.com/rooms/17266268', 'currency': 'INR', 'price': u' 3,756', 'title': 'Cosy room in Frederiksberg'}, {'url': 'https://www.airbnb.com/rooms/2647233', 'currency': 'INR', 'price': u' 3,353', 'title': 'Bedroom & Living Room Frederiksberg'}, {'url': 'https://www.airbnb.com/rooms/12083235', 'currency': 'INR', 'price': u' 5,969', 'title': 'Wonderful Copenhagen is right here'}, {'url': 'https://www.airbnb.com/rooms/7787976', 'currency': 'INR', 'price': u' 7,042', 'title': 'Homely renovated flat with garden'}, {'url': 'https://www.airbnb.com/rooms/17556785', 'currency': 'INR', 'price': u' 1,610', 'title': u'Small Cosy home above our Caf\xe9 ( Breakfast incl )'}, {'url': 'https://www.airbnb.com/rooms/894420', 'currency': 'INR', 'price': u' 10,261', 'title': 'Wonderful apt. right in the city!'}, {'url': 'https://www.airbnb.com/rooms/17028460', 'currency': 'INR', 'price': u' 7,847', 'title': 'Nyhavn 3-bed apartment for families'}, {'url': 'https://www.airbnb.com/rooms/17651114', 'currency': 'INR', 'price': u' 6,371', 'title': 'Spacious place by canals in heart of Copenhagen'}, {'url': 'https://www.airbnb.com/rooms/10564051', 'currency': 'INR', 'price': u' 3,420', 'title': u'\u623f\u95f4\u5728\u54e5\u672c\u54c8\u6839\u7684\u5fc3\u810f'}, {'url': 'https://www.airbnb.com/rooms/17709435', 'currency': 'INR', 'price': u' 2,951', 'title': u'Hyggelig lejlighed t\xe6t p\xe5 centrum.'}]

You can also refer to the documentation on scraping and lxml to learn more.
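
One caveat with the XPath approach: indexing a result with [0] raises IndexError whenever a card lacks the element. The sketch below shows one possible way to guard against that. The helper name first_or_default and the inline markup are hypothetical and exist only for illustration; only the class names are taken from the answer.

```python
from lxml import html

# Hypothetical markup: the second card has no title element on purpose.
SAMPLE_HTML = """
<div>
  <div class="listing-card-wrapper">
    <a class="linkContainer_55zci1" href="/rooms/111"></a>
    <div class="ellipsized_1iurgbx"><span>First flat</span></div>
  </div>
  <div class="listing-card-wrapper">
    <a class="linkContainer_55zci1" href="/rooms/222"></a>
  </div>
</div>
"""


def first_or_default(node, xpath_expr, default=None):
    """Return the first XPath string result, stripped, or default if none matched."""
    matches = node.xpath(xpath_expr)
    return matches[0].strip() if matches else default


root = html.fromstring(SAMPLE_HTML)
rows = []
for card in root.xpath('//div[contains(@class, "listing-card-wrapper")]'):
    rows.append({
        "url": first_or_default(card, './/a[@class="linkContainer_55zci1"]/@href'),
        "title": first_or_default(card, './/div[@class="ellipsized_1iurgbx"]/span/text()'),
    })

print(rows)
```

With this helper a missing field shows up as None in the result instead of aborting the whole loop.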

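Answer 1 (score: 1)

When web scraping, try to use XPath or specific element attributes rather than copied CSS selectors, because such selectors are often too specific to a single element.

Instead of a CSS selector, I used the itemprop attribute to get what you want, as in the following code:

Code: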

from bs4 import BeautifulSoup
import requests

html_source = requests.get("https://www.airbnb.com/s/Copenhagen--Denmark/homes?allow_override%5B%5D=&s_tag=kHqeQTpz&section_offset=1").text
first_page = BeautifulSoup(html_source, 'html.parser')

listings = first_page.find_all('div', {'itemprop':'itemListElement'})

for l in listings:
    a = l.find_next('meta')
    b = a.find_next('meta')
    c = b.find_next('meta')

    print("Name: ", a['content'])
    print("Position: ", b['content'])
    print("URL: ", c['content'])

    print("-"*15)    

Output:

Name:  Small City  apt. next to the Metro - Apartment - København
Position:  1
URL:  www.airbnb.com/rooms/5316912
---------------
Name:  Cozy room close to city center - Apartment - Frederiksberg
Position:  2
URL:  www.airbnb.com/rooms/16989400
---------------
Name:  Cosy, quiet apartment in downtown Copenhagen - Apartment - København
Position:  3
URL:  www.airbnb.com/rooms/17628374
---------------
Name:  Apt.close to Metro, Airport and CHP - Apartment - Copenhagen
Position:  4
URL:  www.airbnb.com/rooms/1206721
---------------
Name:  Large room in Vesterbro - Apartment - København
Position:  5
URL:  www.airbnb.com/rooms/13813273
---------------
Name:  City Room - Apartment - København
Position:  6
URL:  www.airbnb.com/rooms/14083881
---------------
Name:  cosy flat 2 min from Central Statio - Apartment - København V
Position:  7
URL:  www.airbnb.com/rooms/6221130
---------------
Name:  Cozy, central near waterfront. Quality breakfast! - Apartment - København
Position:  8
URL:  www.airbnb.com/rooms/15804159
---------------
Name:  Cosy room in Frederiksberg - Apartment - Frederiksberg
Position:  9
URL:  www.airbnb.com/rooms/17266268
---------------
Name:  Bedroom & Living Room Frederiksberg - Apartment - Frederiksberg
Position:  10
URL:  www.airbnb.com/rooms/2647233
---------------
Name:  Wonderful Copenhagen is right here - Apartment - København
Position:  11
URL:  www.airbnb.com/rooms/12083235
---------------
Name:  Homely renovated flat with garden - Apartment - Frederiksberg
Position:  12
URL:  www.airbnb.com/rooms/7787976
---------------
Name:  Small Cosy home above our Café ( Breakfast incl ) - Bed & Breakfast - København
Position:  13
URL:  www.airbnb.com/rooms/17556785
---------------
Name:  Wonderful apt. right in the city! - Apartment - Copenhagen
Position:  14
URL:  www.airbnb.com/rooms/894420
---------------
Name:  Nyhavn 3-bed apartment for families - Apartment - Copenhagen
Position:  15
URL:  www.airbnb.com/rooms/17028460
---------------
Name:  Spacious place by canals in heart of Copenhagen - Apartment - København
Position:  16
URL:  www.airbnb.com/rooms/17651114
---------------
Name:  房间在哥本哈根的心脏 - Apartment - København
Position:  17
URL:  www.airbnb.com/rooms/10564051
---------------
Name:  Hyggelig lejlighed tæt på centrum. - Apartment - København
Position:  18
URL:  www.airbnb.com/rooms/17709435
---------------
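
The find_next chain in this answer relies on the three meta tags appearing in a fixed order. As a sketch of an order-independent variant, the code below looks up each meta tag by its own itemprop value. It assumes the tags carry itemprop="name", "position" and "url", which the original answer does not confirm; the inline markup and the helper name read_card are likewise hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the second card lists its meta tags in a different order.
SAMPLE_HTML = """
<div itemprop="itemListElement">
  <meta itemprop="name" content="First flat - Apartment - Copenhagen">
  <meta itemprop="position" content="1">
  <meta itemprop="url" content="www.airbnb.com/rooms/111">
</div>
<div itemprop="itemListElement">
  <meta itemprop="position" content="2">
  <meta itemprop="name" content="Second flat - Apartment - Copenhagen">
  <meta itemprop="url" content="www.airbnb.com/rooms/222">
</div>
"""


def read_card(card):
    """Collect the content of each expected meta tag, or None if it is absent."""
    fields = {}
    for key in ("name", "position", "url"):
        tag = card.find("meta", attrs={"itemprop": key})
        fields[key] = tag["content"] if tag is not None else None
    return fields


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
cards = soup.find_all("div", attrs={"itemprop": "itemListElement"})
records = [read_card(card) for card in cards]

for record in records:
    print(record)
```

Looking up each field by attribute value keeps the result correct even when the page reorders or omits one of the tags.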