Question

我正在尝试从booking.com抓取数据，现在几乎所有东西都在运行，但我无法得到价格，我到目前为止读到这是因为这些价格是通过AJAX调用加载的。这是我的代码：

import requests
import re

from bs4 import BeautifulSoup

url = "http://www.booking.com/searchresults.pl.html"
payload = {
'ss':'Warszawa', 
'si':'ai,co,ci,re,di',
'dest_type':'city',
'dest_id':'-534433',
'checkin_monthday':'25',
'checkin_year_month':'2015-10',
'checkout_monthday':'26',
'checkout_year_month':'2015-10',
'sb_travel_purpose':'leisure',
'src':'index',
'nflt':'',
'ss_raw':'',
'dcid':'4'
}

r = requests.post(url, payload)
html = r.content
parsed_html = BeautifulSoup(html, "html.parser")

print parsed_html.head.find('title').text

tables = parsed_html.find_all("table", {"class" : "sr_item_legacy"})

print "Found %s records." % len(tables)

with open("requests_results.html", "w") as f:
    f.write(r.content)

for table in tables:
    name = table.find("a", {"class" : "hotel_name_link url"})
    average = table.find("span", {"class" : "average"})
    price = table.find("strong", {"class" : re.compile(r".*\bprice scarcity_color\b.*")})
    print name.text + " " + average.text + " " + price.text

使用Chrome中的Developers Tools我注意到该网页会发送包含所有数据（包括价格）的原始回复。在处理其中一个标签的响应内容之后，有价格的原始值，为什么我无法使用我的脚本检索它们，如何解决？

Answer 1

第一个问题是网站格式不正确：在您的表格中打开了一个div，并关闭了em。因此html.parser无法找到包含价格的strong标记。您可以通过安装和使用lxml：

来解决此问题

parsed_html = BeautifulSoup(html, "lxml")

第二个问题出在你的正则表达式中。它找不到任何东西。将其更改为以下内容：

price = table.find("strong", {"class" : re.compile(r".*\bscarcity_color\b.*")})

现在你会找到价格。但是，某些条目不包含任何价格，因此您的print语句将引发错误。要解决此问题，您可以将print更改为以下内容：

print name.text, average.text, price.text if price else 'No price found'

请注意，您可以在Python中使用逗号（,）分隔要打印的字段，这样您就不需要将它们与+ " " +连接起来。

使用Python来针对AJAX请求刮取booking.com

1 个答案: