BeautifulSoup web scraping: how do I get information that is loaded by JavaScript?

Time: 2018-09-07 21:00:39

Tags: html python-3.x beautifulsoup

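I am trying to scrape a specific page from the Choice Hotels website (https://www.choicehotels.com/tennessee/nashville/hotels) to build a list of all Choice Hotels in Nashville, Tennessee. When I open the page with the browser's developer tools, I can see the information I am looking for under <div class="list">, but when I scrape the site I cannot find this tag. I cannot get any deeper than <div class="animate-fade z-index-90">; any tag nested deeper than that returns None. I do see a lot of JavaScript, though. I believe this is because requests does not see what I see when I open the page in a browser. How can I make my program see the tags that I see?

Here is how I tried to scrape it: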

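import requests
from bs4 import BeautifulSoup
import csv

# Fetch the raw HTML as the server sends it (no JavaScript is executed).
source = requests.get("https://www.choicehotels.com/tennessee/nashville/hotels").text
soup = BeautifulSoup(source, 'lxml')
hotel_list = soup.find('div', class_='list')  # avoid shadowing the built-in list
print(hotel_list)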

Is there something I am missing or doing wrong?

2 Answers:

Answer 0 (score: 2)

You can send a POST request directly to the endpoint that the page's JavaScript calls. It returns a JSON object that you can parse like any other JSON.

import requests

data = {
    'adults': '1',
    'checkInDate': '2018-09-08',
    'checkOutDate': '2018-09-09',
    'hotelSortOrder': 'RELEVANCE',
    'include': 'amenity_groups, amenity_totals, rating, relative_media',
    'lat': '36.167839',
    'lon': '-86.77816',
    'minors': '0',
    'optimizeResponse': 'image_url',
    'placeId': '414666',
    'placeName': 'Nashville, TN, US',
    'placeType': 'City',
    'platformType': 'DESKTOP',
    'preferredLocaleCode': 'en-us',
    'ratePlanCode': 'RACK',
    'ratePlans': 'RACK,PREPD,PROMO,FENCD',
    'rateType': 'LOW_ALL',
    'searchRadius': '25',
    'siteOpRelevanceSortMethod': 'ALGORITHM_B',
}

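# Call the JSON endpoint that the page's JavaScript uses to load the hotel list.
r = requests.post('https://www.choicehotels.com/webapi/location/hotels', data=data)

# Each entry in 'hotels' is a dict; print its name and description.
for h in r.json()['hotels']:
    print(h['name'])
    print(h['description'])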

Output:

Comfort Inn Downtown Nashville-Vanderbilt
Get rested and ready for anything when you stay at the Comfort Inn Downtown Nashville-Vanderbilt hotel in Nashville, TN. We are merely minutes from the Nashville International Airport and conveniently located near Vanderbilt University and the Nashville Convention Center. Each comfortable room is furnished with a flat-screen TV, hair dryer, coffee maker, microwave and more. We also offer free WiFi, a fitness center and outdoor pool. Get going with a free hot breakfast including eggs, waffles and meat plus healthy options like yogurt and fresh fruit. Also, earn rewards including free nights and gift cards with our Choice Privileges Rewards program. 
Comfort Suites Airport
Get more of the space you need to spread out, relax or work at the smoke-free Comfort Suites Airport hotel in Nashville, TN, located near the Grand Ole Opry. Nearby attractions include Opry Mills, Ryman Auditorium, Music City Bowl and Music City Center. Nashville Convention Center, Sommet Center, BridgestoneFirestone and Antique Archaeology are also close. Enjoy free hot breakfast, free WiFi, free airport transportation, fitness center and a seasonal outdoor pool. Your spacious room includes a flat-screen TV, hair dryer, sofa sleeper, microwave and refrigerator. Also, earn rewards including free nights and gift cards with our Choice Privileges Rewards program. 
Clarion Hotel Nashville Downtown - Stadium
Get more value at the 100 percent smoke-free Clarion Hotel Nashville Downtown-Stadium in Nashville, TN. We are near Nissan Stadium, Country Music Hall of Fame, Ryman Auditorium, Vanderbilt University and Bridgestone Arena. Life is better when you get together--enjoy such amenities as free WiFi, ample free parking, free breakfast, free downtown shuttle, business and fitness centers and restaurant. Your guest room features a refrigerator, microwave, coffee maker, hair dryer, iron and ironing board. Also, earn rewards including free nights and gift cards with our Choice Privileges Rewards program.  CC required at check-in. Shuttle runs from 8 am-9 pm on the hour. 
The Capitol Hotel Downtown, an Ascend Hotel Collection Member
Let the destination reach you at The Capitol Hotel Downtown, an Ascend Hotel Collection Member in Nashville, TN. Our smoke-free, upscale property is conveniently located near many key performing arts and sports facilities for which this iconic city is known. All guestrooms include coffee makers, hair dryers, irons and ironing boards, desks, safes, refrigerators and more. Enjoy free breakfast, free WiFi, a fitness center and business center. Then, relax in our bar and bistro at the end of your day. Also, earn rewards including free nights and gift cards with our Choice Privileges Rewards program. 
Sleep Inn
The Sleep Inn hotel in Nashville, TN will give you a simply stylish experience. Were close to attractions like the the Grand Ole Opry, Nashville Convention Center, Opry Mills and the Sommet Center. Enjoy free breakfast, free WiFi, free weekday newspaper, a seasonal outdoor pool and guest laundry facilities. Your guest room offers warm, modern designs, and includes a flat-screen TV in addition to standard room amenities. Some rooms have microwaves, refrigerators, coffee makers, irons and ironing boards. Also, earn rewards including free nights and gift cards with our Choice Privileges Rewards program. 
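
Two small follow-ups may be useful here. The form data above hard-codes 'checkInDate' and 'checkOutDate', and the question imports csv, which suggests the list is meant to end up in a file. The sketch below is only an illustration: the helper names with_stay_dates and hotels_to_csv are hypothetical, and it assumes the decoded response has the same 'hotels', 'name' and 'description' keys used in the loop above. Whether the endpoint still accepts these parameters is not something this sketch can confirm.

```python
import csv
import io
from datetime import date, timedelta


def with_stay_dates(data, check_in, nights=1):
    """Return a copy of the form data with the stay dates replaced.

    Hypothetical helper: only 'checkInDate' and 'checkOutDate' are touched.
    """
    updated = dict(data)
    updated["checkInDate"] = check_in.isoformat()
    updated["checkOutDate"] = (check_in + timedelta(days=nights)).isoformat()
    return updated


def hotels_to_csv(payload):
    """Render the 'hotels' list of a decoded JSON response as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "description"])
    for hotel in payload.get("hotels", []):
        # .get() keeps the sketch from failing if a field is absent.
        writer.writerow([hotel.get("name", ""), hotel.get("description", "")])
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up sample shaped like the response above; not real API output.
    sample = {"hotels": [{"name": "Sleep Inn", "description": "Example text"}]}
    print(with_stay_dates({"adults": "1"}, date.today()))
    print(hotels_to_csv(sample))
```

With the code from this answer, that would be used as requests.post(url, data=with_stay_dates(data, date.today())) followed by hotels_to_csv(r.json()).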

Answer 1 (score: 1)

You have to handle the JavaScript, and you can use Selenium to do that. See the code below.

from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.choicehotels.com/tennessee/nashville/hotels")
# Wait up to 10 seconds for a JavaScript-rendered address element to become visible.
wait(driver, 10).until(EC.visibility_of_element_located(
        (By.XPATH, '//*[@class="address"]')))
source = driver.page_source  # HTML after the JavaScript has run
soup = BeautifulSoup(source, 'lxml')
hotel_list = soup.find('div', class_='list')  # avoid shadowing the built-in list
print(hotel_list)
driver.quit()  # end the browser session
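
To tell whether a browser is needed at all, one quick check is whether the target class appears in the raw HTML from requests.get(...).text as opposed to driver.page_source. The sketch below is an assumption-laden illustration, not part of either answer: ClassFinder and has_class are hypothetical names, and it uses the standard-library html.parser instead of BeautifulSoup so it runs without extra installs. The two HTML strings are made-up stand-ins for the raw and the rendered page.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Record whether any start tag carries the given CSS class."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.found = False

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # A class attribute may hold several space-separated names.
            if name == "class" and value and self.class_name in value.split():
                self.found = True


def has_class(html, class_name):
    """Return True if some element in the markup has the class."""
    finder = ClassFinder(class_name)
    finder.feed(html)
    finder.close()
    return finder.found


if __name__ == "__main__":
    # Made-up stand-ins: the raw page only contains a script that would
    # build the list, while the rendered page contains the list itself.
    raw_html = (
        '<div class="animate-fade z-index-90"></div>'
        "<script>render('<div class=\"list\"></div>');</script>"
    )
    rendered_html = (
        '<div class="animate-fade z-index-90">'
        '<div class="list"><span class="address">Example St</span></div>'
        "</div>"
    )
    print(has_class(raw_html, "list"))
    print(has_class(rendered_html, "list"))
```

If the class is missing from the raw HTML but present after rendering, the content is being added by JavaScript, and either the JSON endpoint or a browser-driven approach is needed.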