Scraping "Things to Do" from TripAdvisor with Python

Date: 2018-11-23 20:58:46

Tags: python web-scraping beautifulsoup tripadvisor

On this page, I want to scrape the "Types of Things to Do in Miami" list (you can find it toward the bottom of the page). This is what I have so far:

import requests
from bs4 import BeautifulSoup

# A User-Agent header helps avoid blocked requests
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36"

headers = {'User-Agent': user_agent}

new_url = "https://www.tripadvisor.com/Attractions-g34438-Activities-Miami_Florida.html"
# Get response from url
response = requests.get(new_url, headers=headers)
# Soupify response (BeautifulSoup handles decoding on its own)
soup = BeautifulSoup(response.text, "lxml")

tag_elements = soup.find_all("a", {"class": "attractions-attraction-overview-main-Pill__pill--23S2Q"})

# Iterate over tag_elements and extract strings
tags_list = []
for i in tag_elements:
    tags_list.append(i.string)

The problem is that I'm getting values like 'Good for Couples (201)', 'Good for Big Groups (130)', 'Good for Kids (100)', which come from the "Commonly Searched For in Miami" area of the page, below the "Types of Things to Do..." section. I'm also missing some of the values I need, such as "Traveler Resources (7)", "Day Trips (7)", etc. The class names in both lists ("Things to do..." and "Commonly searched...") are identical, and I suspect that's why using the class in soup.find_all() goes wrong. What is the correct way to do this? Is there some other approach I should take?
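To see why matching on the class alone is ambiguous, here is a minimal reproduction with invented, simplified HTML (the real page's markup differs); anchoring on the section header first and then searching only within that section sidesteps the duplicate class names:

```python
from bs4 import BeautifulSoup

# Invented, simplified HTML for illustration only: two sections reuse the
# same pill class, mirroring the ambiguity on the real page.
html = """
<div class="section"><div class="title">Types of Things to Do in Miami</div>
<a class="attractions-attraction-overview-main-Pill__pill--23S2Q">Tours (277)</a>
<a class="attractions-attraction-overview-main-Pill__pill--23S2Q">Day Trips (7)</a>
</div>
<div class="section"><div class="title">Commonly Searched For in Miami</div>
<a class="attractions-attraction-overview-main-Pill__pill--23S2Q">Good for Couples (201)</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Matching on the class alone returns pills from both sections
all_pills = soup.find_all("a", class_="attractions-attraction-overview-main-Pill__pill--23S2Q")
print(len(all_pills))  # 3

# Anchoring on the section header restricts the search to one section
header = soup.find("div", string="Types of Things to Do in Miami")
section = header.find_parent("div", class_="section")
wanted = [a.string for a in section.find_all("a")]
print(wanted)  # ['Tours (277)', 'Day Trips (7)']
```

The same idea carries over to the live page, provided you can find a stable header or container to anchor on.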

4 Answers:

Answer 0 (score: 3):

This is trivial in the browser; assuming `driver` is a Selenium WebDriver with the page already loaded, you can pull the labels straight out with JavaScript:

filters = driver.execute_script("return [...document.querySelectorAll('.filterName a')].map(a => a.innerText)")

Answer 1 (score: 2):

I think you need to be able to click "show more" to see all of the available content, so use something like Selenium. This includes waits to ensure all the elements are present and that the dropdown is clickable.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

d = webdriver.Chrome()
d.get("https://www.tripadvisor.com/Attractions-g34438-Activities-Miami_Florida.html")
WebDriverWait(d,5).until(EC.visibility_of_element_located((By.CSS_SELECTOR,".filter_list_0 div a")))
WebDriverWait(d, 5).until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#taplc_attraction_filters_clarity_0 span.ui_icon.caret-down"))).click()
tag_elements = WebDriverWait(d,5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".filter_list_0 div a")))
tags_list = [i.text for i in tag_elements]
print(tags_list)
d.quit()



Without Selenium, I can only get 15 items:

import requests
from bs4 import BeautifulSoup

user_agent = "Mozilla/44.0.2 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/9.0.2"
headers = {'User-Agent': user_agent}
new_url = "https://www.tripadvisor.com/Attractions-g34438-Activities-Miami_Florida.html"
response = requests.get(new_url, headers=headers)
soup = BeautifulSoup(response.content, "lxml")
tag_elements = soup.select('#component_3 > div > div > div:nth-of-type(12) > div:nth-of-type(1) > div > div a')

tags_list = [i.text for i in tag_elements]       
print(tags_list)

Answer 2 (score: 2):

It looks like you need to use Selenium. The problem is that the dropdown doesn't show the rest of the options until after you click it.


Answer 3 (score: 1):

Getting only the content under the Types of Things to Do in Miami header is a bit tricky. To do that you need to define the selectors in an organized way, as I've done below. The following script should click the See all button under that header; once the click fires, the script parses the relevant content you're looking for:

from selenium import webdriver
from selenium.webdriver.support import ui
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
wait = ui.WebDriverWait(driver, 10)
driver.get("https://www.tripadvisor.com/Attractions-g34438-Activities-Miami_Florida.html")

show_more = wait.until(lambda driver: driver.find_element_by_css_selector("[class='ui_container'] div:nth-of-type(1) .caret-down"))
driver.execute_script("arguments[0].click();",show_more)
soup = BeautifulSoup(driver.page_source,"lxml")
items = [item.text for item in soup.select("[class='ui_container'] div:nth-of-type(1) a[href^='/Attractions-']")]
print(items)
driver.quit()

The output it produces:

['Tours (277)', 'Outdoor Activities (255)', 'Boat Tours & Water Sports (184)', 'Shopping (126)', 'Nightlife (126)', 'Spas & Wellness (109)', 'Fun & Games (67)', 'Transportation (66)', 'Museums (61)', 'Sights & Landmarks (54)', 'Nature & Parks (54)', 'Food & Drink (27)', 'Concerts & Shows (25)', 'Classes & Workshops (22)', 'Zoos & Aquariums (7)', 'Traveler Resources (7)', 'Day Trips (7)', 'Water & Amusement Parks (5)', 'Casinos & Gambling (3)', 'Events (2)']
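If the category names and counts are needed separately, the strings in a list like the one above can be split with a small regular expression. This post-processing step is not part of any of the answers; `split_item` is a hypothetical helper:

```python
import re

def split_item(item):
    # "Tours (277)" -> ("Tours", 277); items without a count keep count None
    m = re.match(r"^(.*?)\s*\((\d+)\)$", item)
    if m:
        return m.group(1), int(m.group(2))
    return item, None

items = ['Tours (277)', 'Outdoor Activities (255)', 'Day Trips (7)']
parsed = [split_item(i) for i in items]
print(parsed)  # [('Tours', 277), ('Outdoor Activities', 255), ('Day Trips', 7)]
```

Tuples like these are easier to sort by count or load into a DataFrame than the raw label strings.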