Unable to get the actual markup from a page with BeautifulSoup

Date: 2014-11-25 19:01:16

Tags: python selenium python-3.x web-scraping beautifulsoup

I am trying to scrape this URL using a combination of BeautifulSoup and Selenium:
http://starwood.ugc.bazaarvoice.com/3523si-en_us/115/reviews.djs?format=embeddedhtml&page=2&scrollToTop=true

I tried this code:

active_review_page_html  = browser.page_source
active_review_page_html = active_review_page_html.replace('\\', "")
hotel_page_soup = BeautifulSoup(active_review_page_html)
print(hotel_page_soup)

But what it returns is data like this:
;<span class="BVRRReviewText">Hotel accommodations and staff were fine ....

But I need to extract that span from the page with:
for review_div in hotel_page_soup.select("span .BVRRReviewText"):

How can I get the real markup from that URL?

1 Answer:

Answer 0: (score: 1)

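First, you gave us the wrong link. Instead of the actual page you are trying to scrape, you gave us a link to a JS file that takes part in loading the page, which would be an unnecessary challenge to parse.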
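To illustrate why such a file is awkward to parse: a JS response of this kind usually carries its HTML inside an escaped string literal. The sketch below uses a made-up payload (the variable name `materials`, the key `reviews_html`, and the overall structure are assumptions, not the real reviews.djs content). It shows that decoding the string literal recovers clean markup, whereas stripping every backslash does not.

```python
import json
import re

# Hypothetical stand-in for a JS response that embeds HTML in a string literal.
# The real reviews.djs payload is larger and may be structured differently.
js_source = r'var materials = {"reviews_html": "<span class=\"BVRRReviewText\">Fine stay.<\/span>\n"};'

# Grab the object literal between the first "{" and the last "}".
match = re.search(r"\{.*\}", js_source, re.DOTALL)

# This only works when the object literal happens to be valid JSON.
payload = json.loads(match.group(0))
html = payload["reviews_html"]
print(html)

# Stripping backslashes wholesale (as in the question) also mangles escapes
# such as \n, and the JavaScript wrapper is still attached.
naive = js_source.replace("\\", "")
print(naive)
```

Even in this simplified case the decoding depends on the literal being valid JSON, which is one reason loading the rendered page in a browser is the easier route.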

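Second, you don't need BeautifulSoup in this case. selenium itself is good at locating elements and extracting their text or attributes, so no extra step is needed here.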

Here is a working example that uses the actual page containing the reviews you want to get:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # or webdriver.Firefox()
driver.get('http://www.starwoodhotels.com/sheraton/property/reviews/index.html?propertyID=115&language=en_US')

# wait for the reviews to load
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "span.BVRRReviewText")))

# get reviews
for review_div in driver.find_elements(By.CSS_SELECTOR, "span.BVRRReviewText"):
    print(review_div.text)
    print("---")

driver.close()

Prints:

This is not a low budget hotel . Yet the hotel offers no amenities. Nothing and no WiFi. In fact, you block the wifi that comes with my celluar plan. I am a part of 2 groups that are loyal to the Sheraton, Alabama A&M and the 9th Episcopal District AMEChurch but the Sheraton is not loyal to us.
---
We are a company that had (5) guest rooms at the hotel. Despite having a credit card on file for room and tax charges, my guest was charged the entire amount to her personal credit card. It has taken me (5) PHONE CALLS and my own time and energy to get this bill reversed. I guess leaving a message with information and a phone number numerous times is IGNORED at this hotel. You can guarantee that we will not return with our business. YOu may thank Kimerlin or Kimberly in your accounting office for her lack of personal service and follow through for the lost business in the future.
---
...

I deliberately left the pagination for you to handle. Let me know if you run into difficulties.
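One possible starting point for pagination, offered as a sketch and not as the answerer's method: the URL in the question carries a `page` query parameter, so a small helper can rewrite it. The helper name `with_page` is hypothetical, and whether the rendered reviews page honours the same parameter is an assumption that would need checking in the browser.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_page(url, page):
    """Return `url` with its `page` query parameter set to `page` (hypothetical helper)."""
    parts = urlsplit(url)
    # Keep the existing parameters in order and overwrite (or add) `page`.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query["page"] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


# URL taken from the question; only the page number changes.
base = (
    "http://starwood.ugc.bazaarvoice.com/3523si-en_us/115/reviews.djs"
    "?format=embeddedhtml&page=2&scrollToTop=true"
)

for n in range(1, 4):
    print(with_page(base, n))
```

With selenium, an alternative is to locate and click the page's own "next" control inside a loop, repeating the wait-and-extract step after each click; the selector for that control is not given in the thread.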