Python requests cannot get the page source on TripAdvisor

Time: 2019-09-10 17:08:35

Tags: web-scraping beautifulsoup python-requests

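I am trying to scrape airline reviews on TripAdvisor using requests and BeautifulSoup. However, when I apply BeautifulSoup to the result of the request, I cannot get the full source code of the page; it seems I only get part of it. Is there some kind of protection? Is there a mistake in my code?

Here is my code: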

#%% Libraries and other basic inputs
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    #'User-Agent': '*',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1',
}

s = requests.Session()
s.headers.update(headers)

url = "https://www.tripadvisor.com/Airline_Review-d8728987-Reviews-or25-Aeroflot#REVIEWS"
r = s.get(url,allow_redirects=False)

print(r.status_code) # I get status 200

soup = BeautifulSoup(r.text, 'html.parser')

print('\n Body = ',soup.find(class_="location-review-review-list-parts-ExpandableReview__reviewText--gOmRC")) # Example trying to find the body of a review ; the element is actually in the source code but not in soup ; returns None

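Both the page source and the content of my soup are over 2,000 lines long, which is why I am not posting them here.

1 Answer:

Answer 0: (score: 1)

The page is rendered with JavaScript, which loads the content via XMLHttpRequest (XHR). Because requests has no JavaScript engine, it cannot run those XHR calls, so it only receives the initial HTML. You can use Selenium or another technique. Using Selenium: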

sudo pip3 install selenium

Then get a driver, e.g. https://sites.google.com/a/chromium.org/chromedriver/downloads

The code looks something like this:

from selenium import webdriver
from bs4 import BeautifulSoup
import re
import time

# Launch a Chrome browser controlled by Selenium
driver = webdriver.Chrome()

url = "https://www.tripadvisor.com/Airline_Review-d8728987-Reviews-or25-Aeroflot#REVIEWS"
driver.get(url)

# Give the page's JavaScript time to load the reviews
time.sleep(10)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

# Match review titles by class-name prefix, ignoring the trailing suffix
for reviewTitleText in soup.find_all('a', {"class": re.compile("^location-review-review-list-parts-ReviewTitle__reviewTitleText")}):
    print(reviewTitleText.text)

Output:

Just don’t. And the airport sucks.
Nightmare for Transit flight
Interesting flight
Great flight experience -- Moscow airport understaffed and overworked
Comfortable trip for a business flight

If you are on Windows, you have to provide the path to the driver in webdriver.Chrome().

I also used a regular expression for the class name, because the class names change with every page request.
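
To illustrate the prefix matching in isolation, here is a minimal sketch that runs against a small static HTML string instead of the live page. The sample markup, the suffixes `--abc12` and `--xyz89`, and the helper name `extract_review_titles` are made up for this example; only the class-name prefix comes from the code above.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: the suffixes after "--" are invented for illustration.
SAMPLE_HTML = """
<div>
  <a class="location-review-review-list-parts-ReviewTitle__reviewTitleText--abc12" href="#">Interesting flight</a>
  <a class="location-review-review-list-parts-ReviewTitle__reviewTitleText--xyz89" href="#">Nightmare for Transit flight</a>
  <a class="some-other-class" href="#">Not a review title</a>
</div>
"""

# Anchor on the stable prefix so a changing suffix does not break the match.
TITLE_CLASS = re.compile(r"^location-review-review-list-parts-ReviewTitle__reviewTitleText")


def extract_review_titles(html):
    """Return the text of every <a> whose class starts with the title prefix."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.find_all("a", class_=TITLE_CLASS)]


titles = extract_review_titles(SAMPLE_HTML)
print(titles)
```

With Selenium, the same helper could be given `driver.page_source` in place of `SAMPLE_HTML`.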