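I am trying to scrape airline reviews on TripAdvisor using requests and BeautifulSoup. However, when I apply BeautifulSoup to the result of the request, I do not get the full page source; I seem to get only part of it. Is there some kind of protection in place? Is there a mistake in my code?
Here is my code: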
#%% Libraries and other basic inputs
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
#'User-Agent': '*',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate',
'Connection': 'keep-alive',
'Upgrade-Insecure-Requests': '1',
}
s = requests.Session()
s.headers.update(headers)
url = "https://www.tripadvisor.com/Airline_Review-d8728987-Reviews-or25-Aeroflot#REVIEWS"
r = s.get(url,allow_redirects=False)
print(r.status_code) # I get status 200
soup = BeautifulSoup(r.text, 'html.parser')
print('\n Body = ',soup.find(class_="location-review-review-list-parts-ExpandableReview__reviewText--gOmRC")) # Example trying to find the body of a review ; the element is actually in the source code but not in soup ; returns None
The page source and the contents of my soup are each more than 2000 lines long, which is why I am not posting them here.
Answer 0 (score: 1)
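The page is rendered by JavaScript using XMLHttpRequest (XHR). requests has no JavaScript engine, so it cannot make those XHR calls and only receives the initial HTML. You can use Selenium or another technique instead. To use Selenium, install it first: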
sudo pip3 install selenium
Then get a driver, for example from https://sites.google.com/a/chromium.org/chromedriver/downloads
The code looks something like this:
from selenium import webdriver
from bs4 import BeautifulSoup
import re
import time
driver = webdriver.Chrome()
url = "https://www.tripadvisor.com/Airline_Review-d8728987-Reviews-or25-Aeroflot#REVIEWS"
driver.get(url)
time.sleep(10)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
# Match on the class-name prefix with a regex instead of the full class name.
for reviewTitleText in soup.find_all('a', {"class": re.compile("^location-review-review-list-parts-ReviewTitle__reviewTitleText")}):
    print(reviewTitleText.text)
Output:
Just don’t. And the airport sucks.
Nightmare for Transit flight
Interesting flight
Great flight experience -- Moscow airport understaffed and overworked
Comfortable trip for a business flight
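If you are on Windows, you must pass the path to the driver executable to webdriver.Chrome().
I also used a regular expression for the class name, because the class names change with each page request.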
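To illustrate why matching on a class-name prefix helps, here is a minimal offline sketch. It is not the answer's code: it swaps BeautifulSoup for the standard-library html.parser so it runs without a browser or network, and the sample markup, the hash-like suffixes after "--", and the names SAMPLE_HTML, TitleCollector and extract_titles are all made up for illustration. The regular expression is the same one the answer passes to find_all.

```python
import re
from html.parser import HTMLParser

# Hypothetical markup: the suffixes after "--" are invented for this example.
SAMPLE_HTML = """
<div>
  <a class="location-review-review-list-parts-ReviewTitle__reviewTitleText--2tFRT" href="#">Interesting flight</a>
  <a class="location-review-review-list-parts-ReviewTitle__reviewTitleText--9xQzL" href="#">Nightmare for Transit flight</a>
  <a class="some-other-link" href="#">Not a review title</a>
</div>
"""

# Same prefix pattern as in the answer; only the start of the class name is fixed.
TITLE_CLASS = re.compile(r"^location-review-review-list-parts-ReviewTitle__reviewTitleText")


class TitleCollector(HTMLParser):
    """Collects the text of <a> tags that have a class matching TITLE_CLASS."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        classes = (dict(attrs).get("class") or "").split()
        # A tag qualifies if any one of its classes starts with the prefix.
        self._in_title = any(TITLE_CLASS.match(c) for c in classes)
        if self._in_title:
            self.titles.append("")

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title = False


def extract_titles(html):
    """Return the stripped text of every matching link in the given HTML."""
    parser = TitleCollector()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


print(extract_titles(SAMPLE_HTML))
```

Both sample links are picked up even though their suffixes differ, while the unrelated link is skipped. An exact class-name match would have to be updated whenever the suffix changed.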