Question

我正在尝试从以下网址获取价格数据。但是我似乎只能将文本从'div's降到一定水平，这是我的代码：

from selenium import webdriver
from bs4 import BeautifulSoup

def scrape_flight_prices(URL):

    browser = webdriver.PhantomJS()
    # PARSE THE HTML
    browser.get(URL)
    soup = BeautifulSoup(browser.page_source, "lxml")
    page_divs = soup.findAll("div", attrs={'id':'app-root'}) 
    for p in page_divs:
        print(p)

if __name__ == '__main__':
  URL1="https://www.skyscanner.net/transport/flights/brs/gnb/190216/190223/?adults=1&children=0&adultsv2=1&childrenv2=&infants=0&cabinclass=economy&rtn=1&preferdirects=false&outboundaltsenabled=false&inboundaltsenabled=false&ref=home#results"

这是输出：

<div id="app-root">
<section class="day-content state-loading state-no-results" id="daysection">
<div class="day-searching">
<div class="hot-spinner medium"></div>
<div class="day-searching-message">Searching</div>
</div>
</section>
</div>

我要从中抓取的html部分如下所示：

https://www.skyscanner.net/transport/flights/brs/gnb/190216/190223/?adults=1&children=0&adultsv2=1&childrenv2=&infants=0&cabinclass=economy&rtn=1&preferdirects=false&outboundaltsenabled=false&inboundaltsenabled=false&ref=home#results

但是，当我尝试使用以下代码进行抓取时：

prices = soup.findAll("a", attrs={'target':"_blank", "data-e2e":"itinerary-price", "class":"CTASection__price-2bc7h price"})  
for p in prices:
    print(p)

它什么也不打印！我怀疑js脚本正在运行某些东西来生成其余的代码和/或数据？谁能帮我提取数据？具体来说，我试图获取价格，航班时间，航空公司名称等信息，但是如果漂亮的汤没有从页面上打印相关的html，那么我不确定如何获得它？

不胜感激任何指点！提前非常感谢！

Answer 1

尝试以下代码以获取价格：

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC

prices = [price.text for price in wait(browser, 10).until(EC.presence_of_all_elements_located((By.CLASS_NAME, "price")))]
print(prices)

python从skyscanner抓取航班和价格数据

1 个答案: