I am currently learning basic web scraping with Python and Beautiful Soup. I built something in my Jupyter notebook and it works, but when I run the same code from a .py file in the terminal, BeautifulSoup does not seem to parse anything and nothing gets printed.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import pandas as pd
driver = webdriver.Chrome(executable_path="/Users/Shiva/Downloads/chromedriver")
driver.get('https://www.google.com/flights?hl=en#flt=/m/03v_5.IAD.2019-02-10*IAD./m/03v_5.2019-02-11;c:USD;e:1;sd:1;t:f')
load_all_flights = driver.find_element_by_xpath('//*[@id="flt-app"]/div[2]/main[4]/div[7]/div[1]/div[3]/div[4]/div[5]/div[1]/div[3]/jsl/a[1]/span[1]/span[2]')
load_all_flights.click()
soup = BeautifulSoup(driver.page_source, 'html.parser')
info = soup.find_all('div', class_="gws-flights-results__collapsed-itinerary gws-flights-results__itinerary")
for trip in info:
    price = trip.find('div', class_="flt-subhead1 gws-flights-results__price gws-flights-results__cheapest-price")
    if price == None:
        price = trip.find('div', class_="flt-subhead1 gws-flights-results__price")
    type_of_flight = trip.find('div', class_="gws-flights-results__stops flt-subhead1Normal gws-flights-results__has-warning-icon")
    if type_of_flight == None:
        type_of_flight = trip.find('div', class_="gws-flights-results__stops flt-subhead1Normal")
    print(str(type_of_flight.text).strip() + " : " + str(price.text).strip())
In the Jupyter notebook, I get a list of flight types and prices, e.g. "Nonstop : $500".
But it does not work in the terminal, because the "info" variable ends up as an empty list.
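One quick way to check whether timing is the culprit is to pause before parsing (a crude sleep, not the proper fix; the answer below uses an explicit wait). If the itineraries show up after the pause, the page simply had not finished rendering when the script grabbed the source. This is a sketch that reuses the driver and BeautifulSoup objects from the snippet above:

import time

# Crude timing check only: give the JavaScript-rendered results a chance to appear
# before grabbing the page source.
time.sleep(10)
soup = BeautifulSoup(driver.page_source, 'html.parser')
info = soup.find_all('div', class_="gws-flights-results__collapsed-itinerary gws-flights-results__itinerary")
print(len(info))  # 0 without the sleep but non-zero with it points to a rendering delay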
Answer 0 (score: 0):
You need to wait for the page to render. The reason Jupyter gets the data is that it is slow enough (or you run the cells separately) that the page has rendered before you parse it. The following should do the trick:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd
driver = webdriver.Chrome(executable_path="C:\\Users\\Andersen\\Desktop\\Tools\\chromedriver.exe")
driver.get('https://www.google.com/flights?hl=en#flt=/m/03v_5.IAD.2019-02-10*IAD./m/03v_5.2019-02-11;c:USD;e:1;sd:1;t:f')
xpath = '//*[@id="flt-app"]/div[2]/main[4]/div[7]/div[1]/div[3]/div[4]/div[5]/div[1]/div[3]/jsl/a[1]/span[1]/span[2]'
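# Explicit wait: give the page up to 10 seconds to render the 'load all flights' control before clicking it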
wait = WebDriverWait(driver, 10)
confirm = wait.until(EC.element_to_be_clickable((By.XPATH, xpath)))
load_all_flights = driver.find_element_by_xpath(xpath)
load_all_flights.click()
soup = BeautifulSoup(driver.page_source, 'html.parser')
info = soup.find_all('div', class_="gws-flights-results__collapsed-itinerary gws-flights-results__itinerary")
for trip in info:
    price = trip.find('div', class_="flt-subhead1 gws-flights-results__price gws-flights-results__cheapest-price")
    if price == None:
        price = trip.find('div', class_="flt-subhead1 gws-flights-results__price")
    type_of_flight = trip.find('div', class_="gws-flights-results__stops flt-subhead1Normal gws-flights-results__has-warning-icon")
    if type_of_flight == None:
        type_of_flight = trip.find('div', class_="gws-flights-results__stops flt-subhead1Normal")
    print(str(type_of_flight.text).strip() + " : " + str(price.text).strip())
Output (as of February 2, 2019):
2 stops : $588
Nonstop : $749
Nonstop : $749
1 stop : $866
2 stops : $1,271
2 stops : $1,294
2 stops : $1,294
2 stops : $1,805
2 stops : $1,805
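A possible refinement (a sketch, assuming the same driver session and the class names used above, which Google may have changed since): instead of waiting only for the "load all flights" control, you can also wait until the itinerary rows themselves are present before handing the page source to BeautifulSoup, which makes the script less sensitive to how long the results take to appear after the click.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

# After clicking, wait until at least one itinerary row is present in the DOM
# before parsing (class name taken from this thread; it may have changed since).
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_all_elements_located(
    (By.CSS_SELECTOR, "div.gws-flights-results__collapsed-itinerary")))

soup = BeautifulSoup(driver.page_source, 'html.parser')
info = soup.find_all('div', class_="gws-flights-results__collapsed-itinerary gws-flights-results__itinerary")

Note also that recent Selenium releases (4.x) have removed find_element_by_xpath and the executable_path argument; on current versions you would use driver.find_element(By.XPATH, xpath) and pass the driver path via a Service object.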