Question

我正在尝试学习使用Python库BeautifulSoup，例如，我想在Google Flights上抓取某个航班的价格。因此，我以例如minimize的价格连接到Google Flights，并且希望获得最便宜的航班价格。

因此，我将使用此类“ gws-flights-results__itinerary-price”（如图所示）在div中获取值。

figure example

这是我写的简单代码：

add_filter( 'woocommerce_get_price_html', 'prepend_text_to_product_price', 20, 2 );

function prepend_text_to_product_price( $price_html, $product ) {
    // Only on frontend and excluding minmax prices on variable products

    if( is_admin() || $product->is_type('variable') ) 
        return $price_html;

    // Get product price
    $price = (float) $product->get_price();  // Regular price

    if( $price > 15 )
        $price_html = '<span>'.__('your text', 'woocommerce' ).'<span> '.$price_html;

    return $price_html;
}

但是结果div的类为from bs4 import BeautifulSoup import urllib.request url = 'https://www.google.com/flights?hl=it#flt=/m/07_pf./m/05qtj.2019-04-27;c:EUR;e:1;sd:1;t:f;tt:o' page = urllib.request.urlopen(url) soup = BeautifulSoup(page, 'html.parser') div = soup.find('div', attrs={'class': 'gws-flights-results__itinerary-price'})。

我也尝试

NoneType

但是在以这种方式找到的所有div中，没有我感兴趣的div。有人可以帮我吗？

Answer 1

可能需要运行javascript，因此请使用诸如硒之类的方法

from selenium import webdriver
url = 'https://www.google.com/flights?hl=it#flt=/m/07_pf./m/05qtj.2019-04-27;c:EUR;e:1;sd:1;t:f;tt:o'
driver = webdriver.Chrome()
driver.get(url)
print(driver.find_element_by_css_selector('.gws-flights-results__cheapest-price').text)
driver.quit()

Answer 2

您正在学习网页抓取真是太好了！结果得到NoneType的原因是，您要抓取的网站动态加载内容。当请求库获取网址时，它仅包含javascript。而具有此类“ gws-flights-results__itinerary-price”的div尚未呈现！因此，您用来抓取此网站的抓取方法将无法实现。

不过，您可以使用其他方法，例如使用诸如selenium或splash这样的工具来获取页面以呈现javascript，然后解析内容。

Answer 3

BeautifulSoup是用于提取部分HTML或XML的好工具，但是在这里看起来您只需要将URL转到另一个JSON对象的GET请求即可。

（我现在不在电脑旁，明天可以举一个例子。）

网络抓取谷歌航班价格

3 个答案: