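BeautifulSoup-Python: How do you scrape data that has not loaded yet?

Time: 2018-06-09 14:45:04

Tags: python beautifulsoup

I tried scraping with BeautifulSoup, but it returns []. When I then looked at the page source, the content I want was replaced by a div class="loading32".

How do you scrape this kind of element?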

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = productUrl  # bs4 part
uClient = uReq(my_url)  # bs4 part
page_html = uClient.read()  # bs4 part
uClient.close()  # bs4 part
page_soup = soup(page_html, "html.parser")  # bs4 part
description = page_soup.findAll("div", {"class": "ui-box product-description-main"})
string4 = str(description)

URL: https://www.aliexpress.com/store/product/100-Original-16-Shimano-Casitas-150-151-150hg-151hg-Right-Left-Hand-Baitcasting-Fishing-Reel-4/1053031_32657797704.html?spm=2114.12010608.0.0.22e12d66I7a3Dp

<div class="ui-box product-description-main" id="j-product-description">
        <div class="ui-box-title">Product Description</div>
        <div class="ui-box-body">

            <div class="description-content" data-role="description" data-spm="1000023">
            <div class="loading32"></div>
            </div>

        </div>
    </div>

2 answers:

Answer 0 (score: 1)

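The problem here is that the content behind these loading32 elements is loaded by JavaScript that runs on the client, so it is not present in the HTML that urlopen returns. This is an ideal use case for Splash, a JavaScript rendering service from ScrapingHub. It exposes an HTTP API that you can call with curl, and it can be scripted in Lua, which also helps you avoid a lot of issues such as JS-triggered page loads, waits, clicks, and the like.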

Link: Splash Documentation

You can also integrate it with Scrapy, which works very well.

Link: Scrapy Splash Github
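
To make the Splash suggestion concrete, here is a minimal sketch that asks Splash's render.html endpoint for the page after its JavaScript has run. It assumes a Splash instance is already running locally on its default port 8050; the helper names build_splash_url and fetch_rendered_html are hypothetical, and the wait value is a guess that may need tuning for this page.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def build_splash_url(page_url, wait=2.0, splash_base='http://localhost:8050'):
    # render.html returns the page HTML after JavaScript has executed.
    # 'wait' is the number of seconds Splash waits before taking the snapshot.
    query = urlencode({'url': page_url, 'wait': wait})
    return '{}/render.html?{}'.format(splash_base, query)


def fetch_rendered_html(page_url, **kwargs):
    # Assumption: Splash is reachable at splash_base (default localhost:8050).
    with urlopen(build_splash_url(page_url, **kwargs)) as response:
        return response.read()


# Usage (requires a running Splash instance):
# page_html = fetch_rendered_html(my_url, wait=3)
# page_soup = BeautifulSoup(page_html, 'html.parser')
```

The returned HTML can then be passed to BeautifulSoup exactly as in the question, so the rest of the parsing code does not need to change.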

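Answer 1 (score: 0)

The information is there, and getting it does not require JavaScript. You just need to look at the HTML that is returned and work out the best way to extract the items you want. I am guessing you are trying to get something like the following: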

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup 

my_url = 'https://www.aliexpress.com/store/product/100-Original-16-Shimano-Casitas-150-151-150hg-151hg-Right-Left-Hand-Baitcasting-Fishing-Reel-4/1053031_32657797704.html?spm=2114.12010608.0.0.22e12d66I7a3Dp'
uClient = uReq(my_url)  # bs4 part
page_html = uClient.read()  # bs4 part
uClient.close()  # bs4 part

soup = BeautifulSoup(page_html, "html.parser")  # bs4 part

details = {}
details['Product Name'] = soup.find('h1', class_='product-name').text
details['Price Range'] = soup.find('div', class_='p-current-price').find_all('span')[1].text

item_specifics = soup.find('ul', class_='product-property-list util-clearfix')
for li in item_specifics.find_all('li'):
    entry = li.get_text(strip=True).split(':')
    details[entry[0]] = ', '.join(entry[1:])

# Locate the image    
li = soup.find('div', class_='ui-image-viewer-thumb-wrap')
url = li.img['src']
details['Image URL'] = url
details['Image Filename'] = url.rsplit('/', 1)[1]

for item, desc in details.items():
    print('{:30} {}'.format(item, desc))

This gives the following information:

Product Name                   Original 2016 Shimano Casitas 150 151 150hg 151hg Right Left Hand Baitcasting Fishing Reel 4+1BB 5.5kg SVS Infinity fishing reel
Price Range                    83.60 - 85.60
Fishing Method                 Bait Casting
Baits Type                     Fake Bait
Position                       Ocean Rock Fshing,River,Stream,Reservoir Pond,Ocean Beach Fishing,Lake,Ocean Boat Fishing
Fishing Reels Type             Baitcast Reel
Model Number                   Casitas
Brand Name                     Shimano
Ball Bearings                  4+1BB
Feature 1                      Shimano Stable Spool S3D
Feature 2                      SVS Infinity Brake System (Infinite Cast Control)
Model                          150/ 151/ 150HG/ 151HG
PE Line (50 test /m)           20-150/30-135/ 40-105
Nylon Line (51hg test /m)      10-120/12-110/14-90
Weight                         190g
Gear Ratio                     6.3, 1 / 7.2, 1
Made in                        Malaysia
Image URL                      https://ae01.alicdn.com/kf/HTB1qRKzJFXXXXboXVXXq6xXFXXXU/Original-2016-Shimano-Casitas-150-151-150hg-151hg-Right-Left-Hand-Baitcasting-Fishing-Reel-4-1BB.jpg_640x640.jpg
Image Filename                 Original-2016-Shimano-Casitas-150-151-150hg-151hg-Right-Left-Hand-Baitcasting-Fishing-Reel-4-1BB.jpg_640x640.jpg
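
Note that Gear Ratio is printed as "6.3, 1 / 7.2, 1", presumably because split(':') also splits on the colons inside the value and the pieces are then re-joined with ', '. A minimal sketch of an alternative, assuming each property has the form "Name:value", is to split on the first colon only; parse_property is a hypothetical helper, not part of the answer's code.

```python
def parse_property(text):
    # partition() splits on the first ':' only, so colons in the value survive.
    name, _, value = text.partition(':')
    return name.strip(), value.strip()


name, value = parse_property('Gear Ratio:6.3:1 / 7.2:1')
print('{:30} {}'.format(name, value))
```

Inside the loop, this would replace the split and join lines with details[name] = value.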

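The image information is stored as well. The image can then be downloaded with another uReq call, and the data written to a file in binary mode using the filename that was extracted.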
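
The download step described above could look roughly like this. It is a sketch under the assumption that the image URL is directly fetchable without extra headers or cookies; filename_from_url and download_binary are hypothetical helper names.

```python
import shutil
from urllib.request import urlopen


def filename_from_url(url):
    # Same approach as the answer: keep everything after the last '/'.
    return url.rsplit('/', 1)[1]


def download_binary(url, filename):
    # Stream the response body into a file opened in binary mode.
    with urlopen(url) as response, open(filename, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
    return filename


# Usage with the dictionary built above (requires network access):
# download_binary(details['Image URL'], details['Image Filename'])
```

Opening the file with 'wb' matters here, since image data written in text mode would be corrupted or rejected.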