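I tried scraping with BeautifulSoup, but it returns []. When I then looked at the page source, I found div class="loading32" where the description should be.
How do you scrape this kind of element?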
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = productUrl # bs4 part
uClient = uReq(my_url) # bs4 part
page_html = uClient.read() # bs4 part
uClient.close() # bs4 part
page_soup = soup(page_html, "html.parser") # bs4 part
description = page_soup.findAll("div", {"class": "ui-box product-description-main"})
string4 = str(description)
<div class="ui-box product-description-main" id="j-product-description">
<div class="ui-box-title">Product Description</div>
<div class="ui-box-body">
<div class="description-content" data-role="description" data-spm="1000023">
<div class="loading32"></div>
</div>
</div>
</div>
Answer 0 (score: 1)
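The problem here is that the content behind these loading32 elements is loaded by client-side JavaScript, so it is not in the HTML that urlopen returns. This is an ideal use case for Splash. ScrapingHub provides this renderer, which can be used through an HTTP API (for example with curl), and you can also run Lua scripts in Splash, which helps you avoid many problems such as JS-triggered page loads, waiting, clicking, and so on.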
In addition, you can integrate Splash with Scrapy, which works very well.
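As a rough sketch of the Splash approach, the snippet below asks Splash's render.html endpoint for the page after its JavaScript has run. It assumes a Splash instance is already running at http://localhost:8050 (that address, the 2-second wait, and the helper names build_render_url and fetch_rendered_html are illustrative assumptions, not part of the original answer).

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: a Splash instance is reachable here (8050 is Splash's default port).
SPLASH_BASE = "http://localhost:8050"


def build_render_url(page_url, wait=2.0, splash_base=SPLASH_BASE):
    """Build the URL for Splash's render.html endpoint.

    `wait` is how many seconds Splash waits after the page loads so that
    client-side scripts have time to fill in the content.
    """
    query = urlencode({"url": page_url, "wait": wait})
    return "{}/render.html?{}".format(splash_base, query)


def fetch_rendered_html(page_url, wait=2.0):
    """Return the HTML (as bytes) of the page after Splash has rendered it."""
    with urlopen(build_render_url(page_url, wait)) as response:
        return response.read()
```

The bytes returned by fetch_rendered_html(productUrl) could then be passed to soup(page_html, "html.parser") in place of uClient.read(). Whether 2 seconds is long enough for this particular page is something to check by trial.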
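Answer 1 (score: 0)
The information is there, and getting it does not require JavaScript. You just need to look at the returned HTML and work out the best way to extract the items you need. I am guessing you are trying to get something like the following: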
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup
my_url = 'https://www.aliexpress.com/store/product/100-Original-16-Shimano-Casitas-150-151-150hg-151hg-Right-Left-Hand-Baitcasting-Fishing-Reel-4/1053031_32657797704.html?spm=2114.12010608.0.0.22e12d66I7a3Dp'
uClient = uReq(my_url) # bs4 part
page_html = uClient.read() # bs4 part
uClient.close() # bs4 part
soup = BeautifulSoup(page_html, "html.parser") # bs4 part
details = {}
details['Product Name'] = soup.find('h1', class_='product-name').text
details['Price Range'] = soup.find('div', class_='p-current-price').find_all('span')[1].text
item_specifics = soup.find('ul', class_='product-property-list util-clearfix')
for li in item_specifics.find_all('li'):
    entry = li.get_text(strip=True).split(':')
    details[entry[0]] = ', '.join(entry[1:])
# Locate the image
li = soup.find('div', class_='ui-image-viewer-thumb-wrap')
url = li.img['src']
details['Image URL'] = url
details['Image Filename'] = url.rsplit('/', 1)[1]
for item, desc in details.items():
    print('{:30} {}'.format(item, desc))
This gives the following information:
Product Name                   Original 2016 Shimano Casitas 150 151 150hg 151hg Right Left Hand Baitcasting Fishing Reel 4+1BB 5.5kg SVS Infinity fishing reel
Price Range                    83.60 - 85.60
Fishing Method                 Bait Casting
Baits Type                     Fake Bait
Position                       Ocean Rock Fshing,River,Stream,Reservoir Pond,Ocean Beach Fishing,Lake,Ocean Boat Fishing
Fishing Reels Type             Baitcast Reel
Model Number                   Casitas
Brand Name                     Shimano
Ball Bearings                  4+1BB
Feature 1                      Shimano Stable Spool S3D
Feature 2                      SVS Infinity Brake System (Infinite Cast Control)
Model                          150/ 151/ 150HG/ 151HG
PE Line (50 test /m)           20-150/30-135/ 40-105
Nylon Line (51hg test /m)      10-120/12-110/14-90
Weight                         190g
Gear Ratio                     6.3, 1 / 7.2, 1
Made in                        Malaysia
Image URL                      https://ae01.alicdn.com/kf/HTB1qRKzJFXXXXboXVXXq6xXFXXXU/Original-2016-Shimano-Casitas-150-151-150hg-151hg-Right-Left-Hand-Baitcasting-Fishing-Reel-4-1BB.jpg_640x640.jpg
Image Filename                 Original-2016-Shimano-Casitas-150-151-150hg-151hg-Right-Left-Hand-Baitcasting-Fishing-Reel-4-1BB.jpg_640x640.jpg
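The image information is stored as well. The image can then be downloaded with another uReq call, and the data written to a file in binary mode using the filename that was extracted.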
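A minimal sketch of that download step might look like the following. The helper names filename_from_url and save_image are hypothetical; the filename logic mirrors the rsplit used for details['Image Filename'], and the file is opened in 'wb' mode so the bytes are written unchanged.

```python
from urllib.request import urlopen


def filename_from_url(url):
    """Return everything after the last '/' in the URL."""
    return url.rsplit('/', 1)[1]


def save_image(url, filename=None):
    """Download `url` and write the raw bytes to `filename`.

    If no filename is given, one is derived from the URL.
    Returns the filename that was written.
    """
    if filename is None:
        filename = filename_from_url(url)
    with urlopen(url) as response:
        data = response.read()
    # 'wb' writes the image bytes as-is, with no text decoding.
    with open(filename, 'wb') as f:
        f.write(data)
    return filename
```

With the dictionary built earlier, this would be called as save_image(details['Image URL'], details['Image Filename']).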