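I'm new to BeautifulSoup and, in fact, new to Python altogether. As a self-study exercise I'm trying to scrape this page (https://th.jbl.com/bluetooth-portables). I want to collect the SKU information for every product: specifically the category, subcategory, name and productID. I managed to scrape the category, the subcategory, and most of the product names with their productIDs, but as you can see the product listing on this page is not consistent, and I keep getting stuck on one product: Xtreme 2 has no productID. The scraping more or less stops at that point, even though 3 more tiles remain to be scraped.
I have tried my best but cannot get past it. In practice, the name Xtreme 2 is only printed, and nothing is written to the file for it.
My approach is to scrape every tile that contains a product and ignore the rest. Here is the code: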
from bs4 import BeautifulSoup
import requests, csv, codecs, json

jbl_file = codecs.open('JBL_TH.csv', 'w', encoding='utf_8_sig')
csv_writer = csv.writer(jbl_file)
csv_writer.writerow(['Category', 'Subcategory', 'Product', 'Prod_ID'])

url = 'https://th.jbl.com/bluetooth-portables'
source = requests.get(url).content
soup = BeautifulSoup(source, 'lxml')
home = soup.find('div', class_='breadcrumb clearfix')

# find the category and subcategory
category = home.find('h2', property='name').text
subcategory = home.find('span', class_='breadcrumb-last', property='name').text

# define the scope of the product container
all = soup.find('div', id='search-result-items')

# each product is in a tile, so find all the tiles
for tile in all.find_all('div', class_='product-tile'):
    print(category)
    print(subcategory)
    # a tile may not contain a product, so such tiles are skipped with try ... except ...
    try:
        # find the name of the product in each tile
        name = tile.find('a', class_='productname-link')['title']
        print(name)
        # even if a product exists in the tile, the SKU ID may be missing from
        # the product-swatches class, so this is also wrapped in try ... except ...
        try:
            # the SKU ID is in the product-swatches class
            directory = tile.find('div', class_='product-swatches')
            # each product may have several SKU IDs, one per swatch-data,
            # hence another for loop
            for colour in directory.find_all('div', class_='swatch-data'):
                product_id = json.loads(colour.text)['productID']
                print(product_id)
                # write the category, subcategory, name and product_id to the csv file
                csv_writer.writerow([category, subcategory, name, product_id])
        except Exception:
            name = 'dummy'
            product_id = 'dummy'
            print(name)
            print(product_id)
            # if they are not found, use dummy as name and product_id
            csv_writer.writerow([category, subcategory, name, product_id])
    except Exception:
        name = 'dummy'
        product_id = 'dummy'
        print(name)
        print(product_id)
        # if they are not found, use dummy as name and product_id
        csv_writer.writerow([category, subcategory, name, product_id])

jbl_file.close()
I have also attached two screenshots that show in detail the information I want to scrape. Can anyone help me?
Thanks.
Answer 0 (score: 0)
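That is the whole page. The page contains only a few products at first; more are loaded by JavaScript as you scroll down in the browser.
Since BeautifulSoup cannot execute JavaScript, you either have to use another tool (such as Selenium) or mimic what the JavaScript does yourself.
Using the developer network inspector in the browser (press F12 in Firefox), I can see that after scrolling down the browser sends a request to the URL https://th.jbl.com/bluetooth-portables?prefn1=isAvailabilityforLocale&prefn2=isRefurbished&prefv3=false&sz=12&start=12&format=page-element&prefv1=yes&prefv2=false&prefn3=isSupport
As you can see, it uses the start=12 parameter to say that it wants the results starting at position 12. It looks like you can manipulate that parameter to get the data you need.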
n = 0
while True:
    n = n + 12
    new_url = 'https://th.jbl.com/bluetooth-portables?sz=12&start={}'.format(n)
    # fetch new_url and repeat parsing...
    # (break out of the loop once a page returns no more product tiles)
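To make the loop above a bit more concrete, here is a minimal sketch of the paging logic with a stop condition. The names `page_url`, `fetch_all_pages`, `fetch`, `count_tiles` and `max_pages` are made up for illustration, and it assumes the server accepts the shortened `sz`/`start` query string and returns a page without product tiles once `start` runs past the last product. I have not verified either assumption against the site.

```python
from urllib.parse import urlencode

BASE_URL = 'https://th.jbl.com/bluetooth-portables'
PAGE_SIZE = 12  # matches sz=12 in the request seen in the network inspector


def page_url(start, size=PAGE_SIZE):
    """Build the listing URL for the results beginning at position `start`."""
    return '{}?{}'.format(BASE_URL, urlencode({'sz': size, 'start': start}))


def fetch_all_pages(fetch, count_tiles, max_pages=20):
    """Collect pages until one comes back without product tiles.

    `fetch(url)` returns the page HTML and `count_tiles(html)` returns how many
    product tiles it contains; both are supplied by the caller (hypothetical
    hooks, not part of any library). `max_pages` is a safety cap so the loop
    cannot run forever if the stop condition is never met.
    """
    pages = []
    for page in range(max_pages):
        html = fetch(page_url(page * PAGE_SIZE))
        if count_tiles(html) == 0:
            break
        pages.append(html)
    return pages
```

With requests and BeautifulSoup you could pass something like `lambda u: requests.get(u).text` as `fetch`, and a function that counts `div.product-tile` elements as `count_tiles`, then run the tile parsing on each returned page.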
Answer 1 (score: -1)
You can use getattr:
import requests, json
from bs4 import BeautifulSoup as soup

d = soup(requests.get('https://th.jbl.com/bluetooth-portables').text, 'html.parser')
category = d.find('h2', {'property':'name'}).text
subcategory = d.find('span', {'class':'breadcrumb-last'}).text
products = [i for i in d.find_all('div', {'class':'product-tile'})]
# for every tile, look up the name link and the first swatch-data element
headers = [['a', {'class':'productname-link'}], ['div', {'class':'swatch-data'}]]
new_products = [[getattr(i, 'find', lambda *_:None)(*b) for b in headers] for i in products]
# getattr falls back to an empty dict / empty JSON object when an element is missing (None)
final_products = [[getattr(a, 'attrs', {}).get('title'), json.loads(getattr(b, 'text', '{}')).get('productID')] for a, b in new_products]
Output:
Wireless
Portables
[['JBL Flip 4', 'JBLFLIP4BLKAM'], [None, None], ['JBL Flip 3', 'JBLFLIP3BLK'], ['JBL Flip 3 Special Edition', 'JBLFLIP3MALTA'], ['JBL Charge 3', 'JBLCHARGE3BLKAS'], ['JBL Charge 3 Special Edition', 'JBLCHARGE3MOSAICAS'], ['JBL Clip 2', 'JBLCLIP2BLK'], ['JBL Clip 2 Special Edition', 'JBLCLIP2MALTA'], ['JBL Pulse 2', 'JBLPULSE2BLKAS'], ['JBL GO', 'JBLGOBLK'], ['JBL Xtreme', 'JBLXTREMEBLKAS'], ['JBL Xtreme Special Edition', 'JBLXTREMESQUADAS'], ['Xtreme 2', None]]
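If the getattr one-liners are hard to follow, the same idea can be spelled out with explicit None checks. This is only a sketch run against a hypothetical, simplified snippet of markup (`SAMPLE_HTML`) that mimics the tile structure described in the question; the real page may be nested differently, and `EXAMPLE-SECOND-ID` is a made-up ID. Unlike the list comprehension, it collects every swatch-data entry of a tile and still emits a row for a product such as Xtreme 2 that has no productID.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only, not copied from the real page.
SAMPLE_HTML = """
<div id="search-result-items">
  <div class="product-tile">
    <a class="productname-link" title="JBL Flip 4"></a>
    <div class="product-swatches">
      <div class="swatch-data">{"productID": "JBLFLIP4BLKAM"}</div>
      <div class="swatch-data">{"productID": "EXAMPLE-SECOND-ID"}</div>
    </div>
  </div>
  <div class="product-tile"></div>
  <div class="product-tile">
    <a class="productname-link" title="Xtreme 2"></a>
  </div>
</div>
"""


def extract_tile(tile):
    """Return (name, [product IDs]) for one tile, or None if it holds no product."""
    link = tile.find('a', class_='productname-link')
    if link is None:
        return None
    ids = []
    for swatch in tile.find_all('div', class_='swatch-data'):
        try:
            product_id = json.loads(swatch.text).get('productID')
        except ValueError:  # swatch text is not valid JSON
            continue
        if product_id:
            ids.append(product_id)
    return link.get('title'), ids


def extract_rows(html):
    """Return one [name, product_id] row per SKU; products without an ID get ''."""
    parsed = BeautifulSoup(html, 'html.parser')
    rows = []
    for tile in parsed.find_all('div', class_='product-tile'):
        result = extract_tile(tile)
        if result is None:
            continue  # empty tile, nothing to record
        name, ids = result
        for product_id in ids or ['']:
            rows.append([name, product_id])
    return rows


rows = extract_rows(SAMPLE_HTML)
```

The point of `ids or ['']` is that a product with no swatches still produces exactly one row, which is the case the nested for loop in the question silently skips.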
To write the results to a csv file, you can use the following:
import csv
with open('product_listings.csv', 'w') as f:
    write = csv.writer(f)
    write.writerows([['category', 'subcategory', 'title', 'id'], [category, subcategory]+(['']*2)]+[(['']*2)+['' if not i else i for i in b] for b in final_products if any(b)])
Output:
category,subcategory,title,id
Wireless,Portables,,
,,JBL Flip 4,JBLFLIP4BLKAM
,,JBL Flip 3,JBLFLIP3BLK
,,JBL Flip 3 Special Edition,JBLFLIP3MALTA
,,JBL Charge 3,JBLCHARGE3BLKAS
,,JBL Charge 3 Special Edition,JBLCHARGE3MOSAICAS
,,JBL Clip 2,JBLCLIP2BLK
,,JBL Clip 2 Special Edition,JBLCLIP2MALTA
,,JBL Pulse 2,JBLPULSE2BLKAS
,,JBL GO,JBLGOBLK
,,JBL Xtreme,JBLXTREMEBLKAS
,,JBL Xtreme Special Edition,JBLXTREMESQUADAS
,,Xtreme 2,
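If you would rather have the category and subcategory repeated on every row, as in the layout from the question, a plain loop is probably easier to read than the one-liner. A minimal sketch, assuming `products` is a list of `[title, productID]` pairs shaped like `final_products` above; `write_products` is a hypothetical helper name, and the example writes to an in-memory buffer instead of a file so it can be tried without touching disk.

```python
import csv
import io


def write_products(out, category, subcategory, products):
    """Write one csv row per product, repeating category and subcategory."""
    writer = csv.writer(out)
    writer.writerow(['Category', 'Subcategory', 'Product', 'Prod_ID'])
    for title, product_id in products:
        if title is None and product_id is None:
            continue  # tile without a product
        # a missing title or ID becomes an empty cell instead of the text "None"
        writer.writerow([category, subcategory, title or '', product_id or ''])


buffer = io.StringIO()
write_products(buffer, 'Wireless', 'Portables',
               [['JBL Flip 4', 'JBLFLIP4BLKAM'], [None, None], ['Xtreme 2', None]])
csv_text = buffer.getvalue()
```

When writing to a real file in Python 3, open it with `newline=''` so the csv module controls the line endings itself.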
Edit: using selenium:
from selenium import webdriver
from bs4 import BeautifulSoup as soup
import json
import time

d = webdriver.Chrome('/path/to/driver')
d.get('https://th.jbl.com/bluetooth-portables')
last_height = d.execute_script("return document.body.scrollHeight")
# keep scrolling to the bottom until the page height stops growing
while True:
    d.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
    new_height = d.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# parse the fully loaded page the same way as before
_d = soup(d.page_source, 'lxml')
category = _d.find('h2', {'property':'name'}).text
subcategory = _d.find('span', {'class':'breadcrumb-last'}).text
products = [i for i in _d.find_all('div', {'class':'product-tile'})]
headers = [['a', {'class':'productname-link'}], ['div', {'class':'swatch-data'}]]
new_products = [[getattr(i, 'find', lambda *_:None)(*b) for b in headers] for i in products]
final_products = [[getattr(a, 'attrs', {}).get('title'), json.loads(getattr(b, 'text', '{}')).get('productID')] for a, b in new_products]
Output:
[['JBL Flip 4', u'JBLFLIP4BLKAM'], [None, None], ['JBL Flip 3', u'JBLFLIP3BLK'], ['JBL Flip 3 Special Edition', u'JBLFLIP3MALTA'], ['JBL Charge 3', u'JBLCHARGE3BLKAS'], ['JBL Charge 3 Special Edition', u'JBLCHARGE3MOSAICAS'], ['JBL Clip 2', u'JBLCLIP2BLK'], ['JBL Clip 2 Special Edition', u'JBLCLIP2MALTA'], ['JBL Pulse 2', u'JBLPULSE2BLKAS'], ['JBL GO', u'JBLGOBLK'], ['JBL Xtreme', u'JBLXTREMEBLKAS'], ['JBL Xtreme Special Edition', u'JBLXTREMESQUADAS'], ['Xtreme 2', None], ['CLIP 3', u'JBLCLIP3BLK'], [None, None], ['JBL GO 2', u'JBLGO2BLK'], ['Pulse 3', u'JBLPULSE3BLKJN']]
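The scroll loop can also be wrapped in a small helper with an upper bound on the number of rounds, so it cannot spin forever on a page that keeps growing. This is a sketch, not tested against the real site: `scroll_to_bottom`, `pause` and `max_rounds` are names I made up, and `FakeDriver` is only a stand-in that lets you try the helper without a browser. It relies on nothing more than the driver having `execute_script`, as the selenium driver above does.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=50):
    """Scroll until the page height stops changing; return the rounds used."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load the next batch
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Stand-in for a browser: reports a scripted sequence of page heights."""

    def __init__(self, heights):
        self.heights = list(heights)
        self.calls = 0

    def execute_script(self, script):
        if script.startswith("return"):
            # once the sequence is exhausted, keep reporting the last height
            height = self.heights[min(self.calls, len(self.heights) - 1)]
            self.calls += 1
            return height
        return None


rounds_used = scroll_to_bottom(FakeDriver([100, 200, 300, 300]), pause=0)
```

With a real driver you would call `scroll_to_bottom(d)` right after `d.get(...)` and then parse `d.page_source` as in the edit above.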