Trying to scrape a web page with BeautifulSoup4, but it just stops partway through

Asked: 2018-07-10 19:29:54

Tags: python web-scraping

I'm a newbie with BeautifulSoup in Python, though I'm not actually new to Python itself. As a self-teaching exercise, I'm trying to scrape this page (https://th.jbl.com/bluetooth-portables). I want to grab the SKU information of every product: specifically the category, subcategory, name, and productID. I've managed to scrape the category, the subcategory, and most of the product names with their productIDs, but as you can see, the product listing on this page is inconsistent, and I keep getting stuck on one product: Xtreme 2, which has no productID. At that point the scrape essentially stops, even though there are still 3 tiles left to scrape.

I've tried my best but can't get past it. As things stand, only the name of Xtreme 2 gets printed, and no information for it is written to the file.

My approach is to scrape all the tiles that contain a product and ignore the rest. Here is the code:

    from bs4 import BeautifulSoup
    import requests, csv, codecs, json

    jbl_file = codecs.open('JBL_TH.csv','w',encoding='utf_8_sig')
    csv_writer = csv.writer(jbl_file)
    csv_writer.writerow(['Category','Subcategory','Product','Prod_ID'])

    url = 'https://th.jbl.com/bluetooth-portables'
    source = requests.get(url).content
    soup = BeautifulSoup(source,'lxml')
    home = soup.find('div',class_='breadcrumb clearfix')

    # find the category and subcategory
    category = home.find('h2',property='name').text
    subcategory = home.find('span',class_='breadcrumb-last',property='name').text


    # define the scope of the product container
    container = soup.find('div', id='search-result-items')
    # each product is in a tile, so find all the tiles
    for tile in container.find_all('div', class_='product-tile'):
        print(category)
        print(subcategory)

        # a tile may not contain a product, so skip such tiles via try ... except
        try:
            # find out the name of the product in each tile
            name = tile.find('a',class_='productname-link')['title']
            print(name)

            # even if a product exists in the tile, its SKU ID may be missing from
            # the product-swatches class, hence another try ... except to skip those
            try:
                # SKU ID is in the product-swatches class
                directory = tile.find('div',class_='product-swatches')
                # each product may contain multiple SKU IDs, one per swatch-data div,
                # hence another for loop
                for colour in directory.find_all('div', class_='swatch-data'):
                    product_id = json.loads(colour.text)['productID']
                    print(product_id)
                    # write the found category, subcategory, name, and product_id to the csv file
                    csv_writer.writerow([category,subcategory,name,product_id])
            except:
                name = 'dummy'
                product_id = 'dummy'
                print(name)
                print(product_id)
                # if the SKU ID is not found, write 'dummy' placeholders instead
                csv_writer.writerow([category,subcategory,name,product_id])

        except:
            name = 'dummy'
            product_id = 'dummy'
            print(name)
            print(product_id)
            # if the product name is not found, write 'dummy' placeholders instead
            csv_writer.writerow([category,subcategory,name,product_id])

    jbl_file.close()

I've also attached two screenshots showing exactly the information I'm trying to scrape. Can anyone help me?

[Screenshot 1: category and subcategory of the products] [Screenshot 2: product name and productID]

Thanks.

2 Answers:

Answer 0 (score: 0):

That actually is the entire page. It only contains a handful of products to begin with; more products are loaded with JavaScript as you scroll down the page in your browser.

Since BeautifulSoup can't execute JavaScript, you have to either use another tool such as Selenium, or try to mimic what the JavaScript does yourself.

Using the browser's developer network inspector (press F12 in Firefox), I can see that after scrolling down, the browser issues a request to the URL https://th.jbl.com/bluetooth-portables?prefn1=isAvailabilityforLocale&prefn2=isRefurbished&prefv3=false&sz=12&start=12&format=page-element&prefv1=yes&prefv2=false&prefn3=isSupport.

As you can see, it uses the start=12 parameter to say that it wants results starting from position 12. It looks like you can manipulate that parameter to fetch the data you need:

    n = 0
    while True:
        n = n + 12
        new_url = 'https://th.jbl.com/bluetooth-portables?sz=12&start={}'.format(n)
        # fetch new_url and repeat parsing...
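
A minimal sketch of how that loop might be completed, reusing the tile parsing from the question. Two assumptions here: that the simplified URL (without the prefn/format parameters) behaves like the full one captured in the inspector, and that an out-of-range start returns a page with no product tiles, which is what the loop uses as its stop condition:

    from bs4 import BeautifulSoup
    import requests

    base_url = 'https://th.jbl.com/bluetooth-portables?sz=12&start={}'
    tiles = []
    start = 0
    while True:
        source = requests.get(base_url.format(start)).content
        page = BeautifulSoup(source, 'lxml')
        found = page.find_all('div', class_='product-tile')
        # assumed stop condition: a start past the end of the catalogue yields no tiles
        if not found:
            break
        tiles.extend(found)
        start += 12

    # 'tiles' now holds every product tile; extract the name and productID
    # from each one exactly as in the question's inner loops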

Answer 1 (score: -1):

You can use getattr:

    import requests, json
    from bs4 import BeautifulSoup as soup

    d = soup(requests.get('https://th.jbl.com/bluetooth-portables').text, 'html.parser')
    category = d.find('h2', {'property': 'name'}).text
    subcategory = d.find('span', {'class': 'breadcrumb-last'}).text
    products = d.find_all('div', {'class': 'product-tile'})
    headers = [['a', {'class': 'productname-link'}], ['div', {'class': 'swatch-data'}]]
    # look up the name link and the swatch data in each tile; the fallback
    # lambda stands in for .find and returns None for anything without one
    new_products = [[getattr(i, 'find', lambda *_: None)(*b) for b in headers] for i in products]
    # where .find returned None above, fall back to an empty dict / empty JSON
    # string so that .get simply yields None instead of raising AttributeError
    final_products = [[getattr(a, 'attrs', {'title': None}).get('title'),
                       json.loads(getattr(b, 'text', '{}')).get('productID')]
                      for a, b in new_products]

Output:

    Wireless
    Portables
    [['JBL Flip 4', 'JBLFLIP4BLKAM'], [None, None], ['JBL Flip 3', 'JBLFLIP3BLK'], ['JBL Flip 3 Special Edition', 'JBLFLIP3MALTA'], ['JBL Charge 3', 'JBLCHARGE3BLKAS'], ['JBL Charge 3 Special Edition', 'JBLCHARGE3MOSAICAS'], ['JBL Clip 2', 'JBLCLIP2BLK'], ['JBL Clip 2 Special Edition', 'JBLCLIP2MALTA'], ['JBL Pulse 2', 'JBLPULSE2BLKAS'], ['JBL GO', 'JBLGOBLK'], ['JBL Xtreme', 'JBLXTREMEBLKAS'], ['JBL Xtreme Special Edition', 'JBLXTREMESQUADAS'], ['Xtreme 2', None]]
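
The getattr fallbacks are what let the Xtreme 2 tile (a name but no productID) and the empty tiles come through as None instead of killing the loop with an AttributeError. A stripped-down illustration of the pattern, using a deliberately missing lookup:

    import json

    missing = None  # what tile.find() returns when nothing matches
    # None has no .attrs, so getattr falls back to a dict whose .get yields None
    title = getattr(missing, 'attrs', {'title': None}).get('title')
    # None has no .text either, so fall back to an empty JSON object string
    product_id = json.loads(getattr(missing, 'text', '{}')).get('productID')
    print(title, product_id)  # None None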

To write to a csv file, you can use the following:

    import csv

    with open('product_listings.csv', 'w', newline='') as f:
        write = csv.writer(f)
        # header row, a category/subcategory row, then one row per product;
        # tiles with neither a name nor an id are filtered out by any(b)
        write.writerows([['category', 'subcategory', 'title', 'id'], [category, subcategory] + [''] * 2]
                        + [[''] * 2 + ['' if not i else i for i in b] for b in final_products if any(b)])

Output:

    category,subcategory,title,id
    Wireless,Portables,,
    ,,JBL Flip 4,JBLFLIP4BLKAM
    ,,JBL Flip 3,JBLFLIP3BLK
    ,,JBL Flip 3 Special Edition,JBLFLIP3MALTA
    ,,JBL Charge 3,JBLCHARGE3BLKAS
    ,,JBL Charge 3 Special Edition,JBLCHARGE3MOSAICAS
    ,,JBL Clip 2,JBLCLIP2BLK
    ,,JBL Clip 2 Special Edition,JBLCLIP2MALTA
    ,,JBL Pulse 2,JBLPULSE2BLKAS
    ,,JBL GO,JBLGOBLK
    ,,JBL Xtreme,JBLXTREMEBLKAS
    ,,JBL Xtreme Special Edition,JBLXTREMESQUADAS
    ,,Xtreme 2,

Edit: using Selenium

    from selenium import webdriver
    from bs4 import BeautifulSoup as soup
    import json, time

    d = webdriver.Chrome('/path/to/driver')
    d.get('https://th.jbl.com/bluetooth-portables')

    # keep scrolling to the bottom until the page height stops growing,
    # i.e. until the lazy loader has run out of products to append
    last_height = d.execute_script("return document.body.scrollHeight")
    while True:
        d.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)  # give the page a moment to load the next batch
        new_height = d.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

    # parse the fully loaded page exactly as before
    _d = soup(d.page_source, 'lxml')
    category = _d.find('h2', {'property': 'name'}).text
    subcategory = _d.find('span', {'class': 'breadcrumb-last'}).text
    products = _d.find_all('div', {'class': 'product-tile'})
    headers = [['a', {'class': 'productname-link'}], ['div', {'class': 'swatch-data'}]]
    new_products = [[getattr(i, 'find', lambda *_: None)(*b) for b in headers] for i in products]
    final_products = [[getattr(a, 'attrs', {'title': None}).get('title'),
                       json.loads(getattr(b, 'text', '{}')).get('productID')]
                      for a, b in new_products]

Output:

    [['JBL Flip 4', u'JBLFLIP4BLKAM'], [None, None], ['JBL Flip 3', u'JBLFLIP3BLK'], ['JBL Flip 3 Special Edition', u'JBLFLIP3MALTA'], ['JBL Charge 3', u'JBLCHARGE3BLKAS'], ['JBL Charge 3 Special Edition', u'JBLCHARGE3MOSAICAS'], ['JBL Clip 2', u'JBLCLIP2BLK'], ['JBL Clip 2 Special Edition', u'JBLCLIP2MALTA'], ['JBL Pulse 2', u'JBLPULSE2BLKAS'], ['JBL GO', u'JBLGOBLK'], ['JBL Xtreme', u'JBLXTREMEBLKAS'], ['JBL Xtreme Special Edition', u'JBLXTREMESQUADAS'], ['Xtreme 2', None], ['CLIP 3', u'JBLCLIP3BLK'], [None, None], ['JBL GO 2', u'JBLGO2BLK'], ['Pulse 3', u'JBLPULSE3BLKJN']]
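
If you'd rather not have a browser window pop up while the page scrolls, Chrome can also be run headless. A small variation on the setup above; the driver path is still a placeholder, and depending on your Selenium version the keyword may be chrome_options rather than options:

    from selenium import webdriver

    opts = webdriver.ChromeOptions()
    opts.add_argument('--headless')  # run Chrome without a visible window
    d = webdriver.Chrome('/path/to/driver', options=opts)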