Scraping product categories with Python

Date: 2019-05-21 11:39:08

Tags: python web-scraping beautifulsoup scrapy

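I am trying to scrape this page, which has about 21,000 products.

My question is how to get the product name, image, and full category hierarchy for all 21,000 products. The image and name are on the listing page, but the category is only available on each product's own page.

Because of pagination, I can only get the 32 product titles and images shown on the first page.

Code to get the titles from the first page: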

import requests
from bs4 import BeautifulSoup

main_url = "https://paytmmall.com/fmcg-foods-glpid-101405?discoverability=online&use_mw=1"

# Fetch the first listing page
result = requests.get(main_url)
print(result.text)

sp = BeautifulSoup(result.text, 'html.parser')
print(sp.prettify())

# Each product tile is a div with class "_3WhJ" whose link carries the title
getallTitle = [x.a.get('title') for x in sp.find_all("div", class_="_3WhJ")]

print(str(len(getallTitle)) + " fetched products Title")
print("\n")
print(getallTitle[2])

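3 answers:

Answer 0 (score: 2)

The page requests its first page of content as shown below (the response is JSON). See whether you can change the parameters to get all the results.

It looks like you can change the current page in the Referer header and in the body by changing the URL to include, for example, the page: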

https://paytmmall.com/fmcg-foods-glpid-101405?discoverability=online&use_mw=1&page=2

You can extract the total result count from the first response:

r['filters'][0]['values'][0]['count']

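You know the results come in batches of 32 (though try increasing that to the maximum possible value). You can then calculate the number of pages/requests and loop over them.

Python (page 1 request)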

import requests

headers = {
    'Content-Type' : 'application/json',
    'Referer' : 'https://paytmmall.com/fmcg-foods-glpid-101405?discoverability=online&use_mw=1',
    'User-Agent' : 'Mozilla/5.0'
}

body = {"tracking":{"current_page":"https://paytmmall.com/fmcg-foods-glpid-101405?discoverability=online&use_mw=1","prev_page":''},"context":{"device":{"os":"Win32","device_type":"PC","browser_uuid":"GA1.2.105449259.1558439396","ua":"Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36","connection_type":"Unknown"},"channel":"WEB","user":{"ga_id":"GA1.2.105449259.1558439396","user_id":''}}}

r = requests.post('https://middleware.paytmmall.com/fmcg-foods-glpid-101405?channel=web&child_site_id=6&site_id=2&version=2&discoverability=online&use_mw=1&items_per_page=32', json = body, headers = headers).json()
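
The paging idea above could be sketched as follows. This is a minimal sketch that has not been checked against the live site: the helper names `request_count` and `page_request_parts` are hypothetical, and it assumes, as suggested above, that adding `&page=N` to the URL used in the Referer header and in `tracking.current_page` is what selects the page.

```python
import copy
import math

PAGE_URL = "https://paytmmall.com/fmcg-foods-glpid-101405?discoverability=online&use_mw=1"


def request_count(total_results, items_per_page=32):
    """Number of requests needed to cover total_results items."""
    return math.ceil(total_results / items_per_page)


def page_request_parts(page, base_headers, base_body):
    """Return (headers, body) copies that point at the given page.

    Assumption: the page is selected by adding &page=N to the URL used in
    the Referer header and in body["tracking"]["current_page"].
    """
    page_url = PAGE_URL if page == 1 else f"{PAGE_URL}&page={page}"
    headers = dict(base_headers, Referer=page_url)
    body = copy.deepcopy(base_body)  # leave the caller's body untouched
    body["tracking"]["current_page"] = page_url
    return headers, body
```

With the `headers`, `body` and `r` from the request above, `request_count(r['filters'][0]['values'][0]['count'])` would give the number of iterations, and each iteration would post the pair returned by `page_request_parts(page, headers, body)`.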

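Answer 1 (score: 2)

You can access the JSON response for each page. Keep in mind, though, that there are only 32 products per page, which means you will make 659 requests.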

import requests
import math

url = 'https://middleware.paytmmall.com/fmcg-foods-glpid-101405'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}

payload = {
'channel': 'web',
'child_site_id': '6',
'site_id': '2',
'version': '2',
'discoverability': 'online',
'use_mw': '1',
'category': '101405',
'page': '1',
'page_count': '1',
'items_per_page': '32'}

# Get total pages needed
jsonData = requests.post(url, headers=headers, data=payload).json()
total_count = jsonData['totalCount']
total_pages = total_count / 32
pages = math.ceil(total_pages)


# Iterate through each page
for page in range(1,pages + 1):
    payload.update({'page':page, 'page_count':page})

    jsonData = requests.post(url, headers=headers, data=payload).json()

    for product in jsonData['grid_layout']:
        name = product['name']
        brand = product['brand']
        actual_price = product['actual_price']
        try:
            category = product['attributes']['type']
        except (KeyError, TypeError):
            # Some products have no usable 'type' attribute
            category = 'N/A'

        print('%-20s ₹%-5s %-20s %s' % (category, actual_price, brand, name))

Output:

Tea                  ₹185   Red Label            Red Label Tea 500 gm
Tea                  ₹93    Tata Tea Premium     Tata Tea Premium Leaf 250 gm
Tea                  ₹240   Red Label            Red Label Natural Care Tea 500 gm
N/A                  ₹230   Taj Mahal            Taj Mahal Tea 500 gm
Tea                  ₹120   Red Label            Red Label Natural Care Tea 250 gm
Dairy Whitener       ₹413   Nestle               Nestle Everyday Dairy Whitener Milk 1 kg
Sauces               ₹125   Kissan               Kissan Fresh Tomato Ketchup 950 gm
Whole Oats           ₹186   Quaker               Quaker Oats 1 kg Pouch
Tea                  ₹188   Tata Tea Premium     Tata Tea Premium Leaf 500 gm
Coffee               ₹90    Bru                  BRU Instant Coffee 50 gm
Almond               ₹300   Freshco              Freshco California Almonds 200Gm
Jam                  ₹250   Kissan               Kissan Mixed Fruit Jam 1.04 kg
Almond               ₹799   glomin               Glomin California Almond Raw 500 G 1Pc
Sauces               ₹152   Kissan               Kissan Sweet & Spicy Sauce 1 kg
Cashew Nut           ₹180   Nutty Gritties       Nutty Gritties Roasted Salted Cashews 80G
Coffee               ₹120   Bru                  BRU Gold Instant Coffee 50 gm
Tea                  ₹480   Red Label            Red Label Natural Care Tea 1 kg
Almond               ₹310   Miltop               Miltop California Almonds 250G
Cashew Nut           ₹425   glomin               Glomin Cashew 250 G 1Pc
Almond               ₹600   Wonderland           Wonderland California Almond 500g
Almond               ₹499   Shivram Peshawari & Bros Shivram Peshawari & Bros California Almonds/Badam 250 Grams
Peanut Butter        ₹425   Pintola              Pintola All Natural Peanut Butter 1 kg (Crunchy)
Soups                ₹55    Knorr                Knorr Classic Tomato Soup 53 gm
Peanut Butter        ₹425   Pintola              Pintola All Natural Peanut Butter 1 kg (Creamy)
Peanut Butter        ₹349   Pintola              Pintola Classic Peanut Butter 1 kg (Crcuncy)
Peanut Butter        ₹165   Pintola              Pintola All Natural Peanut Butter 350 gm (Crunchy)
Almond               ₹1599  glomin               Glomin Raw Almonds 1Kg (Pack Of 1)
Almond               ₹150   Nutty Gritties       Nutty Gritties Almonds 100G
Raisin               ₹250   OOSH                 Oosh Seedless Black Raisin 250G
N/A                  ₹455   Taj Mahal            Taj Mahal Tea 1 kg

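Edit:

If you need the hierarchy, you have to go to each product's link and pull it out. I've provided code that does this, but keep in mind it will take FOREVER. Assuming each request takes about 2-3 seconds, roughly 21,000 product requests will take nearly 18 hours.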

# Iterate through each page
for page in range(1,pages + 1):
    payload.update({'page':page, 'page_count':page})

    jsonData = requests.post(url, headers=headers, data=payload).json()

    for product in jsonData['grid_layout']:
        name = product['name']
        brand = product['brand']
        actual_price = product['actual_price']
        img = product['image_url']
        category_id = product['category_id']

        new_url = product['newurl']

        jsonData_product = requests.get(new_url, headers=headers).json()

        category = '/'.join( [each['name'] for each in jsonData_product['ancestors'] ] )

        print ('Name: %s\nImage: %s\nCategory: %s\n' %(name, img, category))

Output:

Name: Red Label Tea 500 gm
Image: https://assetscdn1.paytm.com/images/catalog/product/F/FA/FASRED-LABEL-TETBL497475164B959/a_4.jpg
Category: Supermarket/Foods/Drinks & Beverages/Tea & Coffee/Red Label Tea 500 gm

Name: Tata Tea Premium Leaf 250 gm
Image: https://assetscdn1.paytm.com/images/catalog/product/F/FA/FASTATA-TEA-PREINNO985832A1E145F5/8.jpg
Category: Supermarket/Foods/Drinks & Beverages/Tea & Coffee/Tata Tea Premium Leaf 250 gm

Name: Red Label Natural Care Tea 500 gm
Image: https://assetscdn1.paytm.com/images/catalog/product/F/FA/FASRLNC-C-500GNTBL4974726639099/a_14.jpg
Category: Supermarket/Foods/Drinks & Beverages/Tea & Coffee/Red Label Tea & Coffee 500 Gm

Name: Taj Mahal Tea 500 gm
Image: https://assetscdn1.paytm.com/images/catalog/product/F/FA/FASTAJ-MAHAL-TEBIGB985832F0512392/0.jpg
Category: Supermarket/Foods/Drinks & Beverages/Tea & Coffee/Taj Mahal Tea 500 gm

Name: Red Label Natural Care Tea 250 gm
Image: https://assetscdn1.paytm.com/images/catalog/product/F/FA/FASNEW-RED-LABETBL49747FC4B364F/a_7.jpg
Category: Supermarket/Foods/Drinks & Beverages/Tea & Coffee/Red Label Natural Care Tea 250 gm

Name: Nestle Everyday Dairy Whitener Milk 1 kg
Image: https://assetscdn1.paytm.com/images/catalog/product/F/FA/FASNESTLE-EVERYTBL497478E1F2966/a_8.jpg
Category: Supermarket/Foods/Dairy Products/Dairy Whitener/Nestle Everyday Dairy Whitener Milk 1 kg

OR

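If all the products fall under the same category, you really only need to get the category of the first product and then apply it to all the others as you iterate through the pages: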

import requests
import math

url = 'https://middleware.paytmmall.com/fmcg-foods-glpid-101405'

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}

payload = {
'channel': 'web',
'child_site_id': '6',
'site_id': '2',
'version': '2',
'discoverability': 'online',
'use_mw': '1',
'category': '101405',
'page': '1',
'page_count': '1',
'items_per_page': '32'}

# Get total pages needed
jsonData = requests.post(url, headers=headers, data=payload).json()
total_count = jsonData['totalCount']
total_pages = total_count / 32
pages = math.ceil(total_pages)


# Iterate through each page
category = ''
for page in range(1,pages + 1):
    payload.update({'page':page, 'page_count':page})

    jsonData = requests.post(url, headers=headers, data=payload).json()

    for product in jsonData['grid_layout']:
        name = product['name']
        brand = product['brand']
        actual_price = product['actual_price']
        img = product['image_url']
        category_id = product['category_id']

        if category == '':
            new_url = product['newurl']
            jsonData_product = requests.get(new_url, headers=headers).json()
            category = '/'.join( [each['name'] for each in jsonData_product['ancestors'] ][:-1] )

        print ('Name: %s\nImage: %s\nCategory: %s\n' %(name, img, category))
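
A possible middle ground between the two versions above, in case the products span several categories, is to look up the hierarchy once per distinct `category_id` and reuse it. This is only a sketch under an assumption the answer does not confirm: that products sharing a `category_id` share the same ancestors, apart from the final entry (the product itself). `hierarchy_for` and `fetch_json` are hypothetical names; `fetch_json` stands for whatever performs the request, for example `lambda u: requests.get(u, headers=headers).json()`.

```python
def hierarchy_for(product, cache, fetch_json):
    """Return the category path for a product, fetching it at most once
    per category_id.

    Assumption: products with the same category_id share the same
    ancestors, except for the last entry (the product itself).
    """
    category_id = product['category_id']
    if category_id not in cache:
        product_json = fetch_json(product['newurl'])
        names = [each['name'] for each in product_json['ancestors']]
        cache[category_id] = '/'.join(names[:-1])  # drop the product name
    return cache[category_id]
```

Inside the product loop, `category = hierarchy_for(product, cache, fetch_json)` would take the place of the `if category == ''` block, with `cache = {}` created once before the loop.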

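Answer 2 (score: 1)

Here is how to handle the pagination. Pagination is nothing more than sending requests on demand instead of fetching everything at once. That means every time you click a page number, what you see changes according to how the site is designed. In your case, every time you click a page link, the URL query changes. The resulting URL is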

https://paytmmall.com/fmcg-foods-glpid-101405?discoverability=online&use_mw=1&category=101405&page=2

If you keep changing page=2 to whichever page you want to scrape, you can scrape the whole site.

Logic:

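main_url = "https://paytmmall.com/fmcg-foods-glpid-101405?discoverability=online&use_mw=1&category=101405&page="

# Placeholder: set this to the real number of pages
totalnumberofpages = 1

for i in range(1, totalnumberofpages + 1):
    url = main_url + str(i)
    print(url)  # replace with your logic to scrape one url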
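
The loop above needs the total number of pages. If that number is not known in advance, one option is to keep requesting pages until one comes back empty. The sketch below assumes that a page past the last one yields no products, which this answer does not confirm. `scrape_all` and `scrape_page` are hypothetical names: `scrape_page(url)` stands for the logic that scrapes one URL and returns a list of items, and `max_pages` is a safety cap.

```python
MAIN_URL = ("https://paytmmall.com/fmcg-foods-glpid-101405"
            "?discoverability=online&use_mw=1&category=101405&page=")


def scrape_all(scrape_page, max_pages=1000):
    """Collect items page by page until a page returns nothing.

    Assumption: requesting a page beyond the last one yields an empty
    result rather than repeating earlier products.
    """
    items = []
    for page in range(1, max_pages + 1):
        page_items = scrape_page(MAIN_URL + str(page))
        if not page_items:
            break  # no products on this page, so stop
        items.extend(page_items)
    return items
```

Passing the scraping logic in as a function keeps the paging loop separate from the parsing, so either can be changed without touching the other.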