Question

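I am new to web scraping and need some help.

I am using Python 3.6.1 to fetch product details from Paytm (an Indian e-commerce site).

I am scraping the following URL: https://paytm.com/shop/g/electronics/computers-accessories/computer-components/laptop-adapters?src=1&q=graphic%20card

Problem: the page lists 49 products, but I can only scrape 30 of them. I also tried a Paytm page that lists mobile phones, and I still get only 30, even though that page shows 128 phones.

My Python code: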

import csv
from urllib.request import urlopen

from bs4 import BeautifulSoup

my_url = 'https://paytm.com/shop/g/electronics/mobile-accessories/mobiles/smart-phones?src=1&q=mobile%20phones'
page = urlopen(my_url).read()
page_soup = BeautifulSoup(page, "html.parser")

# Each product card is a div with class "_2i1r".
containers = page_soup.find_all("div", {"class": "_2i1r"})
print(len(containers))  # reports 30 even though the page shows 128 phones

# The csv module quotes fields, so commas in a name or price do not break the file.
with open("paytm_mobiles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Product_Name", "Amount"])
    for container in containers:
        name = container.find("div", {"class": "_2apC"}).text
        price = container.find("span", {"class": "_1kMS"}).text
        print("Name :" + name)
        print("Price :" + price)
        writer.writerow([name.replace(".", " "), price])

Please help me get past this problem.

Answer 1

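You can do this with a plain GET request to the actual endpoint, which returns JSON. Note that I set the items_per_page parameter to 40. Normally it can be raised further, but for some odd reason anything above 40 makes it fall back to 30. Anyway, here is a sample: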

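import math

import requests

# Only this query string needs to change for a different search.
query = '/g/electronics/mobile-accessories/mobiles/smart-phones?q=mobile%20phones'
itemsPerPage = 40
currentPage = 1
totalCount = itemsPerPage  # placeholder; replaced by the real total after the first response

# Start at page 1 (assuming page_count is 1-based) and round the page count up
# so that the final, partially filled page is fetched as well.
while currentPage <= math.ceil(totalCount / itemsPerPage):
    url = ('https://catalog.paytm.com/v1' + query + '&channel=web&page_count='
           + str(currentPage) + '&items_per_page=' + str(itemsPerPage))
    resultsPage = requests.get(url).json()
    totalCount = resultsPage['totalCount']
    for gridResult in resultsPage['grid_layout']:
        title = gridResult['name']
        price = gridResult['actual_price']
        print("Product Name: " + title + '\nPrice: ' + str(price))
        print('\n')
    currentPage += 1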

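The only part you need to change for a different search is the query string; the rest of the URL stays the same. The loop works out how many pages to fetch on its own, because each JSON response includes a totalCount field.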
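Since the question also writes a CSV file, here is a minimal sketch of how the paging loop and the CSV step might be combined. It assumes the response has the totalCount, grid_layout, name and actual_price fields used above, which I have not verified against the live endpoint. The function names (fetch_page, collect_products, write_csv) and the max_pages guard are made up for illustration, and urllib from the standard library stands in for requests so the block has no third-party dependencies.

```python
import csv
import json
from urllib.request import urlopen

BASE_URL = 'https://catalog.paytm.com/v1'  # endpoint taken from the answer above
ITEMS_PER_PAGE = 40


def fetch_page(query, page):
    """Fetch one page of results as a dict (performs a network request)."""
    url = (BASE_URL + query + '&channel=web&page_count=' + str(page)
           + '&items_per_page=' + str(ITEMS_PER_PAGE))
    with urlopen(url) as response:
        return json.loads(response.read().decode('utf-8'))


def collect_products(query, fetch=fetch_page, max_pages=50):
    """Return (name, price) tuples from successive pages.

    Stops when a page comes back empty, when the number of collected items
    reaches totalCount, or after max_pages requests (a hypothetical safety cap).
    """
    products = []
    for page in range(1, max_pages + 1):
        results = fetch(query, page)
        items = results.get('grid_layout', [])
        if not items:
            break
        products.extend((item['name'], item['actual_price']) for item in items)
        if len(products) >= results['totalCount']:
            break
    return products


def write_csv(products, path):
    """Write the collected rows with a header; csv handles quoting of commas."""
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Product_Name', 'Amount'])
        writer.writerows(products)


# Example usage (makes real network requests, so it is left commented out):
# query = '/g/electronics/mobile-accessories/mobiles/smart-phones?q=mobile%20phones'
# write_csv(collect_products(query), 'paytm_mobiles.csv')
```

Passing the fetch function as a parameter keeps the paging logic testable without touching the network, and stopping on the collected count rather than a fixed page size still works if the server returns fewer items per page than requested.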

Answer 2

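On this page, only 30 products are included in the initial load. As you scroll down, more products are appended through AJAX calls. So with BeautifulSoup alone you only get 30 products.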
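One way to check this for yourself is to count the product containers in the raw HTML the server sends, before any JavaScript runs. The sketch below does that with the standard library's html.parser instead of BeautifulSoup. The class name _2i1r comes from the question; ContainerCounter, count_containers and the sample HTML string are made up for illustration.

```python
from html.parser import HTMLParser


class ContainerCounter(HTMLParser):
    """Count <div> tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        # The class attribute may be missing or empty, hence the fallback.
        classes = (dict(attrs).get('class') or '').split()
        if self.class_name in classes:
            self.count += 1


def count_containers(html, class_name='_2i1r'):
    """Return how many product containers appear in a static HTML string."""
    parser = ContainerCounter(class_name)
    parser.feed(html)
    return parser.count


# Made-up stand-in for the HTML returned by the initial page request.
sample_html = (
    '<div class="_2i1r"><div class="_2apC">Phone A</div></div>'
    '<div class="_2i1r"><div class="_2apC">Phone B</div></div>'
)
print(count_containers(sample_html))
```

If the count from the downloaded page is lower than what the browser shows after scrolling, the missing items were added later by script, and a static parser never sees them.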

How to overcome the limit when scraping data from a website
