I identified the other pages with Chrome's inspector tool. The request type is XHR, and the pages are told apart by the start number in the query string: "https://us.pandora.net/en/charms/?sz=30&start=30&format=page-element" is the first page, "https://us.pandora.net/en/charms/?sz=30&start=60&format=page-element" is the second page, "https://us.pandora.net/en/charms/?sz=30&start=90&format=page-element" is the third page, and so on.
It keeps going until start=990.
Here is my code so far:
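Since the URLs follow a simple pattern, they can be generated instead of typed out; a minimal sketch (endpoint and parameters taken from the URLs above):

```python
# start increases by 30 per page, from 30 up to 990.
base = "https://us.pandora.net/en/charms/?sz=30&start={}&format=page-element"
urls = [base.format(start) for start in range(30, 991, 30)]
print(len(urls))   # 33 page URLs
print(urls[0])
```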
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://us.pandora.net/en/charms/?sz=30&start=60&format=page-element"
html = urlopen(url)

page_count = 0
while page_count < 0:
    url = "https://us.pandora.net/en/charms/?sz=30&start=%d&format=page-element" % (page_count)
    page_count += 30
    html = urlopen(url)
My goal is to get every product that is on sale. Reading the source with the inspector tool, I found that items carry two price classes: "price-sale" and "price-standard".
Here I am trying to get all the products, use the code above to work around the infinite scroll, and end up with a list of every product that is on sale.
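As a minimal illustration of telling the two cases apart (the HTML fragment below is made up; only the class names come from the page source), BeautifulSoup can check for the sale class per tile:

```python
from bs4 import BeautifulSoup

# Made-up markup: one tile on sale (both classes) and one regular tile.
html = """
<li class="grid-tile">
  <span class="price-sale">$35.00</span><span class="price-standard">$55.00</span>
</li>
<li class="grid-tile">
  <span class="price-standard">$55.00</span>
</li>
"""
soup = BeautifulSoup(html, "html.parser")
for tile in soup.find_all("li", class_="grid-tile"):
    if tile.find(class_="price-sale"):
        print("on sale:", tile.find(class_="price-sale").get_text())
    else:
        print("regular:", tile.find(class_="price-standard").get_text())
```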
def retrieve_products_sale():
    all_products = soup.find_all('li', class_='grid-tile')
    num_of_prods = []
    for items in all_products:
        if items == class_'price-standard':
            num_of_prods.append(items)
    print(num_of_prods)

if __name__ == '__main__':
    retrieve_products_sale()
Not sure how to continue from here.
Let me add: my end goal is to scrape every product on sale from the list, both how many there are and which ones they are.
Answer 0 (score: 0)
Maybe something like this:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def retrieve_products_sale(soup):
    all_products = soup.find_all('li', class_='grid-tile')
    num_of_prods = []
    for items in all_products:
        if items == class_'price-standard':
            num_of_prods.append(items)
    print(num_of_prods)

if __name__ == '__main__':
    page_count = 0
    while page_count <= 990:
        url = "https://us.pandora.net/en/charms/?sz=30&start=%d&format=page-element" % page_count
        html = urlopen(url)
        soup = BeautifulSoup(html, "html.parser")
        retrieve_products_sale(soup)
        page_count += 30
If you need all the data in one list, use a list outside the function:
from urllib.request import urlopen
from bs4 import BeautifulSoup

def retrieve_products_sale(soup):
    all_products = soup.find_all('li', class_='grid-tile')
    num_of_prods = []
    for items in all_products:
        if items == class_'price-standard':
            num_of_prods.append(items)
    #print(num_of_prods)
    return num_of_prods

if __name__ == '__main__':
    page_count = 0
    all_results = []
    while page_count <= 990:
        url = "https://us.pandora.net/en/charms/?sz=30&start=%d&format=page-element" % page_count
        html = urlopen(url)
        soup = BeautifulSoup(html, "html.parser")
        all_results += retrieve_products_sale(soup)
        page_count += 30
    print(all_results)
Edit: I didn't understand what you were trying to do with
    if items == class_'price-standard':
so I used
    for x in items.find_all(class_='price-standard'):
instead. This gives some results (though not on every page):
from urllib.request import urlopen
from bs4 import BeautifulSoup

def retrieve_products_sale(soup):
    all_products = soup.find_all('li', class_='grid-tile')
    num_of_prods = []
    for items in all_products:
        for x in items.find_all(class_='price-standard'):
            #print(x)
            num_of_prods.append(x)
    print(num_of_prods)

if __name__ == '__main__':
    page_count = 0
    while page_count <= 990:
        url = "https://us.pandora.net/en/charms/?sz=30&start=%d&format=page-element" % page_count
        html = urlopen(url)
        soup = BeautifulSoup(html, "html.parser")
        retrieve_products_sale(soup)
        page_count += 30
Answer 1 (score: 0)
You can put the while loop inside the function and use .select() instead of find_all(), which avoids defining an extra loop to filter out the items you want.
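The difference can be seen on a small static fragment (the HTML below is made up; only the class names come from the page):

```python
from bs4 import BeautifulSoup

html = '<li class="grid-tile"><span class="price-standard">$55</span></li>'
soup = BeautifulSoup(html, "html.parser")

# find_all needs two steps: the tiles first, then the price tag inside each.
two_step = [p for li in soup.find_all("li", class_="grid-tile")
            for p in li.find_all(class_="price-standard")]

# The descendant CSS selector does the same in one call.
one_step = soup.select(".grid-tile .price-standard")

print(two_step == one_step)
```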
import requests
from bs4 import BeautifulSoup

url = "https://us.pandora.net/en/charms/?sz=30&start={}&format=page-element"

def fetch_items(link, page):
    while page <= 100:
        print("current page no: ", page)
        res = requests.get(link.format(page), headers={"User-Agent": "Mozilla/5.0"})
        soup = BeautifulSoup(res.text, "lxml")
        for items in soup.select('.grid-tile .price-standard'):
            product_list.append(items)
        print(product_list)
        page += 30

if __name__ == '__main__':
    page = 0
    product_list = []
    fetch_items(url, page)
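Whichever version is used, product_list ends up holding Tag objects; to get the actual price text and a count out of them, something like this works, shown here on a made-up fragment instead of a live response:

```python
from bs4 import BeautifulSoup

# Stand-in for one fetched page; the real soup would come from res.text.
html = ('<li class="grid-tile"><span class="price-standard">$55.00</span></li>'
        '<li class="grid-tile"><span class="price-standard">$40.00</span></li>')
soup = BeautifulSoup(html, "html.parser")
product_list = soup.select(".grid-tile .price-standard")

# Extract the visible text from each collected tag.
prices = [tag.get_text(strip=True) for tag in product_list]
print(len(prices), prices)
```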