I'm writing code to extract all the products from a given URL. It works fine, but some URLs contain many pages, so I'm trying to collect all the next pages by finding the ul that holds the page URLs. The problem is that it only shows the first 3 pages and the last page.

The pagination ul:
<ul class="plp-pagination__wrapper">
<li class="plp-pagination__nav disable">
<a href="" rel="prev" class="plp-pagination__navpre">
previous </a>
</li>
<li class="plp-pagination__nav active"><a class="plp-pagination__navpages" href="javascript:void(0);">1</a></li>
<li class="plp-pagination__nav"><a class="plp-pagination__navpages" href="here is the page url ">2</a></li>
<li class="plp-pagination__nav"><a class="plp-pagination__navpages" href="here is the page url">3</a></li>
<li class="plp-pagination__nav"><a class="plp-pagination__navpages" href="here is the page url">4</a></li>
<li class="plp-pagination__nav"><a class="plp-pagination__navpages" href="here is the page url">5</a></li>
<li class="plp-pagination__nav"> <span class="plp-pagination__navplaceholder"></span></li>
<li class="plp-pagination__nav"><a class="plp-pagination__navpages" href="here is the page url">54</a></li>
<li class="plp-pagination__nav">
<a class="plp-pagination__navnext" href="here is the page url" rel="next">
next</a>
</li>
</ul>
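For reference, the markup itself shows why scraping this ul can never yield every page: it only contains anchors for the first few pages and the last one, with a placeholder span in between. Note also that the plp-pagination__navpages class sits on the a tags, not the li tags. A minimal sketch of pulling out what is actually there (the html variable stands in for the snippet above):

    from bs4 import BeautifulSoup

    html = "...the pagination markup above..."
    soup = BeautifulSoup(html, "html.parser")

    # The page-number class is on the <a> elements, not the <li> elements.
    for a in soup.select("a.plp-pagination__navpages"):
        print(a.get_text(strip=True), a.get("href"))
    # Prints entries for pages 1-5 and 54 only; the pages in between
    # are represented by an empty placeholder <span>.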
The read function:
import pandas as pd
import requests
from bs4 import BeautifulSoup

def update():
    df = pd.DataFrame(columns=['poduct_name', 'image_url', 'price'])
    # list of required pages
    urls = ['1st page', '2nd page', '3rd page']
    for url in urls:
        page = requests.get(url)
        soup = BeautifulSoup(page.text)
        # get the list of pages in the pagination ul
        new_pages = soup.find('ul', attrs={'class': 'plp-pagination__wrapper'})
        # check if there is a pagination ul
        if new_pages is not None:
            new_urls = new_pages.find_all('li', attrs={'class': 'plp-pagination__navpages'})
            for x in new_urls:
                urls.append(x)
        product_div = soup.find_all('div', attrs={'class': 'comp-productcard__wrap'})
        product_list = []
        for x in product_div:
            poduct_name = x.find('p', attrs={'class': 'comp-productcard__name'}).text.strip()
            product_price_p = x.find('p', attrs={'class': 'comp-productcard__price'}).text
            product_img = x.img['src']
            product_list.append({'poduct_name': poduct_name, 'image_url': product_img, 'price': product_price})
        df = df.append(pd.DataFrame(product_list))
    return df
Answer 0 (score: 0)
From the looks of it, the site in question is Carrefour. Here is roughly what I would do (pseudocode).

Request the first page. Once you have it, grab the anchor with class plp-pagination__navnext and use its href as the next URL to request. You don't start out with a list of all the page URLs; after requesting each page, you scrape the next page's URL and request that.

Pseudocode:
1. Load the first page.
2. Scrape whatever you're looking to scrape.
3. Get the href of the next-page element via the selector 'a.plp-pagination__navnext'.
4. Load the next page (its URL is the href you just acquired).
5. Repeat from step 2.

Stop when you reach the last page, i.e. when the next-page element's href is '' on Carrefour.
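A minimal sketch of that loop, assuming the class name shown in the question's markup (plp-pagination__navnext) and using urljoin in case the hrefs are relative; the product-scraping step is left as a stub:

    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def scrape_all_pages(first_url):
        url = first_url
        while url:
            page = requests.get(url)
            soup = BeautifulSoup(page.text, "html.parser")
            # ... scrape the product cards on this page ...
            # Follow the "next" anchor; stop when it is missing or its href is empty.
            next_link = soup.select_one("a.plp-pagination__navnext")
            href = next_link.get("href", "") if next_link else ""
            url = urljoin(url, href) if href else None

This way there is no URL list to maintain at all; each page tells you where the next one is.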
Answer 1 (score: 0)
You can get around this problem by adding the following script:
urls = []
home_page = requests.get("https://www.carrefourksa.com/mafsau/en/food-beverages/c/FKSA1000000?&qsort=relevance&pg")
home_soup = BeautifulSoup(home_page.content, "lxml")
page_nmb_find = home_soup.findAll("a", {"class": "plp-pagination__navpages"})
last_page = int(page_nmb_find[-1].getText())
for nmb in range(0, last_page):
    urls.append(f"https://www.carrefourksa.com/mafsau/en/food-beverages/c/FKSA1000000?&qsort=relevance&pg={nmb}")
With that added, the full code should look like this:
import pandas as pd
import requests
from bs4 import BeautifulSoup

def update():
    df = pd.DataFrame(columns=['poduct_name', 'image_url', 'price'])
    # list of required pages
    urls = []
    home_page = requests.get("https://www.carrefourksa.com/mafsau/en/food-beverages/c/FKSA1000000?&qsort=relevance&pg")
    home_soup = BeautifulSoup(home_page.content, "lxml")
    page_nmb_find = home_soup.findAll("a", {"class": "plp-pagination__navpages"})
    last_page = int(page_nmb_find[-1].getText())
    for nmb in range(0, last_page):
        urls.append(f"https://www.carrefourksa.com/mafsau/en/food-beverages/c/FKSA1000000?&qsort=relevance&pg={nmb}")
    for url in urls:
        page = requests.get(url)
        soup = BeautifulSoup(page.text, "lxml")
        # get the list of pages in the pagination ul
        new_pages = soup.find('ul', attrs={'class': 'plp-pagination__wrapper'})
        # check if there is a pagination ul
        if new_pages is not None:
            new_urls = new_pages.find_all('li', attrs={'class': 'plp-pagination__navpages'})
            for x in new_urls:
                urls.append(x)
        product_div = soup.find_all('div', attrs={'class': 'comp-productcard__wrap'})
        product_list = []
        for x in product_div:
            poduct_name = x.find('p', attrs={'class': 'comp-productcard__name'}).text.strip()
            product_price_p = x.find('p', attrs={'class': 'comp-productcard__price'}).text
            product_img = x.img['src']
            product_list.append({'poduct_name': poduct_name, 'image_url': product_img, 'price': product_price_p})
        df = df.append(pd.DataFrame(product_list))
    return df
(PS: it looks like product_price doesn't exist, so I replaced it with product_price_p.)

Hope this helps!
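One caveat for anyone running either snippet on a current setup: DataFrame.append was removed in pandas 2.0, so the df = df.append(...) line will fail there. The usual replacement is pd.concat, e.g.:

    df = pd.concat([df, pd.DataFrame(product_list)], ignore_index=True)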