Loading multiple pages with Python web scraping

Date: 2020-06-20 12:04:50

Tags: python for-loop web-scraping

I wrote Python code for web scraping so that I can import data from flipkart.
I need to load multiple pages so that I can import many products, but right now only the products from page 1 are coming through.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup


my_url = 'https://www.xxxxxx.com/food-processors/pr?sid=j9e%2Cm38%2Crj3&page=1'

uClient2 = uReq(my_url)
page_html = uClient2.read()
uClient2.close()

page_soup = soup(page_html, "html.parser")

containers11 = page_soup.findAll("div",{"class":"_3O0U0u"}) 

filename = "FoodProcessor.csv"
f = open(filename, "w", encoding='utf-8-sig')
headers = "Product, Price, Description \n"
f.write(headers)

for container in containers11:
    title_container = container.findAll("div",{"class":"_3wU53n"})
    product_name = title_container[0].text

    price_con = container.findAll("div",{"class":"_1vC4OE _2rQ-NK"})
    price = price_con[0].text



    description_container = container.findAll("ul",{"class":"vFw0gD"})
    product_description = description_container[0].text


    print("Product: " + product_name)
    print("Price: " + price)
    print("Description: " + product_description)
    # strip commas from the price and description so the CSV columns stay aligned
    f.write(product_name + "," + price.replace(",", "") + "," + product_description.replace(",", ";") + "\n")

f.close()

3 Answers:

Answer 0: (score: 1)

You have to check whether the next-page button exists. If it does, return True, go to the next page and start scraping; if it does not, return False and stop paging. First inspect the page to find that button's class name.

# to check whether a pagination (Next) button exists on the page:
from selenium.common.exceptions import NoSuchElementException

def go_next_page():
    try:
        button = driver.find_element_by_xpath('//a[@class="<class name>"]')
        return True, button
    except NoSuchElementException:
        return False, None
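A helper like go_next_page above could drive the whole scrape in a loop. The sketch below is an assumption about how the pieces fit together; scrape_current_page stands in for your own parsing code, and the driver is whatever Selenium WebDriver you have open:

```python
def scrape_all_pages(driver, scrape_current_page, go_next_page):
    # Scrape the page we are on, then advance while a Next button exists.
    while True:
        scrape_current_page(driver)
        has_next, button = go_next_page()
        if not has_next:
            break
        button.click()
```

Because the helpers are passed in as plain callables, the loop itself contains no Selenium-specific code and can be tested without a browser.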

Answer 1: (score: 0)

You can first get the number of available pages, then iterate over each page and parse its data separately.

You only need to change the page number in the URL:

  • 'https://www.flipkart.com/food-processors/pr?sid=j9e%2Cm38%2Crj3&page=1' points to page 1
  • 'https://www.flipkart.com/food-processors/pr?sid=j9e%2Cm38%2Crj3&page=2' points to page 2
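A minimal sketch of this approach, building one URL per page from the pattern above. The last-page number is passed in as a parameter here; reading the real page count from Flipkart's pagination widget is left out, and the template string is taken from the URLs shown above:

```python
# Assumed URL template: only the trailing page number changes.
BASE = "https://www.flipkart.com/food-processors/pr?sid=j9e%2Cm38%2Crj3&page={}"

def page_urls(last_page):
    # Build one URL per page by substituting the page number.
    return [BASE.format(n) for n in range(1, last_page + 1)]

# Each URL can then be fetched and parsed exactly like the single page
# in the question, e.g.:
#   for url in page_urls(last_page):
#       page_soup = soup(uReq(url).read(), "html.parser")
#       ...reuse the container-parsing loop from the question...
```

This keeps the question's single-page parsing code unchanged; the only new part is the loop over generated URLs.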

Answer 2: (score: 0)

from selenium.common.exceptions import (
    ElementClickInterceptedException,
    TimeoutException,
)

while True:
    try:
        next_btn = driver.find_element_by_xpath("//a//span[text()='Next']")
        next_btn.click()
    except ElementClickInterceptedException:
        # an overlay (e.g. the login popup) intercepts the click:
        # hide it, then click Next again
        classes = "_3ighFh"
        overlay = driver.find_element_by_xpath("(//div[@class='{}'])[last()]".format(classes))
        driver.execute_script("arguments[0].style.visibility = 'hidden'", overlay)
        next_btn = driver.find_element_by_xpath("//a//span[text()='Next']")
        next_btn.click()
    except TimeoutException:
        print("Page Timed Out")
        break
    except Exception as e:
        # no Next button left (last page) or any other failure: stop paging
        print(str(e))
        break

driver.quit()