How to scrape product details pages with Selenium

Time: 2020-03-17 21:31:18

Tags: python selenium selenium-webdriver selenium-chromedriver selenium-rc

I am learning Selenium. Right now this code of mine scrapes all the product titles from the front page at https://www.daraz.com.bd/consumer-electronics/?spm=a2a0e.pdp.breadcrumb.1.4d20110bzkC0bn, but I want to click each product link on that page, which takes me to the product details page, so I can scrape information from the details page. Here is my code:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

#argument for incognito Chrome
option = webdriver.ChromeOptions()
option.add_argument(" — incognito")

browser = webdriver.Chrome()

browser.get("https://www.daraz.com.bd/consumer-electronics/?spm=a2a0e.pdp.breadcrumb.1.4d20110bzkC0bn")

# Wait 20 seconds for page to load
timeout = 20
try:
    WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='c16H9d']")))
except TimeoutException:
    print("Timed out waiting for page to load")
    browser.quit()



# find_elements_by_xpath returns an array of selenium objects.
titles_element = browser.find_elements_by_xpath("//div[@class='c16H9d']")


# use list comprehension to get the actual repo titles and not the selenium objects.
titles = [x.text for x in titles_element]
# print out all the titles.
print('titles:')
print(titles, '\n')
browser.quit()

2 Answers

Answer 0 (score: 4)

I suggest you get the hrefs and open them one by one.

You need to use this locator: By.XPATH, "//div[@class='c16H9d']//a", and wait for all matching elements with .visibility_of_all_elements_located instead of .visibility_of_element_located.

Then get each href with .get_attribute('href').

Then open a new window with the specific href you obtained:

browser.get("https://www.daraz.com.bd/consumer-electronics/?spm=a2a0e.pdp.breadcrumb.1.4d20110bzkC0bn")

# Wait 20 seconds for page to load
timeout = 20

elements = WebDriverWait(browser, timeout).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='c16H9d']//a")))

for element in elements:
    #get href
    href = element.get_attribute('href')
    print(href)
    #open new window with specific href
    browser.execute_script(f"window.open('{href}');")
    # switch to new window
    browser.switch_to.window(browser.window_handles[1])


    #......now you are on the new window, scrape here
    #example to scrape 'title' in the new window
    xx = WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.CLASS_NAME, "pdp-mod-product-badge-title")))
    print(xx.text)


    #close the new window
    browser.close()
    #back to main window
    browser.switch_to.window(browser.window_handles[0])

browser.quit()
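Whichever way you drive the browser, it helps to collect the href strings up front and clean them before opening any windows: category listings can repeat products and also contain non-product links (banners, help pages, and so on). A minimal pure-Python sketch of such a filter; the "/products/" marker is an assumption about how Daraz product URLs look, so adjust it for other sites:

```python
def clean_product_links(hrefs):
    """Keep only product-page URLs and drop duplicates, preserving order."""
    seen = set()
    cleaned = []
    for href in hrefs:
        # Assumed marker: Daraz product URLs contain "/products/" in the
        # path. Change this test to match the site you are scraping.
        if not href or "/products/" not in href:
            continue
        if href not in seen:
            seen.add(href)
            cleaned.append(href)
    return cleaned
```

You would feed it the list built from `element.get_attribute('href')` and loop over the cleaned result instead of the raw elements.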

Answer 1 (score: 2)

You can use BeautifulSoup to make your life easier.

I have modified your code slightly to illustrate how you can navigate through all the individual product links on the page.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup

#argument for incognito Chrome
option = Options()
option.add_argument("--incognito")


browser = webdriver.Chrome(options=option)

browser.get("https://www.daraz.com.bd/consumer-electronics/?spm=a2a0e.pdp.breadcrumb.1.4d20110bzkC0bn")

# Wait 20 seconds for page to load
timeout = 20
try:
    WebDriverWait(browser, timeout).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='c16H9d']")))
except TimeoutException:
    print("Timed out waiting for page to load")
    browser.quit()

soup = BeautifulSoup(browser.page_source, "html.parser")

product_items = soup.find_all("div", attrs={"data-qa-locator": "product-item"})
for item in product_items:
    item_url = f"https:{item.find('a')['href']}"
    print(item_url)

    browser.get(item_url)

    item_soup = BeautifulSoup(browser.page_source, "html.parser")

    # Use the item_soup to find details about the item from its url.

browser.quit()
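One caveat with the `f"https:{item.find('a')['href']}"` line above: it assumes every href is protocol-relative (starting with `//`); if the site ever emits an absolute or root-relative link, the result is malformed. `urllib.parse.urljoin` handles all three forms uniformly. A small sketch, using the site root from the question as the base URL:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.daraz.com.bd/"

def absolutize(href):
    # urljoin resolves protocol-relative ("//host/path"), root-relative
    # ("/path"), and already-absolute hrefs against the base in one call.
    return urljoin(BASE_URL, href)
```

You could then write `item_url = absolutize(item.find('a')['href'])` instead of the f-string.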

In short, it is what arundeep chohan mentioned in the comments section. You can choose to create a new instance of browser, say browser1 = webdriver.Chrome(), and use it to visit all the product URLs while the first browser stays on the listing page.

I also realized that incognito mode was not working in your script. You need to define chrome_options and pass it as an argument to the webdriver.Chrome method, as shown above.