Scraping product and store information from the Target website

Asked: 2019-02-22 18:32:58

Tags: python-3.x web-scraping beautifulsoup

I am new to web scraping and would like to pull product data from the Target website.

[Screenshot: highlighted section of the Target product page showing the data I want to scrape]

I have been able to get the product name and price, but I cannot find the rest of the information with BeautifulSoup. For example, when I inspect the zip code in the browser it shows up with a data-test attribute, but searching for that tag returns nothing. Has anyone run into this before, or does anyone know a way to get this information?

Using Python 3 and BeautifulSoup.

I'm not sure of the best way to phrase this question, so let me know if you need more information or if it should be reworded.

<a href="#" class="h-text-underline Link-sc-1khjl8b-0 jvxzGg" data-test="storeFinderZipToggle">35401</a>

import requests
from bs4 import BeautifulSoup

f = open("demofile.txt", "w")

Page_Source = "https://www.target.com/p/nintendo-switch-with-neon-blue-and-neon-red-joy-con/-/A-52189185"

page = requests.get(Page_Source)

soup = BeautifulSoup(page.content, 'html.parser')

#write all the html code to a file to compare source files
f.write(str(soup))

#should contain city location but Secondary header can't be found
#location = soup.find("div", {'class': 'HeaderSecondary'})


#inside the secondary header should contain the store name but is not found
#store_location = location.find('div', {'data-test': 'storeId-store-name'})
#store_location = location.find('button', {'id': 'storeId-utility-NavBtn'})



#contains the rest of the information interested in
main_container = soup.find(id="mainContainer")
#complete_product_name = soup('span',attrs={'data-test':'product-title'})[0].text
product_price = soup.find("span", {'data-test': 'product-price'})
product_title = soup.find("span", {'data-test': 'product-title'})

flexible_fulfillment = main_container.find('div', {'data-test': 'flexible_fulfillment'})

#test = product_zip.find_all('a')
#example = soup.find_all("div", {'data-test': 'storePickUpType'})

example = soup.find_all('div', attrs={'data-test': 'maxOrderQuantityTxt'})
print(product_title)
print(product_price)

print(flexible_fulfillment)


f.close()
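
A quick way to check whether this is a rendering issue rather than a selector issue is to search the raw response for the data-test values directly; this diagnostic sketch is not part of the original question, and the marker names are the ones used above:

import requests

url = "https://www.target.com/p/nintendo-switch-with-neon-blue-and-neon-red-joy-con/-/A-52189185"
html = requests.get(url).text

# If a data-test value never appears in the raw HTML, the element is injected
# by JavaScript and BeautifulSoup alone will not see it.
for marker in ("product-title", "product-price", "storeFinderZipToggle", "flexible_fulfillment"):
    print(marker, "found" if marker in html else "not in static HTML")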

1 Answer:

Answer 0 (score: 0)

Update: a useful approach using Selenium.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException


#launch url
url = "https://www.target.com/p/nintendo-switch-with-neon-blue-and-neon-red-joy-con/-/A-52189185"

# create a new Safari session
driver = webdriver.Safari()
driver.implicitly_wait(15)
driver.get(url)

try:
    store_name_element = driver.find_element(By.XPATH, '//*[@id="storeId-utilityNavBtn"]/div[2]')
    print(store_name_element.get_attribute('innerText'))
except NoSuchElementException:
    print("There's no store name available")

try:
    item_name_element = driver.find_element(By.XPATH, '//*[@id="mainContainer"]/div/div/div[1]/div[1]/div[1]/h1/span')
    print(item_name_element.get_attribute('innerText'))
except NoSuchElementException:
    print("There's no item name available")

try:
    price_element = driver.find_element(By.XPATH, '//*[@id="mainContainer"]/div/div/div[1]/div[2]/div/div[1]/span')
    print(price_element.get_attribute('innerText'))
except NoSuchElementException:
    print("There's no price available")

try:
    zip_code_element = driver.find_element(By.XPATH, '//*[@id="mainContainer"]/div/div/div[1]/div[2]/div/div[6]/div/div[1]/div[1]/div/div[1]/a')
    print(zip_code_element.get_attribute('innerText'))
except NoSuchElementException:
    print("There's no zip code available")

try:
    order_by_element = driver.find_element(By.XPATH, '//*[@id="mainContainer"]/div/div/div[1]/div[2]/div/div[6]/div/div[1]/div[2]/p')
    print(order_by_element.get_attribute('innerText'))
except NoSuchElementException:
    print("There's no order by time available")

try:
    arrival_date_element = driver.find_element(By.XPATH, '//*[@id="mainContainer"]/div/div/div[1]/div[2]/div/div[6]/div/div[1]/div[2]/div/div/span')
    print(arrival_date_element.get_attribute('innerText'))
except NoSuchElementException:
    print("There's no arrival date available")

try:
    shipping_cost_element = driver.find_element(By.XPATH, '//*[@id="mainContainer"]/div/div/div[1]/div[2]/div/div[6]/div/div[2]/div/div[1]/div[1]/div[1]/div[1]')
    print(shipping_cost_element.get_attribute('innerText'))
except NoSuchElementException:
    print("There's no shipping cost available")

try:
    current_inventory_element = driver.find_element(By.XPATH, '//*[@id="mainContainer"]/div/div/div[1]/div[2]/div/div[6]/div/div[2]/div/div[1]/div[1]/div[1]/div[2]')
    print(current_inventory_element.get_attribute('innerText'))
except NoSuchElementException:
    print("There's no current inventory available")

driver.quit()

One thing I've noticed about this code, though, is that its results are inconsistent. Sometimes I get an error saying the element was not found, and other times it finds the element just fine. Does anyone know why this happens? Is it because I'm requesting the site too often?
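
One possible explanation is timing rather than rate limiting: with only implicitly_wait, the JavaScript that renders the element may not have finished when find_element runs. Below is a minimal sketch of the same store-name lookup using an explicit WebDriverWait; the 20-second timeout is an arbitrary choice and the XPath is copied from the answer above.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

driver = webdriver.Safari()
driver.get("https://www.target.com/p/nintendo-switch-with-neon-blue-and-neon-red-joy-con/-/A-52189185")

try:
    # Wait up to 20 seconds for the element to be present before giving up.
    store_name_element = WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.XPATH, '//*[@id="storeId-utilityNavBtn"]/div[2]'))
    )
    print(store_name_element.get_attribute('innerText'))
except TimeoutException:
    print("Store name did not load in time")

driver.quit()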