I'm trying to download every diamond listed on this site: https://www.bluenile.com/diamond-search?tag=none&track=NavDiaVAll
The plan is to grab the data and work out which listing I like best (I'll run some regressions to see which stones are good value and pick my favourites).
To that end I wrote my first scraper. The problem is that it only seems to pick up the first 60 diamonds rather than everything I can see on the site. Ideally I'd like it to fetch all 100k+ diamonds across the different shapes (round, cushion, etc.). How do I get all the data?
(I think this is because new rows only load as you scroll down, but the first load already contains more than 60, and even if I scroll to the bottom only about 1000 rows are shown.)
Here's my code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
url = 'https://www.bluenile.com/diamond-search?tag=none&track=NavDiaVAll'
url_response = requests.get(url)
soup = BeautifulSoup(url_response.content, "html.parser")
""" Now we have the page as soup
Lets start to get the header"""
headerinctags = soup.find_all('div', class_='grid-header normal-header')
header = headerinctags[0].get_text(';')
diamondsmessy = soup.find_all('a', class_='grid-row row ')
diamondscleaned = diamondsmessy[1].get_text(";")
"""Create diamonds dataframe with the header; take out the 1st value"""
header = header.split(";")
del header[0]
diamonds = pd.DataFrame(columns=header)
""" place rows into dataframe after being split; use a & b as dummy variables; take out 5th value"""
for i in range(len(diamondsmessy)):
    a = diamondsmessy[i].get_text(";")
    b = a.split(";")
    del b[4]
    a = pd.DataFrame(b, index=header)
    b = a.transpose()
    diamonds = pd.concat([diamonds, b], ignore_index=True)

print(diamonds)
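As a side note, growing the frame with one pd.concat per row is slow for large tables; collecting plain lists and building the DataFrame once is the more idiomatic pattern. A minimal sketch of the same parse, against hypothetical simplified markup (the real site's class names and columns differ):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for the real diamond grid.
html = """
<div class="grid-header">Shape;Price;Carat;Cut</div>
<a class="grid-row">Round;$500;0.9;Ideal;extra</a>
<a class="grid-row">Princess;$750;1.0;Good;extra</a>
"""
soup = BeautifulSoup(html, "html.parser")
header = soup.find("div", class_="grid-header").get_text(";").split(";")

rows = []
for anchor in soup.find_all("a", class_="grid-row"):
    cells = anchor.get_text(";").split(";")
    del cells[4]  # drop the unwanted 5th field, as in the original loop
    rows.append(cells)

# One DataFrame construction instead of a concat per row.
diamonds = pd.DataFrame(rows, columns=header)
```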
Answer 0 (score: 0)
I worked out how to do it. It isn't fast, but essentially I needed Selenium to scroll down the page. That still left me capped at about 1000 rows, so I loop and update the page's minimum-price filter to move past the cap.
To help others, here's the code:
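The core idea, record the highest price collected so far, set it as the new minimum-price filter, reload, and drop duplicates, can be illustrated without a browser. Here `fetch_page` is a hypothetical stand-in for the Selenium-driven, row-capped page view:

```python
import pandas as pd

# Synthetic inventory; the real site caps each filtered view at ~1000 rows,
# here the cap is 4 to keep the example small.
inventory = pd.DataFrame({
    "id": range(10),
    "Price": [100, 120, 120, 150, 200, 220, 260, 300, 340, 400],
}).sort_values("Price")

CAP = 4  # rows visible per filtered view

def fetch_page(min_price):
    # Hypothetical stand-in for "filter the site by price and scrape the grid".
    return inventory[inventory["Price"] >= min_price].head(CAP)

diamonds = pd.DataFrame()
min_price = 0
for _ in range(5):  # pages_visited in the real script
    page = fetch_page(min_price)
    diamonds = pd.concat([diamonds, page], ignore_index=True).drop_duplicates()
    min_price = diamonds["Price"].max()  # highest price seen becomes the new floor
```

Rows at the boundary price appear twice, which is why the drop_duplicates step is essential.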
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time
#for fun, let's time this
start = time.time()
"""Define important numbers"""
scroll_pauze_time = 0.5 #delay after scroll
scroll_number = 20 #number of times scrolled per page
pages_visited = 25 #number of times the price is increased
"""Set up the website"""
url = 'https://www.bluenile.com/diamond-search?tag=none&track=NavDiaVAll'
url_response = webdriver.Firefox()
url_response.get(url)
#minimum & max carat:
min_carat = url_response.find_element_by_css_selector('.carat-filter .allowHighAscii:nth-child(1)')
min_carat.send_keys('0.8')
min_carat.send_keys(Keys.ENTER)
max_carat = url_response.find_element_by_css_selector('.carat-filter .allowHighAscii:nth-child(2)')
max_carat.send_keys('1.05')
max_carat.send_keys(Keys.ENTER)
#Shapes of diamonds:
url_response.find_element_by_css_selector('.shape-filter-button:nth-child(2) > .shape-filter-button-inner').click()
url_response.find_element_by_css_selector('.shape-filter-button:nth-child(4) > .shape-filter-button-inner').click()
url_response.find_element_by_css_selector('.shape-filter-button:nth-child(5) > .shape-filter-button-inner').click()
url_response.find_element_by_css_selector('.shape-filter-button:nth-child(7) > .shape-filter-button-inner').click()
"""Create diamonds dataframe with the header; take out the 1st value"""
soup = BeautifulSoup(url_response.page_source, "html.parser")
headerinctags = soup.find_all('div', class_='grid-header normal-header')
header = headerinctags[0].get_text(';')
header = header.split(";")
del header[0]
diamonds = pd.DataFrame(columns=header)
"""Start loop, dummy variable j"""
for j in range(pages_visited):
    print(j)
    url_response.execute_script("window.scrollTo(0, 0)")

    # Set the minimum price to the price of the last row collected so far
    if j != 0:
        min_price = url_response.find_element_by_css_selector('input[name="minValue"]')
        min_price.send_keys(Keys.CONTROL, "a")
        min_price.send_keys(Keys.DELETE)
        a = diamonds.loc[len(diamonds.count(1))-1, "Price"]  # last row's price
        a = a.replace('$', '')
        a = a.replace(',', '')
        min_price.send_keys(a)
        min_price.send_keys(Keys.ENTER)

    # Scroll down
    for i in range(scroll_number):
        url_response.execute_script("window.scrollTo(0, " + str((i+1)*2000) + ')')
        time.sleep(scroll_pauze_time)

    # Grab data
    soup = BeautifulSoup(url_response.page_source, "html.parser")
    diamondsmessy = soup.find_all('a', class_='grid-row row ')

    """ place rows into dataframe after being split; use a & b as dummy variables; take out 5th value"""
    for i in range(len(diamondsmessy)):
        a = diamondsmessy[i].get_text(";")
        b = a.split(";")
        del b[4]
        a = pd.DataFrame(b, index=header)
        b = a.transpose()
        diamonds = pd.concat([diamonds, b], ignore_index=True)
diamonds = diamonds.drop_duplicates()
diamonds.to_csv('diamondsoutput.csv')
print(diamonds)
end = time.time()
print("This took "+ str(end-start)+" seconds")