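I'm working on a scraping project, looking at what different recycling companies in the UK offer for products.
I've run into a problem with this website: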
http://www.musicmagpie.co.uk/entertainment/
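I have a list of barcodes and want to find the price the site pays for each one (enter the barcode into the search box and click the "Add" button). I've managed to get Selenium WebDriver working, but it is a very slow process, and I can't run a large number of barcodes without the site falling over on me and killing my process at some point.
My aim is about 1 search per second; at the moment each one takes more than 5 seconds on average. This is the code I'm running: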
import time
from selenium import webdriver

# Written against the Selenium 3 API (find_element_by_xpath, driver path as first argument).
# EANs, the list of barcodes to look up, is assumed to be defined earlier.
ProductName = []
BuyPrice = []
start = time.perf_counter()  # time.clock() was removed in Python 3.8

driver = webdriver.Chrome(r"C:\Users\leonK\Documents\Python Scripts\chromedriver.exe")
driver.get('http://www.musicmagpie.co.uk/start-selling/basket-media')
countx = 0  # total number of barcodes processed so far
count = 0   # number of basket rows expected after the current search
for EAN in EANs:
    countx += 1
    count += 1
    # Restart the browser periodically so a long run does not stall.
    if count % 200 == 0:
        driver.close()
        driver = webdriver.Chrome(r"C:\Users\leonK\Documents\Python Scripts\chromedriver.exe")
        driver.get('http://www.musicmagpie.co.uk/start-selling/basket-media')
        count = 1
    driver.find_element_by_xpath("""//*[@id="txtBarcode"]""").send_keys(str(EAN))
    # If a popup window appears, the first click fails and the handler closes the popup.
    try:
        driver.find_element_by_xpath("""//*[@id="getValSmall"]""").click()
    except Exception:
        driver.find_element_by_xpath("""//*[@id="gform_close"]""").click()
    prodnames = driver.find_elements_by_xpath("""//div[@class='col_Title']""")
    # A new row in the basket means the barcode was recognised.
    if len(prodnames) == count:
        ProductName.append(prodnames[0].text)
        BuyPrice.append(driver.find_elements_by_xpath("""//div[@class='col_Price']""")[0].text)
    else:
        ProductName.append('nan')
        BuyPrice.append('nan')
    count = len(prodnames)
    elapsed = time.perf_counter()
    print('MusicMagpieScraper:', EAN, '--', countx, '/', len(EANs), '--', (elapsed - start), 's')
driver.close()
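I have experience with urllib and with parsing using BeautifulSoup, and would prefer to switch to that. However, I don't know how to retrieve the data without a webdriver performing the click.
Any advice/tips would be much appreciated!
Added:
The Add button's link is: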
__doPostBack('ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$getValSmall','')
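In ASP.NET Web Forms pages, __doPostBack(eventTarget, eventArgument) normally copies its two arguments into the hidden __EVENTTARGET and __EVENTARGUMENT form fields and then submits the form, so the click can be described as two form values. A small sketch that pulls those values out of the link text; the helper name parse_do_postback is made up for illustration:

```python
import re

# Matches __doPostBack('target','argument') as it appears in the button's link.
_POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)")


def parse_do_postback(link):
    """Return the two form fields a __doPostBack link would set (hypothetical helper)."""
    match = _POSTBACK_RE.search(link)
    if match is None:
        raise ValueError("no __doPostBack call found in: %r" % link)
    target, argument = match.groups()
    return {"__EVENTTARGET": target, "__EVENTARGUMENT": argument}


link = ("__doPostBack('ctl00$ctl00$ctl00$ContentPlaceHolderDefault"
        "$mainContent$tabbedMediaVal_10$getValSmall','')")
fields = parse_do_postback(link)
```

Here fields holds the event target ending in getValSmall and an empty event argument.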
These are the form fields I found the JS function submitting:
{name: "__EVENTTARGET", value: ""}
{name: "__EVENTARGUMENT", value: ""}
{name: "__VIEWSTATE", value: "/wEPDwUENTM4MQ9kFgJmD2QWAmYPZBYCZg9kFgJmD2QWBGYPZB…uZSAhaW1wb3J0YW50O2RkQweS+jvDtjK8er7dCKBBRwOWWuE="}
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$signIn_8$hdn_BasketValue", value: "2"}
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$txtBarcode", value: "5051275026429"}
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$wtmBarcode_ClientState", value: ""}
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedTechVal_11$txtSearch", value: "Enter item (e.g. iPhone 5)"}
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedTechVal_11$wmSearch_ClientState", value: ""}
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$LegoVal_12$ddlLego", value: "-999"}
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$TotalValueBox_14$txtPromoVoucher_sm", value: ""}
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$TotalValueBox_14$txtPromoVoucher", value: ""}
{name: "__SCROLLPOSITIONX", value: "0"}
{name: "__SCROLLPOSITIONY", value: "0"}
{name: "hiddenInputToUpdateATBuffer_CommonToolkitScripts", value: "1"}
The txtBarcode entry is where the barcode is entered:
{name: "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$txtBarcode", value: "5051275026429"}
Hopefully that is useful information. I don't know where to go from here, and Google hasn't been much help.
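One possible way forward without a browser, sketched under assumptions: fetch the page once, copy every hidden input (such as __VIEWSTATE) into a dictionary, then add the event target from the Add button and the barcode. Whether this site actually requires the hidden fields to be echoed back is not confirmed here. The class HiddenInputCollector and the function build_payload are illustrative names, and the sample HTML is invented for the demonstration.

```python
from html.parser import HTMLParser

# Common prefix of the field names listed above.
PREFIX = "ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$"


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_payload(page_html, ean):
    """Combine the page's hidden fields with the postback target and barcode."""
    collector = HiddenInputCollector()
    collector.feed(page_html)
    payload = dict(collector.fields)
    payload["__EVENTTARGET"] = PREFIX + "getValSmall"
    payload["__EVENTARGUMENT"] = ""
    payload[PREFIX + "txtBarcode"] = str(ean)
    return payload


# Invented sample markup, standing in for the real page source.
sample_html = """
<form>
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="text" name="other" value="ignored" />
</form>
"""
payload = build_payload(sample_html, 5051275026429)
```

The resulting dictionary could then be sent as the form data of a POST request to the same page.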
Answer 0 (score: 1)
I managed to find a solution using requests:
import requests
from bs4 import BeautifulSoup

ean = '5051275026429'  # barcode to look up (example value from the question)

# Initial GET of the page; its response is not used below.
get_response = requests.get(url='http://www.musicmagpie.co.uk/start-selling/')
post_data = {'__EVENTTARGET' : 'ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$getValSmall',
             '__EVENTARGUMENT' : '',
             'ctl00$ctl00$ctl00$ContentPlaceHolderDefault$mainContent$tabbedMediaVal_10$txtBarcode' : ean}
# POST some form-encoded data:
post_response = requests.post(url='http://www.musicmagpie.co.uk/start-selling/', data=post_data)
soup = BeautifulSoup(post_response.text, "lxml")
# The first matching divs hold the price and title of the item just added.
BuyPrice = soup.find('div', class_='col_Price').text.rstrip()
ProductName = soup.find('div', class_='col_Title').text.rstrip()
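This code sends a dictionary of form fields and values (possibly not the right terminology!), which triggers a response that is easy to parse, and from that I extracted the data I wanted.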
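To run this over a long list of barcodes at roughly one search per second, as the question aims for, the request could be wrapped in a loop that paces itself and records 'nan' when a lookup fails, mirroring the question's code. A minimal sketch: run_batch and the lookup callable are hypothetical names, with lookup standing in for a function that performs the POST above and returns (name, price).

```python
import time


def run_batch(eans, lookup, min_interval=1.0, retries=2):
    """Call lookup(ean) for each barcode, at most once per min_interval seconds.

    lookup is a hypothetical callable returning (product_name, buy_price).
    Any exception is retried immediately, and ('nan', 'nan') is recorded
    if every attempt fails.
    """
    results = {}
    for ean in eans:
        started = time.monotonic()
        outcome = ("nan", "nan")
        for _ in range(retries + 1):
            try:
                outcome = lookup(ean)
                break
            except Exception:
                continue
        results[ean] = outcome
        # Sleep only for whatever is left of the interval.
        remaining = min_interval - (time.monotonic() - started)
        if remaining > 0:
            time.sleep(remaining)
    return results


# Stand-in lookup for demonstration; a real one would send the POST request.
def fake_lookup(ean):
    if ean == "bad":
        raise AttributeError("no col_Price div in response")
    return ("Some title", "0.50")


demo = run_batch(["5051275026429", "bad"], fake_lookup, min_interval=0)
```

Passing the lookup in as a callable keeps the pacing logic separate from the HTTP code, so it can be tried out without touching the site.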