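I am learning web scraping with Python, but I am still stuck on scraping Ajax pages like this one.
I want to scrape all the medicine names and details from the page. I have read most of the related answers on Stack Overflow, but I am not getting the right data after scraping. I also tried scraping with selenium and sending a forged POST request, but both attempts failed.
So please focus specifically on Ajax scraping for this page, where the Ajax request is triggered when an option is selected from the dropdown. Also, please point me to some resources on scraping Ajax pages.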
# Using selenium
from selenium import webdriver
import bs4 as bs
import lxml
import requests
path_to_chrome = '/home/brutal/Desktop/chromedriver'
browser = webdriver.Chrome(executable_path = path_to_chrome)
url = 'https://www.gianteagle.com/Pharmacy/Savings/4-10-Dollar-Drug-Program/Generic-Drug-Program/'
browser.get(url)
browser.find_element_by_xpath('//*[@id="ctl00_RegionPage_RegionPageMainContent_RegionPageContent_userControl_StateList"]/option[contains(text(), "Ohio")]').click()
new_url = browser.current_url
r = requests.get(new_url)
print(r.content)
Answer 0 (score: 1)
You need ChromeDriver, which you can download here.
normalize-space is used to strip junk from the text taken from the page, such as x0.
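As a minimal sketch of what normalize-space does, the snippet below runs it on a single made-up table cell. The HTML string and the drug name are assumptions for illustration only and are not taken from the real page.

```python
from lxml.html import fromstring

# Hypothetical table cell with the kind of stray whitespace found in rendered pages.
table = fromstring(
    '<table><tr>'
    '<td class="generic-name">\n   Amoxicillin \t\n  500 mg  </td>'
    '</tr></table>'
)

# text() returns the raw text node, whitespace included.
raw = table.xpath('.//td[@class="generic-name"]/text()')[0]

# normalize-space() trims both ends and collapses each whitespace run to one space.
clean = table.xpath('normalize-space(.//td[@class="generic-name"])')

# For a cell that does not exist, normalize-space() yields an empty string,
# whereas indexing the text() result with [0] would raise IndexError.
missing = table.xpath('normalize-space(.//td[@class="brand-name"])')

print(repr(raw))
print(repr(clean))
print(repr(missing))
```

Note that XPath 1.0 normalize-space only treats space, tab, carriage return, and line feed as whitespace, so other characters (a non-breaking space, for example) would need separate handling.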
from time import sleep
from selenium import webdriver
from lxml.html import fromstring

data = {}

driver = webdriver.Chrome('PATH TO YOUR DRIVER/chromedriver')  # e.g. '/home/superman/www/myproject/chromedriver'
driver.get('https://www.gianteagle.com/Pharmacy/Savings/4-10-Dollar-Drug-Program/Generic-Drug-Program/')

# Loop over the states (option 1 is skipped)
for i in range(2, 7):
    dropdown_state = driver.find_element(by='id', value='ctl00_RegionPage_RegionPageMainContent_RegionPageContent_userControl_StateList')

    # Open the dropdown
    dropdown_state.click()

    # Click the state; selecting an option triggers the Ajax request
    driver.find_element(by='xpath', value='//*[@id="ctl00_RegionPage_RegionPageMainContent_RegionPageContent_userControl_StateList"]/option[' + str(i) + ']').click()

    # Give the Ajax request time to finish and the table to reload
    sleep(3)

    # Parse the rendered HTML
    page_content = driver.page_source
    tree = fromstring(page_content)

    state = tree.xpath('//*[@id="ctl00_RegionPage_RegionPageMainContent_RegionPageContent_userControl_StateList"]/option[' + str(i) + ']/text()')[0]
    data[state] = []

    # Loop over the products listed for this state
    for line in tree.xpath('//*[@id="ctl00_RegionPage_RegionPageMainContent_RegionPageContent_userControl_gridSearchResults"]/tbody/tr[@style]'):
        med_type = line.xpath('normalize-space(.//td[@class="medication-type"])')
        generic_name = line.xpath('normalize-space(.//td[@class="generic-name"])')
        brand_name = line.xpath('normalize-space(.//td[@class="brand-name hidden-xs"])')
        strength = line.xpath('normalize-space(.//td[@class="strength"])')
        form = line.xpath('normalize-space(.//td[@class="form"])')
        qty_30_day = line.xpath('normalize-space(.//td[@class="30-qty"])')
        price_30_day = line.xpath('normalize-space(.//td[@class="30-price"])')
        qty_90_day = line.xpath('normalize-space(.//td[@class="90-qty hidden-xs"])')
        price_90_day = line.xpath('normalize-space(.//td[@class="90-price hidden-xs"])')

        data[state].append(dict(med_type=med_type,
                                generic_name=generic_name,
                                brand_name=brand_name,
                                strength=strength,
                                form=form,
                                qty_30_day=qty_30_day,
                                price_30_day=price_30_day,
                                qty_90_day=qty_90_day,
                                price_90_day=price_90_day))

print('data:', data)

driver.quit()
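If you want to check the row XPath without launching a browser, the parsing step can be tried on a static snippet first. This is only a sketch: SAMPLE_HTML is a made-up, trimmed-down stand-in for the results table (two columns, invented rows), and parse_rows is a hypothetical helper name. The table id and the class names are the ones used in the code above.

```python
from lxml.html import fromstring

GRID_ID = ('ctl00_RegionPage_RegionPageMainContent_RegionPageContent'
           '_userControl_gridSearchResults')
ROW_XPATH = '//*[@id="%s"]/tbody/tr[@style]' % GRID_ID

# Hypothetical sample; the real table has more columns and real data.
# <tbody> is written out explicitly because lxml does not insert it,
# while the page source serialized by the browser contains it.
SAMPLE_HTML = """
<table id="%s">
  <tbody>
    <tr><th>Generic name</th><th>Strength</th></tr>
    <tr style="color:#333">
      <td class="generic-name"> Amoxicillin </td>
      <td class="strength"> 500 mg </td>
    </tr>
    <tr style="color:#333">
      <td class="generic-name"> Lisinopril </td>
      <td class="strength"> 10 mg </td>
    </tr>
  </tbody>
</table>
""" % GRID_ID


def parse_rows(html):
    """Return one dict per data row; the header row has no style attribute and is skipped."""
    tree = fromstring(html)
    rows = []
    for line in tree.xpath(ROW_XPATH):
        rows.append({
            'generic_name': line.xpath('normalize-space(.//td[@class="generic-name"])'),
            'strength': line.xpath('normalize-space(.//td[@class="strength"])'),
        })
    return rows


rows = parse_rows(SAMPLE_HTML)
print(rows)
```

Once the XPath expressions behave as expected on the sample, the same function could be fed driver.page_source inside the state loop.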