I'm using selenium and BeautifulSoup to scrape data from a website (http://www.grownjkids.gov/ParentsFamilies/ProviderSearch) by clicking a next button in a loop. I previously struggled with StaleElementReferenceException, but overcame it by looping to re-find the element on the page. Now I've hit a new problem: the script clicks all the way through to the end, but when I check the csv file it writes, most of the data looks good, yet there are often duplicate rows in batches of 5 (which is the number of results displayed per page).
An example of what I mean: https://www.dropbox.com/s/ecsew52a25ihym7/Screen%20Shot%202019-02-13%20at%2011.06.41%20AM.png?dl=0
I have a hunch this is because the program re-extracts the current page's data every time it attempts to find the next button. I'm confused why this would happen, since as I understand it, the actual scraping only happens after you break out of the inner while loop that looks for the next button and back into the outer one. (Let me know if I'm misunderstanding this, as I'm comparatively new to this stuff.)
Also, the data my program outputs is different on every run (which makes sense given the errors, since in the past the StaleElementReferenceExceptions occurred at sporadic locations; if results are duplicated each time that exception occurs, it would fit). Even worse, a different batch of results is also skipped on each run: I cross-compared the output of 2 different runs, and some results were present in one but not the other.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException, StaleElementReferenceException
from bs4 import BeautifulSoup
import csv
chrome_options = Options()
chrome_options.add_argument('--disable-gpu')
chrome_options.add_argument("--headless")
url = "http://www.grownjkids.gov/ParentsFamilies/ProviderSearch"
driver = webdriver.Chrome('###location###', chrome_options=chrome_options)
driver.implicitly_wait(10)
driver.get(url)
#clears text box
driver.find_element_by_class_name("form-control").clear()
#clicks on search button without putting in any parameters, getting all the results
search_button = driver.find_element_by_id("searchButton")
search_button.click()
df_list = []
headers = ["Rating", "Distance", "Program Type", "County", "License", "Program Name", "Address", "Phone", "Latitude", "Longitude"]
while True:
    #keeps on clicking next button to fetch each group of 5 results
    try:
        nextButton = driver.find_element_by_class_name("next")
        nextButton.send_keys('\n')
    except NoSuchElementException:
        break
    except StaleElementReferenceException:
        attempts = 0
        while (attempts < 100):
            try:
                nextButton = driver.find_element_by_class_name("next")
                if nextButton:
                    nextButton.send_keys('\n')
                    break
            except NoSuchElementException:
                break
            except StaleElementReferenceException:
                attempts += 1

    #finds table of center data on the page
    table = driver.find_element_by_id("results")
    html_source = table.get_attribute('innerHTML')
    soup = BeautifulSoup(html_source, "lxml")

    #iterates through centers, extracting the data
    for center in soup.find_all("div", {"class": "col-sm-7 fields"}):
        mini_list = []
        #all fields except latlong
        for row in center.find_all("div", {"class": "field"}):
            material = row.find("div", {"class": "value"})
            if material is not None:
                mini_list.append(material.getText().encode("utf8").strip())
        #parses latlong from link
        for link in center.find_all('a', href = True):
            content = link['href']
            latlong = content[34:-1].split(',')
            mini_list.append(latlong[0])
            mini_list.append(latlong[1])
        df_list.append(mini_list)

#writes content into csv
with open('output_file.csv', "wb") as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(row for row in df_list if row)
Anything would help! And if you have suggestions about the way I'm using selenium/BeautifulSoup/python in general that would improve my future programming, I would appreciate it.
Thanks so much!
Answer 0 (score: 0)
I would use selenium to get the result count, then make an API call to get the actual results. In case the result count is greater than the limit of the API queryString's pageSize parameter, you can either loop in batches, incrementing the
currentPage parameter until you have reached the total, or, as shown below, simply request all the results in one go. Then extract what you want from the json.
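For reference, here is a minimal sketch of the batched variant (it reuses the GetProviders endpoint from the script below; the 50-per-page cap and the flat json-list response shape are assumptions, not confirmed behavior):

import requests

# Page through the endpoint, incrementing currentPage until a short batch
# signals the last page. pageSize=50 is an assumed cap, not a documented one.
base = ('http://www.grownjkids.gov/Services/GetProviders?latitude=40.2171'
        '&longitude=-74.7429&distance=10&county=&toddlers=false&preschool=false'
        '&infants=false&rating=&programTypes=&pageSize=50&currentPage={}')
all_results = []
page = 0
while True:
    batch = requests.get(base.format(page)).json()  # assumes the response is a json list
    all_results.extend(batch)
    if len(batch) < 50:  # last (possibly partial) page reached
        break
    page += 1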
You can iterate the list of dictionaries in the response; an example of writing out some values follows the script below.
import requests
import json
from bs4 import BeautifulSoup as bs
from selenium import webdriver
initUrl = 'http://www.grownjkids.gov/ParentsFamilies/ProviderSearch'
driver = webdriver.Chrome()
driver.get(initUrl)
numResults = driver.find_element_by_css_selector('#totalCount').text
driver.quit()
newURL = 'http://www.grownjkids.gov/Services/GetProviders?latitude=40.2171&longitude=-74.7429&distance=10&county=&toddlers=false&preschool=false&infants=false&rating=&programTypes=&pageSize=' + numResults + '&currentPage=0'
data = requests.get(newURL).json()
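An example of writing out some values (a sketch; the response is assumed to be a list of dicts, and the key names used here are guesses, so print data[0] to see the real field names):

# Hypothetical keys: inspect data[0] for the actual field names.
for item in data:
    print(item.get('ProgramName'), item.get('Phone'))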
If you are concerned about the latitude and longitude values, you can grab them from one of the script tags while you are still in selenium.
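A rough sketch of that idea (run it before driver.quit() in the script above; the regex, and the assumption that the coordinates appear as numeric literals inside an inline script, are guesses):

import re
from bs4 import BeautifulSoup as bs

# Scan inline <script> tags for latitude/longitude literals. The variable
# names being searched for are assumptions about the page's javascript.
soup = bs(driver.page_source, 'lxml')
for script in soup.find_all('script'):
    text = script.string or ''
    lat = re.search(r'latitude["\']?\s*[=:]\s*(-?\d+\.\d+)', text)
    lng = re.search(r'longitude["\']?\s*[=:]\s*(-?\d+\.\d+)', text)
    if lat and lng:
        print(lat.group(1), lng.group(1))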
I found the alternate URL for the XHR jQuery GET by using the dev tools on the page (F12), refreshing the page with F5, and inspecting the jQuery requests made in the network tab.
Answer 1 (score: 0)
You should read the HTML content in each iteration of the while loop. Example below:
counter = 0
while counter < page_number_limit:
    counter = counter + 1
    # re-read the page HTML on every pass so the soup reflects the current page
    new_data = driver.page_source
    page_contents = BeautifulSoup(new_data, 'lxml')
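Applied to the question's script, that means rebuilding the soup from driver.page_source after each click rather than holding on to an old reference (a sketch reusing the question's imports and variable names):

# Inside the question's `while True:` loop, right after the next button is clicked:
soup = BeautifulSoup(driver.page_source, 'lxml')
results = soup.find('div', {'id': 'results'})  # re-locate the results container on every pass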