A simple question. I can scrape the search results from the first page of a DuckDuckGo search, but I am struggling to get to the second and subsequent pages. I am using Selenium WebDriver with Python, which works fine for the first page of results. The code I use to scrape the first page is:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome()
results_url = "https://duckduckgo.com/?q=paralegal&t=h_&ia=web"
browser.get(results_url)

results = browser.find_elements_by_id('links')
num_page_items = len(results)
for i in range(num_page_items):
    print(results[i].text)
print(len(results))

nxt_page = browser.find_element_by_link_text("Load More")
if nxt_page:
    nxt_page.send_keys(Keys.PAGE_DOWN)
There are some breaks that signal the start of a new page, but they don't seem to change the URL, so I tried the above to scroll down the page and then repeated the code to find the links on the next page. But it doesn't work. Any help would be much appreciated.
Answer 0 (score: 0)
If I search the page source of the search results for Load More, I cannot find it. Have you tried the non-JavaScript version instead? You can use it by simply adding html to the URL: https://duckduckgo.com/html?q=paralegal&t=h_&ia=web. There you will find a next button at the end of the results.
This one works for me (Chrome version):
from selenium import webdriver

browser = webdriver.Chrome()
results_url = "https://duckduckgo.com/html?q=paralegal&t=h_&ia=web"
browser.get(results_url)

results = browser.find_elements_by_id('links')
num_page_items = len(results)
for i in range(num_page_items):
    print(results[i].text)
print(len(results))

# the "Next" button on the HTML version carries the class "btn--alt"
nxt_page = browser.find_element_by_class_name('btn--alt')
if nxt_page:
    browser.execute_script('arguments[0].scrollIntoView();', nxt_page)
    nxt_page.click()
By the way: DuckDuckGo also offers a nice API, which might be easier to use ;)
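(For reference, a minimal sketch of that route using the requests library. The answer does not name a specific API, so the Instant Answer endpoint at https://api.duckduckgo.com is an assumption here; note that it returns abstracts and related topics rather than full web results:)

import requests

# Instant Answer endpoint -- an assumption, since the answer above
# does not say which DuckDuckGo API it means.
resp = requests.get(
    "https://api.duckduckgo.com/",
    params={"q": "paralegal", "format": "json", "no_html": 1},
)
data = resp.json()
print(data.get("AbstractText"))
for topic in data.get("RelatedTopics", []):
    if "FirstURL" in topic:
        print(topic["FirstURL"], "-", topic.get("Text", ""))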
Answer 1 (score: 0)
Locating the button by the class 'btn--alt' stops working once you reach the second page, because the 'Next' and 'Previous' buttons share that class name, so it clicks the 'Previous' button and takes me back again!
The code change below works perfectly for me:
nextButton = driver.find_element_by_xpath('//input[@value="Next"]')
nextButton.click()
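(A side note, sketched as an assumption about which Selenium you are running: Selenium 4 removed the find_element_by_* helpers, so under the current API the equivalent would be:)

from selenium.webdriver.common.by import By

nextButton = driver.find_element(By.XPATH, '//input[@value="Next"]')
nextButton.click()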
The full function:
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException

def duckduckGoSearch(query, searchPages=None, filterTheSearch=False, searchFilter=None):
    URL_ = 'https://duckduckgo.com/html?'
    driver = webdriver.Chrome()
    driver.get(URL_)
    searchResults = {}

    # Enter the search query and press Enter to start the search
    driver.find_element_by_xpath('//*[@id="search_form_input_homepage"]').send_keys(query)
    driver.find_element_by_xpath('//*[@id="search_form_input_homepage"]').send_keys(Keys.RETURN)
    time.sleep(2)

    page_number = 1
    while True:
        # loop for the required number of pages
        # (searchPages=None means: keep going until there is no "Next" button)
        if searchPages is not None and page_number > searchPages:
            print('search is done, found', len(searchResults), 'results')
            driver.quit()
            break

        # parse the current page first, so the first results page is not skipped,
        # and collect the result URLs
        soup = BeautifulSoup(driver.page_source, "html.parser")
        Data_Set_div_Tags = soup.findAll('h2') + soup.findAll('div', {'class': 'result__body links_main links_deep'})
        for tag in Data_Set_div_Tags:
            try:
                resultDescription = tag.findAll('a')[0].text
                resultURL = tag.findAll('a')[0]['href']
            except (IndexError, KeyError):
                print('nothing to parse')
                continue
            if resultURL not in searchResults:
                if filterTheSearch:
                    if searchFilter in resultURL:
                        searchResults[resultURL] = resultDescription
                else:
                    searchResults[resultURL] = resultDescription

        # move to the next page; stop when the "Next" button no longer exists
        try:
            nextButton = driver.find_element_by_xpath('//input[@value="Next"]')
            nextButton.click()
            page_number += 1
        except NoSuchElementException:
            print('search is done, found', len(searchResults), 'results')
            print('no more pages')
            driver.quit()
            break

    return searchResults
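(A quick usage sketch; the query and page count here are arbitrary:)

results = duckduckGoSearch('paralegal', searchPages=3)
for url, description in results.items():
    print(url, '-', description)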