Scraping DuckDuckGo with Python 3.6

Date: 2017-10-16 13:51:36

Tags: python selenium-webdriver

A simple question: I can scrape the search results from the first page of a DuckDuckGo search, but I am struggling to get to the second and subsequent pages. I am using Python with Selenium WebDriver, which works fine for the first page of results. The code I use to scrape the first page is:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome()  # any WebDriver works; Chrome is assumed here

results_url = "https://duckduckgo.com/?q=paralegal&t=h_&ia=web"
browser.get(results_url)

# Print the text of the container(s) holding the result links.
results = browser.find_elements_by_id('links')
num_page_items = len(results)
for i in range(num_page_items):
    print(results[i].text)
    print(len(results))

# Try to reach the next batch of results by scrolling.
nxt_page = browser.find_element_by_link_text("Load More")
if nxt_page:
    nxt_page.send_keys(Keys.PAGE_DOWN)

There are some breaks that mark the start of a new page, but they do not seem to change the URL, so I tried the code above to move down the page and then repeated the scraping code to find the links on the next page. But it does not work. Any help would be much appreciated.
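One common workaround on the JavaScript version of the site (a sketch, not taken from the answers below) is to scroll to the bottom and then explicitly wait for the result count to grow, since new results are appended to the same page rather than served under a new URL. This assumes the individual results carry the class name result:

from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait

# Count the results currently on the page ('result' is an assumed class name).
count_before = len(browser.find_elements_by_class_name('result'))

# Scroll to the bottom so the next batch of results is requested.
browser.find_element_by_tag_name('body').send_keys(Keys.END)

# Wait up to 10 seconds for additional results to be appended.
WebDriverWait(browser, 10).until(
    lambda d: len(d.find_elements_by_class_name('result')) > count_before
)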

2 Answers:

Answer 0 (score: 0)

If I search for Load More in the source code of the search results, I cannot find it. Have you tried using the non-JavaScript version?

You can use it by simply adding html to the URL: https://duckduckgo.com/html?q=paralegal&t=h_&ia=web. There you will find a next button at the end of the page.

This one works for me (Chrome version):

results_url = "https://duckduckgo.com/html?q=paralegal&t=h_&ia=web"
browser.get(results_url)
results = browser.find_elements_by_id('links')
num_page_items = len(results)
for i in range(num_page_items):
    print(results[i].text)
    print(len(results))

# The "Next" button on the HTML version carries the class 'btn--alt'.
nxt_page = browser.find_element_by_class_name('btn--alt')
if nxt_page:
    # Scroll the button into view first, otherwise the click can be intercepted.
    browser.execute_script('arguments[0].scrollIntoView();', nxt_page)
    nxt_page.click()

By the way: DuckDuckGo also offers a nice API, which might be easier to use ;)
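For reference, a minimal sketch of the API hinted at above (the endpoint and parameters are my assumption; the answer does not name them). The Instant Answer API returns JSON, so no browser automation is needed, though note it serves instant-answer data rather than the full list of web results:

import requests

# Query the Instant Answer API (assumed endpoint, not named in the answer).
response = requests.get(
    "https://api.duckduckgo.com/",
    params={"q": "paralegal", "format": "json", "no_html": 1},
)
data = response.json()
print(data.get("AbstractText"))
for topic in data.get("RelatedTopics", []):
    # Topics can be plain results or nested groups; plain ones carry FirstURL.
    if "FirstURL" in topic:
        print(topic["FirstURL"], "-", topic.get("Text", ""))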

Answer 1 (score: 0)

The class 'btn--alt' will not work once you move to the second page, because both the 'Next' and 'Previous' buttons share that class name, so it clicked the Previous button and took me back again!

The code change below works perfectly for me:

nextButton = driver.find_element_by_xpath('//input[@value="Next"]')
nextButton.click()

Full function:

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys


def duckduckGoSearch(query, searchPages=1, filterTheSearch=False, searchFilter=None):

    URL_ = 'https://duckduckgo.com/html?'
    driver = webdriver.Chrome()
    driver.get(URL_)

    searchResults = {}

    # Enter the search query and press Return to run the search.
    driver.find_element_by_xpath('//*[@id="search_form_input_homepage"]').send_keys(query)
    driver.find_element_by_xpath('//*[@id="search_form_input_homepage"]').send_keys(Keys.RETURN)

    time.sleep(2)

    page_number = 1

    while True:

        # Loop for the required number of pages.
        if page_number <= searchPages:

            try:
                # NOTE: Next is clicked before parsing, so the first page of
                # results is skipped by this logic.
                nextButton = driver.find_element_by_xpath('//input[@value="Next"]')
                nextButton.click()

                page_number += 1

                try:
                    webPageSource = driver.page_source

                    # Parse the page and collect the URLs of the results.
                    soup = BeautifulSoup(webPageSource, "html.parser")

                    Data_Set_div_Tags = soup.findAll('h2') + soup.findAll(
                        'div', {'class': 'result__body links_main links_deep'})

                    for i in range(0, len(Data_Set_div_Tags)):

                        try:
                            resultDescription = Data_Set_div_Tags[i].findAll('a')[0].text
                            resultURL = Data_Set_div_Tags[i].findAll('a')[0]['href']
                        except (IndexError, KeyError):
                            print('nothing to parse')
                            continue

                        if resultURL not in searchResults.keys():
                            if filterTheSearch:
                                if searchFilter in resultURL:
                                    searchResults[resultURL] = resultDescription
                            else:
                                searchResults[resultURL] = resultDescription

                except Exception:
                    print('search is done, found', len(searchResults), 'results')
                    break

            except Exception:  # no Next button left, so stop paging
                print('search is done, found', len(searchResults), 'results')
                print('no more pages')
                driver.quit()
                break

        else:
            print('search is done, found', len(searchResults), 'results')
            driver.quit()
            break

    return searchResults
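
A hypothetical call to the function above (the argument values are illustrative, not from the answer):

results = duckduckGoSearch('paralegal', searchPages=2,
                           filterTheSearch=True, searchFilter='.edu')
for url, description in results.items():
    print(url, '->', description)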