How should I use a recursive method in Ruby?

Asked: 2016-10-08 07:04:26

Tags: ruby

I wrote a simple web scraper with Mechanize, and I am stuck on how to fetch the next page recursively. The code is below.

def self.generate_page  # generate a Mechanize page object for the first page
    agent = Mechanize.new
    url = "http://www.baidu.com/s?wd=intitle:#{URI.encode(WORD)}%20site:sina.com.cn&rn=50&gpc=stf#{URI.encode(TIME)}"
     page = agent.get(url)
     page  
end

def self.next_page(n_page)  # get the next page recursively by clicking the "next" link shown on each page
 puts n_page   
# if I don't use puts, I get nothing; when using puts, I get
#<Mechanize::Page:0x007fd341c70fd0>
#<Mechanize::Page:0x007fd342f2ce08>
#<Mechanize::Page:0x007fd341d0cf70>
#<Mechanize::Page:0x007fd3424ff5c0>
#<Mechanize::Page:0x007fd341e1f660>
#<Mechanize::Page:0x007fd3425ec618>
#<Mechanize::Page:0x007fd3433f3e28>
#<Mechanize::Page:0x007fd3433a2410>
#<Mechanize::Page:0x007fd342446ca0>
#<Mechanize::Page:0x007fd343462490>
#<Mechanize::Page:0x007fd341c2fe18>
#<Mechanize::Page:0x007fd342d18040>
#<Mechanize::Page:0x007fd3432c76a8>  
# which are the results I want

    np = Mechanize.new.click(n_page.link_with(:text=>/next/)) unless n_page.link_with(:text=>/next/).nil?
     result = next_page(np) unless np.nil?
     result    # here the value is empty; I don't know what is wrong
end

def self.get_page  # trying to pass on the result of the next_page() method
    puts  next_page(generate_page)
    # it seems the result is never passed back here
end

I followed these two links, "What is recursion and how does it work?" and "Ruby recursive function", but I still can't figure out what is wrong. I hope someone can help. Thanks.

1 answer:

Answer 0 (score: 2)

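There are a few problems with your code:

  1. You shouldn't call Mechanize.new more than once.
  2. From a style perspective, you are doing too many nil checks.
  3. Unless you have a preference for recursion, it would probably be easier to do this iteratively.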
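
As a side note on why the question's next_page ends up empty: each call returns only what the deeper call returned, and the deepest call (the one on the last page) has nothing to return, so nil travels all the way back up. A minimal sketch of the same shape, using a hypothetical Node struct in place of Mechanize::Page (Node and last_result are made-up names for illustration only):

```ruby
# Hypothetical stand-in for a page: it only knows its name and its successor.
Node = Struct.new(:name, :next_node)

# Same shape as the question's next_page: nothing is accumulated,
# so each call just hands back the result of the deeper call.
def last_result(node)
  nxt = node.next_node
  result = last_result(nxt) unless nxt.nil?
  result # nil on the last node, and therefore nil in every caller too
end

c = Node.new("c", nil)
b = Node.new("b", c)
a = Node.new("a", b)

value = last_result(a)
puts value.inspect # prints "nil"
```

The method visits every node, which is why puts inside it shows each page, but no call ever adds its own page to the value it returns.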

    To have your next_page method return an array containing every linked page in the chain, you could write this:

    require 'mechanize'

    # create the Mechanize agent once and reuse it (stored here in a constant)
    Agent = Mechanize.new
    
    # a helper method to DRY up the code;
    # returns the next page, or nil when the page has no "next" link
    def click_to_next_page(page)
      link = page.link_with(:text => /next/)
      link && Agent.click(link)
    end
    
    # repeatedly visits the next page until none exists
    # returns all seen pages as an array
    def get_all_next_pages(page)
      results = []
      while (page = click_to_next_page(page))
        results.push(page)
      end
      results
    end
    
    # trying it out (untested; WORD and TIME are the constants from the question)
    base_url = "http://www.baidu.com/s?wd=intitle:#{URI.encode(WORD)}%20site:sina.com.cn&rn=50&gpc=stf#{URI.encode(TIME)}"
    root_page = Agent.get(base_url)
    next_pages = get_all_next_pages(root_page)
    puts next_pages
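
If you do prefer recursion, the same result can be built by having each call return its own next page plus whatever the deeper calls return. The following is only a sketch under assumptions: FakePage, collect_next_pages and the follow lambda are hypothetical names used so the example runs without a network connection, and they are not part of Mechanize.

```ruby
# Hypothetical stand-in for Mechanize::Page: it only knows its name and its successor.
FakePage = Struct.new(:name, :next_page)

# Recursively collects every page that follows `page`.
# `follow` is any callable that returns the next page, or nil when there is none.
def collect_next_pages(page, follow)
  nxt = follow.call(page)
  return [] if nxt.nil?                    # base case: no next page
  [nxt] + collect_next_pages(nxt, follow)  # this page's successor plus the rest
end

page3 = FakePage.new("page 3", nil)
page2 = FakePage.new("page 2", page3)
page1 = FakePage.new("page 1", page2)

follow = ->(page) { page.next_page }
names = collect_next_pages(page1, follow).map(&:name)
puts names.inspect # prints ["page 2", "page 3"]
```

With real Mechanize pages, follow could wrap a helper such as click_to_next_page above, provided that helper returns nil when a page has no "next" link. Each recursive call adds a stack frame, so for very long result chains the iterative version is the safer choice.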