使用正则表达式搜索网页

时间:2016-04-22 04:23:26

标签: ruby regex web-scraping

我希望在网页中搜索包含'small business'的句子,并对页面上的每个链接执行相同操作,包括三层或四层深度。

我的尝试是这样的:

    def get_sentences
      sentences = []
      doc = Nokogiri::HTML(open("http://www.brampton.ca/EN/Business/Pages/top-links.aspx"))
      @sentences = doc.search(/[^.]*small business[^.]*\./i)
      links = doc.search('a[href]').select{ |n| n['href'][/\.html$/] }.map{ |n| n['href'] })

      doc1 = links.each { |x| Nokogiri::HTML(open(x)) }
      @sentences << doc1.search(/[^.]*small business[^.]*\./ig)
      links1 = links.each { |x| x.search('a[href]').select{ |n| n['href'][/\.html$/] }.map{ |n| n['href'] }

      doc2 = links1.each { |x| Nokogiri::HTML(open(x)) }
      @sentences << doc2.search(/[^.]*small business[^.]*\./ig)
      links2 = links1.each { |x| x.search('a[href]').select{ |n| n['href'][/\.html$/] }.map{ |n| n['href'] }

      doc3 = links2.each { |x| Nokogiri::HTML(open(x)) }
        @sentences << doc3.search(/[^.]*small business[^.]*\./ig)
      end





edit, narrowed it down to this lol


@sentences = []
    doc = Nokogiri::HTML(open("https://en.wikipedia.org/wiki/Small_business"))
    regex = /[^.]*small business[^.]*\./i
    a = doc.traverse { |x| 
      if x.text =~ regex
        @sentences << x
      end

但是我可能会在一个月之后离开我的联盟。

..........工作!

0 个答案:

没有答案