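Nokogiri: parsing and splitting the content between elements

Date: 2014-07-15 20:19:50

Tags: html ruby parsing nokogiri open-uri

I've Googled half the internet looking for help.

So, what I need is this:

I have an HTML structure like this to parse: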

<div class="foo">
  <div class='bar' dir='ltr'>
    <div id='p1' class='par'>
      <p class='sb'>
        <span id='dc_1_1' class='dx'>
          <a href='/bar32560'>1</a>
        </span>
        Neque porro 
        <a href='/xyz' class='mr'>+</a>
        quisquam est 
        <a href='/xyz' class='mr'>+</a>
        qui. 
      </p>
    </div>
    <div id='p2' class='par'>
      <p class='sb'>
        <span id='dc_1_2' class='dx'>
          <a href='/foo12356'>2</a>
        </span>
        dolorem ipsum 
        <a href='/xyz' class='mr'>+</a>
        quia dolor sit amet, 
        <a href='/xyz' class='mr'>+</a>
        consectetur, adipisci velit.
      </p>
    </div>
    <div id='p3' class='par'>
      <p class='sb'>
        <span id='dc_1_3' class='dx'>
          <a href='/foobar4586'>3</a>
        </span>
        Neque porro quisquam 
        <a href='/xyz' class='mr'>+</a>
        est qui dolorem ipsum quia dolor sit 
        <a href='/xyz' class='mr'>+</a>
        amet, t.
        <a href='/xyz' class='mr'>+</a>
        <span id='dc_1_4' class='dx'>
          <a href='/barefoot4135'>4</a>
        </span>
        consectetur, 
        <a href='/xyz' class='mr'>+</a>
        adipisci veli.
        <span id='dc_1_5' class='dx'>
          <a href='/barfoo05123'>5</a>
       </span>
       Neque porro 
       <a href='/xyz' class='mr'>+</a>
       quisquam est
       <a href='/xyz' class='mr'>+</a>
       qui.
     </p>
   </div>
 </div>
</div>

What I need: to scrape each paragraph, but end up with text objects whose content looks like this:

scraped_body 1 => 1 Neque porro quisquam est qui.
scraped_body 2 => 2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit
scraped_body 3 => 3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.
scraped_body 4 => 4 consectetur, adipisci veli.
scraped_body 5 => 5 Neque porro quisquam est qui.

The code I'm using now:

page = Nokogiri::HTML(open(url))
x = page.css('.mr').remove
x.xpath("//div[contains(@class, 'par')]").map do |node|
  body = node.text
end

My result looks like this:

scraped_body 1 => 1 Neque porro quisquam est qui.
scraped_body 2 => 2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit
scraped_body 3 => 3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t. 4 consectetur, adipisci veli. 5 Neque porro quisquam est qui.

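So this scrapes the full text of each div with class 'par'. What I need is to scrape the text that follows each span, together with the span's content (the number). Or to cut those divs apart before each span.

I need something like:

SPAN.text + P.text - a.mr

I don't know... how to do it.

Please help me parse this. I think I need to scrape after/before each span.

Please help, I've tried everything I could find.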


EDIT for @Duck1337:

I used the following code:

def verses
    page = Nokogiri::HTML(open(url))
    i=0
    x = page.css("p").text.gsub("+", " ").split.join(" ").gsub(". ", ". HAM").split(" HAM").map do |node|
    i+=1
    body = node
    VerseSource.new(body, book_num, number, i)
  end
end

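I need it this way because I'm parsing a large website full of text, and there are more methods besides this one. My final output looks like this:

Saved record with: book: 1, chapter: 1, verse: 1, body: 1 Neque porro quisquam est qui.

However, if a single verse contains more than one sentence, your code splits it at every sentence, so it splits too much.

For example: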

    <div id='p1' class='par'>
      <p class='sb'>
        <span id='dc_1_3' class='dx'>
          <a href='/foobar4586'>1</a>
        </span>
        Neque porro quisquam. Est qui dolorem
        <a href='/xyz' class='mr'>+</a>
        <span id='dc_1_3' class='dx'>
          <a href='/foobar4586'>2</a>
        </span>
        est qui dolorem ipsum quia dolor sit. 
        <a href='/xyz' class='mr'>+</a>
        amet, t.

Your code splits it like this:

Saved record with: book: 1, chapter: 1, verse: 1, body: 1 Neque porro quisquam.
Saved record with: book: 1, chapter: 1, verse: 2, body: Est qui dolorem
Saved record with: book: 1, chapter: 1, verse: 3, body: 2 est qui dolorem ipsum quia dolor sit.

I hope you see what I mean. I really appreciate your support. It would be great if you could modify it!
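
The over-splitting can be reproduced on a plain string, without Nokogiri. The sketch below uses the text of the example above as it would look after whitespace normalisation and removal of the "+" links. The helper names `split_on_sentences` and `split_on_verse_numbers` are made up for illustration, and the second one rests on an assumption: that a standalone number in the text always marks the start of a verse.

```ruby
# Text of the example after normalising whitespace and dropping the "+" links.
SAMPLE = "1 Neque porro quisquam. Est qui dolorem " \
         "2 est qui dolorem ipsum quia dolor sit. amet, t."

# The sentinel approach: cut after every ". ".
def split_on_sentences(text)
  text.gsub(". ", ". HAM").split(" HAM")
end

# Assumption: a standalone run of digits always starts a new verse.
# Split on whitespace that is directly followed by such a number.
def split_on_verse_numbers(text)
  text.split(/\s+(?=\d+\s)/)
end

# Sentence splitting yields 3 pieces for 2 verses.
puts split_on_sentences(SAMPLE).length
split_on_verse_numbers(SAMPLE).each { |verse| puts verse }
```

A verse whose own text contains a number (a year, for instance) would be cut there too, so this only helps if the assumption holds for the page being scraped.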


EDIT: @KARDEIZ

Thanks for your answer! When I use your code in my method, it parses really badly.

def verses
  page = Nokogiri::HTML(open(url))
  i=0
  #page.css(".mr").remove
  page.xpath("//div[contains(@class, 'par')]//span").map do |node|
    node.content.strip.tap do |out|
      while nn = node.next
        break if nn.name == 'span'
        out << ' ' << nn.content.strip if nn.text? && !nn.content.strip.empty?
        node = nn
      end
    end
    i+=1
    body = node
    VerseSource.new(body, book_num, number, i)
  end
end

The output looks like this:

Saved record with: book: 1, chapter: 1, verse: 1, body:  <here is last part of last sentence in first paragraph after "+" sign(href) and before last "+"(href)>
Saved record with: book: 1, chapter: 1, verse: 2, body:  <here is last part of last sentence in second paragraph after "+" sign(href) and before last "+"(href)>
Saved record with: book: 1, chapter: 1, verse: 3, body:
Saved record with: book: 1, chapter: 1, verse: 4, body:
Saved record with: book: 1, chapter: 1, verse: 5, body:  <here is last sentence in third paragraph. It is after last "+" in this paragraph and have no more "+" signs(href)

As you can see, I have no idea how it makes such a mess ;] Can you do something more with it? Thanks a lot!


Regards!

3 answers:

Answer 0: (score: 0)

I saved your input as "temp.html" on my desktop.

require 'open-uri'
require 'nokogiri'

$page_html = Nokogiri::HTML.parse(open("/home/user/Desktop/temp.html"))

output = $page_html.css("p").text.gsub("+", " ").split.join(" ").gsub(". ", ". HAM").split(" HAM")

# I found the pattern ". " in every line, so i replaced ". " with (". HAM")
# I did that by using gsub(". ", ". HAM") this means replace ". " with ". HAM"

# then i split up the string with " HAM" so it preserved the "." in each item in the array


output = ["1 Neque porro quisquam est qui.", "2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit.", "3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.", "4 consectetur, adipisci veli.", "5 Neque porro quisquam est qui."]
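
The sentinel trick described in the comments above can also be written as a single split on "a space that comes right after a period", using a regex lookbehind. This is only a sketch of an equivalent formulation; the method names are hypothetical.

```ruby
# The sentinel version: mark every ". " and split on the marker.
def split_with_sentinel(text)
  text.gsub(". ", ". HAM").split(" HAM")
end

# Same cut points without a sentinel: split on a space preceded by a period.
def split_with_lookbehind(text)
  text.split(/(?<=\.) /)
end

text = "1 Neque porro quisquam est qui. 2 dolorem ipsum quia dolor sit amet."
p split_with_sentinel(text)
p split_with_lookbehind(text)
```

The two differ only when the scraped text itself contains " HAM", in which case the sentinel version would cut there as well.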

EDIT:

 %w[nokogiri open-uri].each{|gem| require gem}     

 $url = "/home/user/Desktop/temp.html"
 def verses
     page = Nokogiri::HTML(open($url))
     i=0
     x = page.css("p").text.gsub("+", " ").split.join(" ").gsub(". ", ". HAM").split(" HAM").map do |node|
         i+=1
         body = node
         VerseSource.new(body, book_num, number, i)
    end
 end

Answer 1: (score: 0)

Try something like:

page.xpath("//div[contains(@class, 'par')]//span").map do |node|
  out = node.content.strip
  if following = node.at_xpath('following-sibling::text()')
    out << ' ' << following.content.strip
  end
  out
end

Combined with at_xpath, the following-sibling::text() XPath returns the first text node that follows the span.

EDIT

I think this does what you need:

html.xpath("//div[contains(@class, 'par')]//span").map do |node|
  node.content.strip.tap do |out|
    while nn = node.next
      break if nn.name == 'span'
      out << ' ' << nn.content.strip if nn.text? && !nn.content.strip.empty?
      node = nn
    end
  end  
end

Output:

[
  "1 Neque porro quisquam est qui.",
  "2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit.",
  "3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.",
  "4 consectetur, adipisci veli.",
  "5 Neque porro quisquam est qui."
]

This can also be done in pure XPath (see "XPath axis, get all following nodes until"), but from a coding standpoint this solution is simpler.

EDIT 2

Try this:

def verses
  page = Nokogiri::HTML(open(url))
  i=0
  page.xpath("//div[contains(@class, 'par')]//span").map do |node|
    body = node.content.strip.tap do |out|
      while nn = node.next
        break if nn.name == 'span'
        out << ' ' << nn.content.strip if nn.text? && !nn.content.strip.empty?
        node = nn
      end
    end
    i+=1
    VerseSource.new(body, book_num, number, i)
  end
end
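
The difference from the version quoted in the question appears to be where `body` comes from. There, the string built inside `tap` was discarded, and `body = node` picked up `node` after the loop had already advanced it to the last sibling visited; here `body` is the return value of `tap`. A plain-Ruby stand-in (no Nokogiri; `collect` and `cursor` are hypothetical names) shows the two values side by side:

```ruby
# Plain-Ruby stand-in for the sibling walk: `cursor` plays the role of `node`.
def collect(parts)
  cursor = parts.first
  text = cursor.dup.tap do |out|
    parts.drop(1).each do |part|
      out << ' ' << part
      cursor = part # like `node = nn` in the loop above
    end
  end
  [text, cursor]
end

text, cursor = collect(["1", "Neque porro", "qui."])
puts text   # the string accumulated inside tap
puts cursor # only the last piece visited
```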

Answer 2: (score: 0)

require 'nokogiri'

your_html =<<END_OF_HTML
<your html here>
END_OF_HTML

doc  = Nokogiri::HTML(your_html)
text_nodes = doc.xpath("//div[contains(@class, 'par')]/p/child::text()")

results = text_nodes.reject do |text_node| 
  text_node.text.match /\A \s+ \z/x  #Eliminate whitespace nodes
end

results.each_with_index do |node, i|
  puts "scraped_body#{i+1} => #{node.text.strip}"
end


--output:--
scraped_body1 => Neque porro quisquam est qui.
scraped_body2 => dolorem ipsum quia dolor sit amet, consectetur, adipisci velit.
scraped_body3 => Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.
scraped_body4 => consectetur, adipisci veli.
scraped_body5 => Neque porro quisquam est qui.
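
A note on the regex used to drop whitespace nodes: the `x` flag tells Ruby to ignore literal whitespace inside the pattern, so `/\A \s+ \z/x` matches the same strings as `/\A\s+\z/`. A quick check, with hypothetical method names:

```ruby
# With the x flag, the spaces in the pattern are only there for readability.
def whitespace_only?(text)
  text.match?(/\A \s+ \z/x)
end

# The same pattern written without the x flag.
def whitespace_only_compact?(text)
  text.match?(/\A\s+\z/)
end

["\n    ", "  qui.  "].each do |sample|
  puts "#{sample.inspect}: #{whitespace_only?(sample)} / #{whitespace_only_compact?(sample)}"
end
```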

Answer for the new HTML:

require 'nokogiri'

html = <<END_OF_HTML
your new html here
END_OF_HTML

html_doc  = Nokogiri::HTML(html)
current_group_number = nil
non_ws_text = []  #non_whitespace_text for each group

html_doc.css("div.par > p").each do |p|   #p's that are direct children of <div class="par">
  p.xpath("./node()").each do |node|  #All Text and Element nodes that are direct children of p tag.
    case node
    when  Nokogiri::XML::Element
      if node.name == 'span'
        node.xpath(".//a").each do |a|  #Step through all the <a> tags inside the <span>
          md = a.text.match(/\A (\d+) \z/xm)  #Check for numbers

          if md  #Then found a number, so it's the start of the next group
            if current_group_number  #then print the results for the current group
              print "scraped_body #{current_group_number} => "
              puts "#{current_group_number} #{non_ws_text.join(' ')}"
              non_ws_text = []
            end
            current_group_number = md[1] #Record the next group number 
            break  #Only look for the first <a> tag containing a number
          end

        end
      end

    when Nokogiri::XML::Text
      text = node.text
      non_ws_text << text.strip if text !~ /\A \s+ \z/xm 
    end

  end
end

#For the last group: 
print "scraped_body #{current_group_number} => "
puts "#{current_group_number} #{non_ws_text.join(' ')}"

--output:--
scraped_body 1 => 1 Neque porro quisquam est qui.
scraped_body 2 => 2 dolorem ipsum quia dolor sit amet, consectetur, adipisci velit.
scraped_body 3 => 3 Neque porro quisquam est qui dolorem ipsum quia dolor sit amet, t.
scraped_body 4 => 4 consectetur, adipisci veli.
scraped_body 5 => 5 Neque porro quisquam est qui.