Grabbing everything between b elements with Nokogiri

Time: 2015-04-20 09:42:06

Tags: ruby nokogiri

Here is the HTML:

<tr class="level2">
    <td> 
        <b>word</b>
        "Text I need"
        <b>word</b>
        "Text I need"
        <b>word</b>
        "Text I need"
        <b>word</b>
        "Text I need"
        <i>blabla</i>
        "Text I need"
        <b>word</b>
        "Text I need"
        <i>blabla</i>
        "Text I need"
        <i>blabla</i>
        <b>word</b>

    </td>
</tr>

I want to select every node between the <b> elements and then iterate over each of them later. Currently I have:

translations = page.xpath('//text()[preceding-sibling::b]')

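It works fine when there is only text between the <b> elements. But when one or more <i> tags appear between two <b> elements, I only get the first piece of text in the node, and the remaining text ends up in the following nodes. The output I want is: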

node 1: Text I need 
node 2: Text I need 
node 3: Text I need 
node 4: Text I need 
        Text I need 
node 5: Text I need 
        Text I need 

Here is the code:

require 'rubygems'
require 'open-uri'
require 'nokogiri' #parse html
require 'csv'

DATA_DIR = "words"
Dir.mkdir(DATA_DIR) unless Dir.exist?(DATA_DIR) # making directory (File.exists? was removed in Ruby 3.2)
BASE_LINK = "http://dict.ibs.ee/translate.cgi?word=" 
LANGUAGE = "&language=English"
WILDCARD = "*"
SLEEP_TIME = 0.1 # sleep between web requests in seconds
counter = 1 #counter for file name
i = 1
name = "IBSwords"+"#{counter}"+".csv"

alphabet = %w[a b c d e f g h i j k l m n o p q r s t u v w x y z]
four_letter_combinations = alphabet.product(alphabet, alphabet, alphabet).map(&:join)
#combination from 4 letters
for combination in four_letter_combinations
  begin
    i += 1
    if (i % 150000) == 0
      counter += 1
      name = "IBSwords"+"#{counter}"+".csv"
    end
    sleep(SLEEP_TIME)
    link = BASE_LINK+"about"+LANGUAGE
    page = Nokogiri::HTML(URI.open(link)) # Kernel#open no longer opens URLs in Ruby 3+
  rescue StandardError => e
    puts "#{e} No Connection, retrying..."
    sleep 60 # retry in 60 sec if no connection
    retry
  else
    # css returns a NodeSet, never nil, so test for emptiness instead
    unless page.css('blockquote > dl > dd > b').empty?
      puts "*****************#{i} #{combination}***********"
      en_words = page.css('blockquote > dl > dd > b')
      #ee_words = page.css('blockquote > dl > dd').to_s.split(/<b>.*<\/b>/)
      ee_words = page.xpath('//text()[preceding-sibling::b]') 
      # iterating through 
      en_words.zip(ee_words).each do |word, ee_word|
        en_word = word.text.chomp.strip
        ee_trans = ee_word.text.chomp.strip
        #en_desc = word.xpath('td[2]/node()[not(self::strong)]').text
        puts "#{en_word}"
        puts "#{ee_trans}"
        puts "*******************************"
        i += 1
        #writing to csv
        CSV.open("words/#{name}", "ab") do |row| # write to CSV
          row << [
            en_word,
            #en_desc,
            ee_trans,
            #ee_desc
          ]
        end
      end
    end
  end
end

2 answers:

Answer 0 (score: 1)

You are probably looking for an XPath-only solution, but here is one that uses Ruby enumerators:

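# Walk the children of <td> in document order: each <b> starts a new key,
# and every text node that follows is appended to the most recent key.
xml.xpath('//td').children.inject({}) do |memo, node|
  case node.name
  when 'b' then memo["#{node.children.first}"] = ""
  when 'text'
    memo["#{memo.keys.last}"] << "#{node}" unless memo.length.zero?
  else # just skip
  end

  memo
end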

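This gives (for a copy of the sample in which the words and texts are numbered, so that the hash keys stay distinct):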

#⇒ {
#  "word 1" => "\n        \"Text I need 1\"\n        ",
#  "word 2" => "\n        \"Text I need 2\"\n        ",
#  "word 3" => "\n        \"Text I need 3\"\n        ",
#  "word 4" => "\n        \"Text I need 41\"\n        \n        \"Text I need 42\"\n        ",
#  "word 5" => "\n        \"Text I need 51\"\n        \n        \"Text I need 52\"\n        \n        ",
#  "word 6" => "\n\n    "
# }

Hope it helps.

Answer 1 (score: 1)

I trimmed your HTML to make it less verbose. It demonstrates the same thing without the extra text.

I'd do it like this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<tr class="level2">
    <td> 
        <b>word</b>
        "Text I need"
        <b>word</b>
        "Text I need"
        <i>blabla</i>
        "Text I need"
        <b>word</b>
        "Text I need"
        <i>blabla</i>
        "Text I need"
        <i>blabla</i>
        <b>word</b>
    </td>
</tr>
EOT

doc.search('td i').remove

Since the <i> nodes aren't wanted, simply strip them. The resulting doc looks like:

puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <tr class="level2">
# >>     <td> 
# >>         <b>word</b>
# >>         "Text I need"
# >>         <b>word</b>
# >>         "Text I need"
# >>         
# >>         "Text I need"
# >>         <b>word</b>
# >>         "Text I need"
# >>         
# >>         "Text I need"
# >>         
# >>         <b>word</b>
# >> 
# >>     </td>
# >> </tr>
# >> </body></html>

With the <i> nodes gone, it's possible to iterate over the contents of the <td> and process its text:

text = doc.at('td').children
          .reject { |n| n.text.strip == '' }
          .slice_before { |n| n.name == 'b' }
          .map { |a| a.map { |n| n.text.strip } }

At this point text contains:

text
# => [["word", "\"Text I need\""],
#     ["word", "\"Text I need\"", "\"Text I need\""],
#     ["word", "\"Text I need\"", "\"Text I need\""],
#     ["word"]]

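Note that there is a trailing "word", which mirrors the sample HTML you supplied. If you know there will never be trailing text you want to keep, you can simply pop that element off. If you think some elements might consist of a single item, you can walk the list looking for singletons and reject them. How you handle it is up to you.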
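
As a rough sketch of the second option, assuming text has the shape shown above, singleton groups can be rejected and the remaining groups turned into word/translation pairs. The `pairs` name, the removal of the quotation marks, and joining multiple texts with a space are illustrative choices, not requirements:

```ruby
# The structure produced above, written out literally so the sketch runs on its own.
text = [
  ['word', '"Text I need"'],
  ['word', '"Text I need"', '"Text I need"'],
  ['word', '"Text I need"', '"Text I need"'],
  ['word']
]

pairs = text
  .reject { |group| group.size == 1 } # drop words that have no text after them
  .map do |word, *translations|
    # Strip the literal double quotes and merge multiple texts into one string.
    [word, translations.map { |t| t.delete('"') }.join(' ')]
  end

pairs.each { |word, translation| puts "#{word}: #{translation}" }
```

If only the last group can ever be a singleton, `text.pop` would do the same job with less work.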