Question

I'm having some trouble scraping a website with Nokogiri in Ruby.

Here is what the site's markup looks like:

<div id="post_message_111112" class="postcontent">

        Hee is text 1 
     here is another
      </div>
<div id="post_message_111111" class="postcontent">

            Here is text 2
    </div>

Here is my code to parse it:

 doc = Nokogiri::HTML(open(myNewLink))
 myPost = doc.xpath("//div[@class='postcontent']/text()").to_a()

ii=0

 while ii!=myPost.length
     puts "#{ii}  #{myPost[ii].to_s().strip}"
   ii+=1
 end

My problem is that when the results are displayed, the new line after "Hee is text 1" makes the array from to_a come out split like this:

myPost[0] = hee is text 1
myPost[1] = here is another
myPost[2] = here is text 2

I want each div to be its own message, like this:

myPost[0] = hee is text 1 here is another
myPost[1] = here is text 2

How would I fix this?

Update

I tried

 myPost = doc.xpath("//div[@class='postcontent']/text()").to_a()

myPost.each_with_index do |post, index|
  puts "#{index}  #{post.to_s().gsub(/\n/, ' ').strip}"
end

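I used post.to_s().gsub because it complained that gsub is not a method of post. But I still have the same problem. I know I'm doing something wrong, and it's wrecking my head.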

Update 2

I forgot to say that the new line is a <br />, and even with

   doc.search('br').each do |n|
  n.replace('')
end

or

doc.search('br').remove

the problem still persists.

Answer 1

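If you look at the myPost array, you will see that each div is actually its own message. The first one just happens to contain a newline character \n. To replace it with a space, use #gsub(/\n/, ' '). So your loop would look like this: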

myPost.each_with_index do |post, index|
    puts "#{index}  #{post.to_s.gsub(/\n/, ' ').strip}"
end

Edit

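From my limited understanding of it, xpath can only find nodes. The <br /> is a child node, so either you get several text nodes, one on each side of it, or you include the div tags in your search. There is surely a way to join the text around the <br /> nodes, but I don't know it. Until you find it, here is something that works: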

Replace your xpath match with "//div[@class='postcontent']"

Adjust your loop to remove the div tags:

myPost.each_with_index do |post, index|
    post = post.to_s
    post.gsub!(/\n/, ' ')
    post.gsub!(/^<div[^>]*>/, '')      # delete opening div tag
    post.gsub!(%r|</\s*div[^>]*>|, '') # delete closing div tag
    puts "#{index} #{post.strip}"
end

Answer 2

Here, let me clean that up for you:

doc.search('div.postcontent').each_with_index do |div, i|
  puts "#{i} #{div.text.gsub(/\s+/, ' ').strip}"
end
# 0 Hee is text 1 here is another
# 1 Here is text 2
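
The difference between gsub(/\n/, ' ') and gsub(/\s+/, ' ') can be seen on a plain string, without Nokogiri. This is only a sketch: the raw string below is an assumed approximation of what div.text might return for the first post, and the variable names are illustrative.

```ruby
# Assumed approximation of the first div's text, including line breaks and indentation.
raw = "\n\n        Hee is text 1 \n     here is another\n      "

# Replacing only "\n" turns each line break into a space but keeps the indentation.
newline_only = raw.gsub(/\n/, ' ').strip

# Collapsing every whitespace run (spaces, tabs, newlines) leaves single spaces.
normalized = raw.gsub(/\s+/, ' ').strip

puts newline_only.inspect
puts normalized.inspect
```

The first result still has a run of spaces between the two lines of text, while the second has exactly one space between words.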
