Retrieving the text between tags

Time: 2011-03-30 17:50:55

Tags: ruby

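I need to create a regular expression that gets everything contained between two tags. The tags are the <block ...> opening tags that my regex attempt below targets, and there can be multiple lines between them. For example: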

<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
<block color="blue">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...      
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...

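Each block tag marks the start of a new block. I tried the following regular expression, but I'm a bit lost on how to specify that anything can go between these tags, including multiple lines, and how to specify that it needs to stop retrieving once it reaches another tag.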

<block color="crimson">(\w+)|<block color="green">(\w+)

Whoops, I forgot to add that I'm not interested in blocks that look like the following:

<block color="purple">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...

4 answers:

Answer 0 (score: 4)

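I wouldn't recommend using a regular expression for this. First see whether you can turn the content into valid HTML by adding the closing tags. Then use something like nokogiri; here is a tutorial: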

http://nokogiri.org/tutorials/parsing_an_html_xml_document.html

Even if you can't clean up the HTML, I'd still give nokogiri a chance; it has handled some pretty badly broken HTML for me before.

Good luck!

Answer 1 (score: 2)

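Parsing HTML with regular expressions gets you into trouble in all but the most trivial, controlled cases. A parser is more robust and, in the long run, usually easier to maintain.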

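The HTML is invalid because the <block> tags are never terminated. That makes parsing with Nokogiri ambiguous, but we can apply a small trick to fix up the markup and then parse it correctly: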

html = <<EOT
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
<block color="blue">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...      
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
EOT

require 'nokogiri'
require 'pp'

doc = Nokogiri::HTML(html.gsub('<block', '</block><block'))
pp doc.search('block').map { |n| n.text }

>> ["\n        This is the text I need and\n        it may also  have other \n        tags in it, and all sorts of \n        things...\n",
>>  "\n        This is the text I need and\n        it may also  have other \n        tags in it, and all sorts of \n        things...\n",
>>  "\n        This is the text I need and\n        it may also  have other \n        tags in it, and all sorts of \n        things...      \n",
>>  "\n        This is the text I need and\n        it may also  have other \n        tags in it, and all sorts of \n        things...\n"]

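With a search and replace, a closing </block> can be inserted in front of every <block> tag. That leaves a stray closing tag before the first occurrence, but everything else is close enough that Nokogiri's HTML fix-ups do the sensible thing. This is what the HTML looks like after the fix-up: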

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
</block><block color="blue">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
</block><block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...      
</block><block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
</block>
</body></html>

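At this point Nokogiri can make sense of the document and search for the individual blocks. I'm using CSS accessors, so if you need finer granularity you can fine-tune the CSS, or switch to XPath.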

Answer 2 (score: 1)

str = %q(<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
<block color="blue">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...      
<block color="green">
        This is the text I need and
        it may also <p> have other </p>
        tags in it, and all sorts of 
        things...)

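# Split on every <block ...> opening tag; what remains is the text of each block.
ar = str.split(/<block color="\w+">\n/)
ar.shift # drop the empty leading element that precedes the first tag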
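The split above returns every block regardless of its color. If only some colors are wanted, as the question asks, one possible variation is to capture the color together with the text and filter afterwards. This is a sketch on a shortened sample; the names `pattern`, `wanted_colors`, and `blocks` are made up for illustration.

```ruby
# Shortened, illustrative input in the same style as the question.
str = <<~EOT
  <block color="green">
    wanted text <p>one</p>
  <block color="purple">
    unwanted text
  <block color="crimson">
    wanted text two
EOT

# Capture the color, then lazily capture everything up to the next
# <block ...> tag or the end of the string. The /m flag lets "." match newlines.
pattern = /<block color="(\w+)">\n(.*?)(?=<block color="\w+">|\z)/m

wanted_colors = %w[green crimson]

# scan returns [color, text] pairs; keep only the wanted colors.
blocks = str.scan(pattern)
            .select { |color, _text| wanted_colors.include?(color) }
            .map { |_color, text| text.strip }

p blocks
# prints ["wanted text <p>one</p>", "wanted text two"]
```

The lookahead `(?=...)` is what makes the match stop at the next tag without consuming it, so the following block can still be matched.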

Answer 3 (score: 0)

A simple way to do this might be to read the input line by line and check whether each line starts with a block tag
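The answer is cut off here, so the following is only a guess at what was meant: a minimal sketch that assumes each block starts on a line beginning with a <block color="..."> tag. The helper name `blocks_by_line` and its parameters are hypothetical.

```ruby
# Hypothetical helper: collects the lines that follow each wanted <block ...> tag.
def blocks_by_line(input, wanted_colors)
  blocks = []
  current = nil # lines of the block being collected, or nil while skipping

  input.each_line do |line|
    if (m = line.match(/\A\s*<block color="(\w+)">/))
      # A new tag ends the previous block and decides whether to keep the next one.
      blocks << current.join if current
      current = wanted_colors.include?(m[1]) ? [] : nil
    elsif current
      current << line
    end
  end

  blocks << current.join if current # flush the last block
  blocks
end

sample = <<~EOT
  <block color="green">
    keep me
    and me
  <block color="purple">
    skip me
  <block color="green">
    keep me too
EOT

result = blocks_by_line(sample, %w[green crimson])
p result
# prints ["  keep me\n  and me\n", "  keep me too\n"]
```

Reading line by line avoids building one large regex, at the cost of assuming that a tag never shares a line with the text that precedes it.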