Generate a raw string meta description for an HTML page from an HTML string in Ruby?

Time: 2010-01-03 22:46:51

Tags: html ruby parsing seo

I'm looking for a way to convert text like this:


"  <!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"
\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n  <html xml:lang=\"en\" lang=\"en\"
 xmlns=\"http://www.w3.org/1999/xhtml\">\n   \t<head>\n   \t\t<title>My Page 
Title</title>\n   \t\t<META HTTP-EQUIV=\"Content-Type\" CONTENT=\"text/html; charset=ISO-
8859-1\">\n      <style type=\"text/css\" media=\"screen\"> \n       \t\tblockquote\n
{\n \tfont-style: italic;\n }\n cite\n {\n
\ttext-align: right;\n \tfont-style: normal;\n }\n .author\n {\n \ttext-align: right;\n \tmargin-right: 80px;\n
}\n </style>\n \t</head>\n \t<body>\n \t\t<h1>My Page Title</h1>\n<h3>Production Manager</h3>\n<blockquote>\n<p>&#8220;I want my passion for business plan and my pride in my work to show in every step of our company: from the labels and papers, to our relationships with our customers, to the enjoyment of each bottle of My Company business plan. As we expand our production, my dream is to plant a company of my own to specialize in good business, my personal favorite varietal.&#8221;</p>\n</blockquote>\n<p class=\"author\"><cite>- John Smith</cite></p>\n<p>Born and raised on the north coast of California, John Smith always felt a deep connection to this......"

into this:

My Page Title. Production Manager. I want my passion for business plan and my pride in my work to show in every step of our company: from the labels and papers, to our relationships with our customers, to the enjoyment of each bottle of My Company business plan. As we expand our production, my dream is to plant a company of my own to specialize in good business, my personal favorite varietal.

That just extracts all the text up to the first period. But it has to:

  • Strip HTML tags
  • Replace \n with ". " (and runs such as \n\n\n with a single ". ")
  • Replace \t with " "
  • Replace \s+ with " "
  • Replace " with '
  • Something like that

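After starting to do something like that, I figured this has probably been solved more thoroughly elsewhere. Does anyone have a good one-liner for creating a plain-text excerpt from an HTML string like this (in Ruby)?

I use Nokogiri for full-featured HTML parsing, but it seems just as hard to do with that.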
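
The steps listed above can be approximated with the Ruby standard library alone. The sketch below is an illustration under stated assumptions, not code from the original post: `html_excerpt` and `NAMED_ENTITIES` are hypothetical names, regular expressions are not a real HTML parser, and dropping the whole `<head>` block (so the title and CSS do not leak into the excerpt) is a choice made by this sketch. It applies the newline rule literally, so a paragraph that already ends in a period gets a second one.

```ruby
# Hypothetical helper, not from the original post: a regex-only approximation
# of the listed steps. Regular expressions are not a real HTML parser.
NAMED_ENTITIES = { 'amp' => '&', 'lt' => '<', 'gt' => '>', 'quot' => '"' }.freeze

def html_excerpt(html)
  # Assumption: drop <head>, <script> and <style> blocks together with their content.
  text = html.gsub(%r{<(head|script|style)\b.*?</\1\s*>}mi, ' ')
  # Strip the remaining tags, including the doctype.
  text = text.gsub(/<[^>]*>/, '')
  # Decode numeric entities such as &#8220; and a handful of named ones.
  text = text.gsub(/&#(\d+);/) { [Regexp.last_match(1).to_i].pack('U') }
  text = text.gsub(/&(amp|lt|gt|quot);/) { NAMED_ENTITIES[Regexp.last_match(1)] }
  # Newline runs (with surrounding whitespace) become ". ".
  text = text.strip.gsub(/\s*\n\s*/, '. ')
  # Tabs and other whitespace runs become a single space.
  text = text.gsub(/\s+/, ' ')
  # Double quotes become single quotes.
  text.tr('"', "'")
end

puts html_excerpt("<h1>My Page Title</h1>\n<h3>Production Manager</h3>")
# prints: My Page Title. Production Manager
```

Because tags are replaced with nothing, block elements that are not separated by a newline in the source would run together; a parser-based approach would handle that more robustly.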

2 Answers:

Answer 0: (score: 0)

Does it have to be Ruby? I could write it in PHP:

$text = '<html> ...';
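// Newline runs become ". ", other whitespace runs become " ", and double quotes become single quotes.
$result = preg_replace(array('/\\n+/', '/\\s+/', '/"/'), array('. ', ' ', '\''), html_entity_decode(strip_tags($text)));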

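Answer 1: (score: 0)

Hmm. That seems like quite a lot of functionality for a one-liner. If you just want to parse an HTML page and display it as plain text, I'd suggest using w3m: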

string = "..." # your string

IO.popen("w3m -T text/html", "r+") do |pipe|
  pipe.write string
  pipe.close_write
  puts pipe.read
end

This gives me:

My Page Title

Production Manager

    “I want my passion for business plan and my pride in my work to show in
    every step of our company: from the labels and papers, to our relationships
    with our customers, to the enjoyment of each bottle of My Company business
    plan. As we expand our production, my dream is to plant a company of my own
    to specialize in good business, my personal favorite varietal.”

- John Smith

Born and raised on the north coast of California, John Smith always felt a deep
connection to this......

For the rest of the replacements, I'd suggest applying regexp substitutions before or after this processing, depending on your exact needs.
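
One possible way to do that post-processing on w3m-style output, sketched under assumptions: `flatten_text` is a hypothetical helper name, blank lines are treated as paragraph breaks, and single newlines are treated as line wrapping to be rejoined with a space. A period is appended only to paragraphs that do not already end in sentence punctuation (optionally followed by a closing quote), which avoids doubled periods.

```ruby
# Hypothetical helper, not from the original answer: flatten wrapped,
# paragraph-separated plain text into a single line.
def flatten_text(text)
  text.strip
      .split(/\n\s*\n/)                               # blank lines separate paragraphs
      .map { |para| para.gsub(/\s+/, ' ').strip }     # rejoin wrapped lines, collapse whitespace
      .reject(&:empty?)
      .map { |para| para =~ /[.!?]["'\u201D\u2019]?\z/ ? para : "#{para}." }
      .join(' ')
      .tr('"', "'")                                   # double quotes become single quotes
end

sample = "My Page Title\n\nProduction Manager\n\n    \u201CI want\n    this.\u201D\n\n- John Smith\n"
puts flatten_text(sample)
```

The string returned by `pipe.read` in the snippet above could be passed to `flatten_text` instead of being printed directly.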