I'm looking for a way to convert text like this:
" <!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"
\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n <html xml:lang=\"en\" lang=\"en\"
xmlns=\"http://www.w3.org/1999/xhtml\">\n \t<head>\n \t\t<title>My Page
Title</title>\n \t\t<META HTTP-EQUIV=\"Content-Type\" CONTENT=\"text/html; charset=ISO-
8859-1\">\n <style type=\"text/css\" media=\"screen\"> \n \t\tblockquote\n
{\n \tfont-style: italic;\n }\n cite\n {\n
\ttext-align: right;\n \tfont-style: normal;\n }\n .author\n {\n \ttext-align: right;\n \tmargin-right: 80px;\n
}\n </style>\n \t</head>\n \t<body>\n \t\t<h1>My Page
Title</h1>\n<h3>Production Manager</h3>\n<blockquote>\n<p>“I want my passion for
business plan and my pride in my work to show in every step of our company: from the
labels and papers, to our relationships with our customers, to the enjoyment of each bottle
of My Company business plan. As we expand our production, my dream is to plant a company
of my own to specialize in good business, my personal favorite
varietal.”</p>\n</blockquote>\n<p class=\"author\"><cite>- John
Smith</cite></p>\n<p>Born and raised on the north coast of California, John Smith always
felt a deep connection to this......"
into this (not merely the text before the first period, but the whole excerpt below):
My Page Title. Production Manager. I want my passion for business plan and my pride in my
work to show in every step of our company: from the labels and papers, to our
relationships with our customers, to the enjoyment of each bottle of My Company business
plan. As we expand our production, my dream is to plant a company of my own to specialize
in good business, my personal favorite varietal.
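After starting to write something like this myself, I figured it has probably been solved more thoroughly elsewhere. Does anyone have a good one-liner for creating a plain-text excerpt from an HTML string like this (in Ruby)?
I'm using Nokogiri for full-featured HTML parsing, but it seems just as difficult with that.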
Answer 0 (score: 0)
Does it have to be Ruby? I can write it in PHP:
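$text = '<html> ...';
// Strip tags and decode entities, then turn newlines into ". ",
// collapse tabs and spaces, and swap double quotes for single quotes.
$result = preg_replace(array('/\\n+/', '/[\\t ]+/', '/"/'), array('. ', ' ', '\''), html_entity_decode(strip_tags($text)));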
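For readers who do need Ruby, the same strip-and-decode idea can be roughly sketched with the standard library alone. This is only an illustration under stated assumptions: `strip_html` is a hypothetical helper name, the regex-based tag removal is crude compared with a real parser such as Nokogiri, and `CGI.unescapeHTML` decodes only a small set of entities.

```ruby
require "cgi"

# Hypothetical helper (not from the answer): crude, regex-based HTML-to-text.
# Assumes reasonably well-formed markup; a real parser is more robust.
def strip_html(html)
  # Drop <style> and <script> elements together with their contents.
  text = html.gsub(%r{<(style|script)\b.*?</\1>}mi, "\n")
  # Replace every remaining tag with a newline so block boundaries survive.
  text = text.gsub(/<[^>]*>/, "\n")
  # Decode basic entities such as &amp; and &lt;.
  text = CGI.unescapeHTML(text)
  # Collapse runs of spaces/tabs, then collapse blank lines.
  text.gsub(/[ \t]+/, " ").gsub(/\s*\n\s*/, "\n").strip
end

puts strip_html("<style>p { color: red; }</style><h1>My Page Title</h1>\n<p>Tom &amp; Jerry</p>")
# prints:
# My Page Title
# Tom & Jerry
```

The result keeps one text block per line, which leaves the choice of separator (for example ". ") to a later step.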
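Answer 1 (score: 0)
Hmm. That seems like quite a lot of functionality for a one-liner. If you just want to parse an HTML page and display it as plain text, I'd suggest using w3m.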
string = "..." # your string
IO.popen("w3m -T text/html", "r+") do |pipe|
  pipe.write string
  pipe.close_write
  puts pipe.read
end
This gives me:
My Page Title Production Manager “I want my passion for business plan and my pride in my work to show in every step of our company: from the labels and papers, to our relationships with our customers, to the enjoyment of each bottle of My Company business plan. As we expand our production, my dream is to plant a company of my own to specialize in good business, my personal favorite varietal.” - John Smith Born and raised on the north coast of California, John Smith always felt a deep connection to this......
For the remaining substitutions, I'd suggest applying regexp replacements before or after this processing, depending on your exact needs.
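As one possible shape for that post-processing step, here is a minimal sketch. It assumes the rendered text arrives with one block per line (a tool's real output may be wrapped differently), and `excerpt_from_lines` is a hypothetical name introduced only for this illustration.

```ruby
# Hypothetical helper (not from the answer): joins rendered plain-text lines
# into a one-line excerpt in which every block ends with punctuation.
def excerpt_from_lines(text)
  text.lines
      .map { |line| line.gsub(/[“”"]/, "").gsub(/\s+/, " ").strip } # drop quotes, tidy whitespace
      .reject(&:empty?)                                             # skip blank lines
      .map { |line| line =~ /[.!?:]\z/ ? line : "#{line}." }        # add a period where missing
      .join(" ")
end

sample = "My Page Title\n\nProduction Manager\n\n  “I want my passion to show.”\n"
puts excerpt_from_lines(sample)
# prints: My Page Title. Production Manager. I want my passion to show.
```

Trimming the excerpt to a fixed number of sentences or characters would be a further regexp or slicing step, depending on the exact needs.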