Question

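I'd like some help parsing text in Ruby.

Given:

@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3

I'd like to remove all hyperlinks and return plain text: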

@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands

Answer 1

foo = "@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3"
r = foo.gsub(/http:\/\/[\w\.:\/]+/, '')
puts r
# @BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
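
Note that the pattern above only matches http:// links, stops at characters such as - or ?, and leaves a trailing space behind. A minimal sketch of a slightly broader variant, assuming links are always delimited by whitespace (the helper name strip_links is made up for illustration and is not part of the original answer):

```ruby
# Hypothetical helper: removes http and https links, then tidies up
# the whitespace they leave behind.
# Assumption: a link runs until the next whitespace character, so
# punctuation glued to the end of a link is removed along with it.
def strip_links(text)
  text.gsub(%r{https?://\S+}, '').squeeze(' ').strip
end

tweet = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
puts strip_links(tweet)
# @BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
```

The squeeze(' ') call also collapses any intentional runs of spaces, so drop it if the spacing of the original text matters.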

Answer 2

This is an old but good question. Here is an answer that relies on Ruby's built-in URI:

require 'uri'

text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'

schemes_regex = /^(?:#{ URI.scheme_list.keys.join('|') })/i

URI.extract(text).each do |url|
  text.gsub!(url, '') if (url[schemes_regex])
end

puts text.squeeze(' ')

Stepping through it in IRB shows what is happening and the resulting output:

First I define the text to search:

irb(main):004:0* text = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
=> "@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3"

Next I define a regex of the URI schemes we want to react to. This is a defensive move, because URI returns false positives in its search step:

irb(main):006:0* schemes_regex = /^(?:#{ URI.scheme_list.keys.join('|') })/i
=> /^(?:FTP|HTTP|HTTPS|LDAP|LDAPS|MAILTO)/i

Let URI walk through the text looking for URLs. For each one found, if it uses a scheme we want to react to, remove all occurrences of it from the text:

irb(main):008:0* URI.extract(text).each do |url|
irb(main):009:1*   text.gsub!(url, '') if (url[schemes_regex])
irb(main):010:1> end

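These are the URLs that URI.extract found. It incorrectly reported BreakingNews: because of the trailing colon. I don't think it is very sophisticated, but for normal use it is fine: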

=> ["BreakingNews:", "http://news.bnonews.com/u4z3"]
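
To see why the scheme check matters, here is a small sketch that applies the same kind of filter to the two candidates shown above. The helper name known_scheme? is hypothetical, and the exact scheme list depends on the Ruby version in use:

```ruby
require 'uri'

# Hypothetical helper: true when the candidate starts with one of the
# schemes that Ruby's URI library knows about (HTTP, HTTPS, FTP, ...).
def known_scheme?(candidate)
  schemes_regex = /^(?:#{URI.scheme_list.keys.join('|')})/i
  candidate.match?(schemes_regex)
end

# The two strings reported by URI.extract in the walkthrough.
candidates = ['BreakingNews:', 'http://news.bnonews.com/u4z3']

# Only the real link survives the filter; the false positive is dropped.
p candidates.select { |candidate| known_scheme?(candidate) }
```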

Show the resulting text:

irb(main):012:0* puts text.squeeze(' ')
@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands
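
As a possible alternative to filtering afterwards, URI.extract accepts an optional list of schemes as its second argument, which should keep the BreakingNews: false positive out of the results in the first place. This is only a sketch: the helper name remove_urls and the default scheme list are assumptions, and URI.extract is flagged as obsolete in current Ruby documentation, so treat it as a convenience rather than a long-term solution.

```ruby
require 'uri'

# Hypothetical helper: extracts only URLs with the given schemes and
# removes every occurrence of each one from a copy of the text.
def remove_urls(text, schemes = %w[http https])
  cleaned = text.dup
  URI.extract(text, schemes).each { |url| cleaned.gsub!(url, '') }
  # Collapse the gaps left by removed URLs and trim the ends.
  cleaned.squeeze(' ').strip
end

tweet = '@BreakingNews: Typhoon Morakot hits Taiwan, China evacuates thousands http://news.bnonews.com/u4z3'
puts remove_urls(tweet)
```

Working on a copy leaves the caller's string untouched, unlike the gsub! on text in the walkthrough above.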

Answer 3

This can be done the quick and dirty way or the sophisticated way. I'm showing the sophisticated way:

require 'rubygems'
require 'hpricot' # you may need to install this gem
require 'open-uri'

## first, get the URL of the embedded/framed HTML file
## (on Ruby 3.0+, open-uri no longer patches Kernel#open; use URI.open there)
start_url = 'http://news.bnonews.com/u4z3'
doc = Hpricot(open(start_url))
news_html_url = doc.at('//link[@href]').to_s.match(/(http[^"]+)/)

## now get the news text; it's in the 3rd <p> tag of the framed HTML file
doc2 = Hpricot(open(news_html_url.to_s))
news_text = doc2.at('//p[3]').to_plain_text
puts news_text

Try to understand what the code does at each step, and apply that knowledge to your future projects. These pages can help:

http://wiki.github.com/why/hpricot/an-hpricot-showcase

http://code.whytheluckystiff.net/doc/hpricot/

How do I remove URLs from text?