How do I search for a word using Ruby?

Date: 2014-03-15 00:54:46

Tags: ruby-on-rails ruby

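I have a show named oferson of interest.

In my code I'm trying to split it into individual words, capitalize the first letter of each word, and join them back together with a space between each word, which gives: Oferson Of Interest. Then I want to find the word Of and replace it with lowercase.

The problem I can't figure out is that at the end of the program I get oferson of Interest, which isn't what I want. I only want the word "of" to be lowercase, not the first letter of "Oferson". Put simply, I want the output Oferson of Interest, not oferson of Interest.

How can I search for the word 'of' rather than every occurrence of the letters 'o' and 'f' in the sentence?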

mine = 'oferson of interest'.split(' ').map { |w| w.capitalize }.join(' ')
if mine.include? "Of"
  mine.gsub!(/Of/, 'of')
else
  puts 'noting;'
end

puts mine

2 Answers:

Answer 0 (score: 1)

The simplest answer is to use word boundaries in the regular expression:

str = "oferson of interest".split.collect(&:capitalize).join(" ")
str.gsub!(/\bOf\b/i, 'of')
# => "Oferson of Interest"
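
To see why the boundary matters, here is a small sketch comparing the pattern from the question with the anchored one. The variable names are only for illustration, and the non-destructive gsub is used so both results can be shown side by side.

```ruby
title = 'oferson of interest'.split.map(&:capitalize).join(' ')
# title is now "Oferson Of Interest"

# Without word boundaries, /Of/ also matches the start of "Oferson".
unanchored = title.gsub(/Of/, 'of')

# \b matches only at a word edge, so "Of" must stand alone as a word.
anchored = title.gsub(/\bOf\b/, 'of')

puts unanchored # prints: oferson of Interest
puts anchored   # prints: Oferson of Interest
```

One thing to keep in mind: gsub! returns nil when nothing was replaced, so it is safer not to chain further calls onto its return value. gsub always returns a string.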

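Answer 1 (score: 0)

You're dealing with "stop words": words that, for one reason or another, you don't want to process. Build a list of the stop words you want to ignore, and compare each word against it to decide whether it needs further processing: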

require 'set'

STOPWORDS = %w[a for is of the to].to_set
TEXT = [
  'A stitch in time saves nine',
  'The quick brown fox jumped over the lazy dog',
  'Now is the time for all good men to come to the aid of their country'
]

TEXT.each do |text|
  puts text.split.map{ |w|
    STOPWORDS.include?(w.downcase) ? w.downcase : w.capitalize
  }.join(' ')
end
# >> a Stitch In Time Saves Nine
# >> the Quick Brown Fox Jumped Over the Lazy Dog
# >> Now is the Time for All Good Men to Come to the Aid of Their Country

This is a simple example, but it shows the basics. In real life, you'll want to deal with punctuation, such as hyphens.
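
One possible way to cope with hyphens and other punctuation is to rewrite each run of letters in place instead of splitting on whitespace. This is only a sketch: titleize_words is a hypothetical helper name, and treating letters and apostrophes as the only word characters is an assumption that will not suit every title.

```ruby
require 'set'

STOPWORDS = %w[a for is of the to].to_set

# Hypothetical helper: rewrites each run of letters/apostrophes in place,
# so hyphens, commas, and other punctuation are left untouched.
def titleize_words(text)
  text.gsub(/[[:alpha:]']+/) do |word|
    STOPWORDS.include?(word.downcase) ? word.downcase : word.capitalize
  end
end

puts titleize_words('state-of-the-art, but is it art?')
# prints: State-of-the-Art, But is It Art?
```

The regular expression here only finds the words; the decision about each word still comes from the stop-word list.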

I used a Set because it stays very fast as the list of stop words grows. It is similar to a Hash, so a lookup is faster than calling include? on an Array:

require 'set'
require 'fruity'

LETTER_ARRAY = ('a' .. 'z').to_a
LETTER_SET = LETTER_ARRAY.to_set

compare do
  array { LETTER_ARRAY.include?('0') }
  set   { LETTER_SET.include?('0') }
end
# >> Running each test 16384 times. Test will take about 2 seconds.
# >> set is faster than array by 10x ± 0.1
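
fruity is a third-party gem. If it isn't installed, a rough comparison can be sketched with Ruby's built-in monotonic clock instead. The time_lookups helper and the ITERATIONS count are assumptions made for this illustration, and the timings depend on the machine and Ruby version, so no expected output is shown.

```ruby
require 'set'

LETTER_ARRAY = ('a'..'z').to_a
LETTER_SET = LETTER_ARRAY.to_set
ITERATIONS = 200_000 # arbitrary count, chosen only for illustration

# Hypothetical helper: returns the seconds taken to run the block ITERATIONS times.
def time_lookups
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  ITERATIONS.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# '0' is in neither collection, so Array#include? must scan all 26 elements
# before giving up, while Set#include? does a single hash lookup.
array_time = time_lookups { LETTER_ARRAY.include?('0') }
set_time   = time_lookups { LETTER_SET.include?('0') }

puts format('array: %.4fs', array_time)
puts format('set:   %.4fs', set_time)
```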

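It gets more interesting when you want to protect the first letter of the resulting string, but the simple trick is to force that letter back to uppercase if it matters: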

require 'set'

STOPWORDS = %w[a for is of the to].to_set
TEXT = [
  'A stitch in time saves nine',
  'The quick brown fox jumped over the lazy dog',
  'Now is the time for all good men to come to the aid of their country'
]

TEXT.each do |text|
  str = text.split.map{ |w|
    STOPWORDS.include?(w.downcase) ? w.downcase : w.capitalize
  }.join(' ')
  str[0] = str[0].upcase
  puts str
end
# >> A Stitch In Time Saves Nine
# >> The Quick Brown Fox Jumped Over the Lazy Dog
# >> Now is the Time for All Good Men to Come to the Aid of Their Country

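Unless you're dealing with very consistent text patterns, this isn't a good task for regular expressions. Since you're working with the names of TV shows, you probably won't find much consistency, and your patterns will get complicated quickly.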
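
For the title in the question, the stop-word approach could be wrapped in a small method along these lines. title_case is a hypothetical name chosen for this sketch, and the stop-word list is the same short one used in the examples, not a complete list.

```ruby
require 'set'

STOPWORDS = %w[a for is of the to].to_set

# Hypothetical helper: capitalize every word except stop words,
# then force the first letter of the result to uppercase.
def title_case(text)
  words = text.split.map do |word|
    STOPWORDS.include?(word.downcase) ? word.downcase : word.capitalize
  end
  result = words.join(' ')
  result[0] = result[0].upcase unless result.empty?
  result
end

puts title_case('oferson of interest')      # prints: Oferson of Interest
puts title_case('the aid of their country') # prints: The Aid of Their Country
```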