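I have some simple text-processing scripts that I use all the time, and I want to translate them into Ruby to get familiar with the language.
Here is the first one, which I can't get to run: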
#!/usr/bin/env ruby
@text = ARGF.read
@replacements = [{:from=>"—", :to=>". "}, {:from=>"ffl", :to=>"ffl"}, {:from=>"ffi", :to=>"ffi"}, {:from=>"fi", :to=>"fi"}, {:from=>"fl", :to=>"fl"}, {:from=>"ff", :to=>"ff"}, {:from=>"æ", :to=>"ae"}, {:from=>"é", :to=>"e"}, {:from=>"Ç", :to=>"s"}, {:from=>"ü", :to=>"u"}, {:from=>"â", :to=>"a"}, {:from=>"ä", :to=>"a"}, {:from=>"à", :to=>"a"}, {:from=>"å", :to=>"a"}, {:from=>"ç", :to=>"s"}, {:from=>"ê", :to=>"e"}, {:from=>"ë", :to=>"e"}, {:from=>"è", :to=>"e"}, {:from=>"ï", :to=>"i"}, {:from=>"î", :to=>"i"}, {:from=>"ì", :to=>"i"}, {:from=>"Ä", :to=>"a"}, {:from=>"Å", :to=>"a"}, {:from=>"É", :to=>"e"}, {:from=>"ô", :to=>"oh"}, {:from=>"ö", :to=>"oe"}, {:from=>"ò", :to=>"o"}, {:from=>"û", :to=>"uu"}, {:from=>"ù", :to=>"u"}, {:from=>"ÿ", :to=>"o"}, {:from=>"Ö", :to=>"o"}, {:from=>"Ü", :to=>"u"}, {:from=>"á", :to=>"ah"}, {:from=>"í", :to=>"ee"}, {:from=>"ó", :to=>"oh"}, {:from=>"ú", :to=>"uu"}, {:from=>"ñ", :to=>"ny"}, {:from=>"Ñ", :to=>"ny"}]
@replacements.each do |pair|
  @text.gsub!(/#{pair[:from]}/, pair[:to])
end
puts @text
This is the error I get:
/home/alec/.bei/under-boac:5: invalid multibyte char (US-ASCII)
/home/alec/.bei/under-boac:5: invalid multibyte char (US-ASCII)
/home/alec/.bei/under-boac:5: syntax error, unexpected $end, expecting '}'
@replacements = [{:from=>"—", :to=>". "}, {:from=>"ffl"...
^
I based part of this on "Best practices with STDIN in Ruby?".
Answer 0 (score: 4)
Here is your basic code, reformatted for readability:
@replacements = [
{ :from => "—", :to => ". " },
{ :from => "ffl", :to => "ffl" },
{ :from => "ffi", :to => "ffi" },
{ :from => "fi", :to => "fi" },
{ :from => "fl", :to => "fl" },
{ :from => "ff", :to => "ff" },
{ :from => "æ", :to => "ae" },
{ :from => "é", :to => "e" },
{ :from => "Ç", :to => "s" },
{ :from => "ü", :to => "u" },
{ :from => "â", :to => "a" },
{ :from => "ä", :to => "a" },
{ :from => "à", :to => "a" },
{ :from => "å", :to => "a" },
{ :from => "ç", :to => "s" },
{ :from => "ê", :to => "e" },
{ :from => "ë", :to => "e" },
{ :from => "è", :to => "e" },
{ :from => "ï", :to => "i" },
{ :from => "î", :to => "i" },
{ :from => "ì", :to => "i" },
{ :from => "Ä", :to => "a" },
{ :from => "Å", :to => "a" },
{ :from => "É", :to => "e" },
{ :from => "ô", :to => "oh" },
{ :from => "ö", :to => "oe" },
{ :from => "ò", :to => "o" },
{ :from => "û", :to => "uu" },
{ :from => "ù", :to => "u" },
{ :from => "ÿ", :to => "o" },
{ :from => "Ö", :to => "o" },
{ :from => "Ü", :to => "u" },
{ :from => "á", :to => "ah" },
{ :from => "í", :to => "ee" },
{ :from => "ó", :to => "oh" },
{ :from => "ú", :to => "uu" },
{ :from => "ñ", :to => "ny" },
{ :from => "Ñ", :to => "ny" }
]
@replacements.each do |pair|
  @text.gsub!( /#{ pair[:from] }/, pair[:to] )
end
This can be simplified by merging the array of small hashes into one large hash, which is the structure this data calls for anyway:
# encoding: utf-8
@replacements = {
"—" => ". " ,
"ffl" => "ffl" ,
"ffi" => "ffi" ,
"fi" => "fi" ,
"fl" => "fl" ,
"ff" => "ff" ,
"æ" => "ae" ,
"é" => "e" ,
"Ç" => "s" ,
"ü" => "u" ,
"â" => "a" ,
"ä" => "a" ,
"à" => "a" ,
"å" => "a" ,
"ç" => "s" ,
"ê" => "e" ,
"ë" => "e" ,
"è" => "e" ,
"ï" => "i" ,
"î" => "i" ,
"ì" => "i" ,
"Ä" => "a" ,
"Å" => "a" ,
"É" => "e" ,
"ô" => "oh" ,
"ö" => "oe" ,
"ò" => "o" ,
"û" => "uu" ,
"ù" => "u" ,
"ÿ" => "o" ,
"Ö" => "o" ,
"Ü" => "u" ,
"á" => "ah" ,
"í" => "ee" ,
"ó" => "oh" ,
"ú" => "uu" ,
"ñ" => "ny" ,
"Ñ" => "ny"
}
@replacements.each do |k,v|
  @text.gsub!(k, v)
end
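Note the "encoding comment" at the top, which tells Ruby how the characters in the source file are encoded.
With the same hash, the loop can be reduced to a single call, which runs much faster: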
@text.gsub!(Regexp.union(@replacements.keys), @replacements)
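As a rough, self-contained sketch of that single-pass form: the method name transliterate and the three-entry SAMPLE_TABLE are made up for illustration and are not part of the question's script.

```ruby
# encoding: utf-8

# Hypothetical helper: replaces every key of `table` found in `text`
# with its value, scanning the text once.
def transliterate(text, table)
  # Regexp.union escapes each string key and joins them with "|".
  pattern = Regexp.union(table.keys)
  # With a Hash as the second argument, gsub looks each match up in the hash.
  text.gsub(pattern, table)
end

# Illustrative subset of the replacement table.
SAMPLE_TABLE = { "é" => "e", "ö" => "oe", "ñ" => "ny" }

puts transliterate("café öl señor", SAMPLE_TABLE)  # prints "cafe oel senyor"
```

The non-destructive gsub is used here so the helper returns a new string. The answer's gsub! modifies @text in place and returns nil when nothing matched.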
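Answer 1 (score: 2)
If your source file contains anything other than 7-bit ASCII, you need to add a header identifying the character set in use. Ruby 1.9 assumes US-ASCII source by default, which is what the "(US-ASCII)" in your error message refers to; Ruby 2.0 and later default to UTF-8. For example: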
#!/usr/bin/env ruby
# encoding: utf-8
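A small sketch of what the magic comment affects, assuming Ruby 1.9 or later, where __ENCODING__ reports the source encoding of the current file. Ruby honours the comment only on the first line, or on the second line when the first is a shebang. The constant SAMPLE is just an illustration.

```ruby
#!/usr/bin/env ruby
# encoding: utf-8

# __ENCODING__ is the encoding Ruby uses to read this source file.
puts __ENCODING__

# String literals in this file carry that encoding.
SAMPLE = "ñ"
puts SAMPLE.encoding
puts SAMPLE.bytesize  # "ñ" occupies two bytes in UTF-8
puts SAMPLE.length    # but counts as a single character
```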
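Also, what you are doing here is quite inefficient: it makes one pass over the text for each of the N replacements, when a properly constructed regular expression lets you do it all in a single pass: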
replace_map = Hash[@replacements.collect { |r| [ r[:from], r[:to] ] }]
replace_regex = Regexp.new("(#{replace_map.keys.collect { |r| Regexp.escape(r) }.join('|')})")
# The block receives the matched text itself, which is the key to look up.
@text.gsub!(replace_regex) do |s|
  replace_map[s]
end
Using a key/value map instead of the odd :from / :to pairs you have there would make this easier.
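For comparison, a sketch of the block form built on a plain key/value map. Inside the block the argument is the matched text, so it can serve directly as the hash key. The names REPLACE_MAP, REPLACE_REGEX and replace_all, and the sample entries, are illustrative and not taken from the question.

```ruby
# encoding: utf-8

# Illustrative map; "." is included to show why escaping matters.
REPLACE_MAP = { "æ" => "ae", "ü" => "u", "." => "!" }

# Escape each key so regex metacharacters such as "." match literally,
# then join the alternatives with "|".
REPLACE_REGEX = Regexp.new(REPLACE_MAP.keys.map { |k| Regexp.escape(k) }.join("|"))

# The block receives each match as a String and returns its replacement.
def replace_all(text)
  text.gsub(REPLACE_REGEX) { |match| REPLACE_MAP[match] }
end

puts replace_all("Cæsar über.")  # prints "Caesar uber!"
```

Without Regexp.escape, the "." key would match any character and the lookup in the block would return nil for most matches.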
Answer 2 (score: 2)
#encoding: utf-8
@text = "öbñ"
@replacements = [{:from=>"—", :to=>". "}, {:from=>"ffl", :to=>"ffl"}, {:from=>"ffi", :to=>"ffi"}, {:from=>"fi", :to=>"fi"}, {:from=>"fl", :to=>"fl"}, {:from=>"ff", :to=>"ff"}, {:from=>"æ", :to=>"ae"}, {:from=>"é", :to=>"e"}, {:from=>"Ç", :to=>"s"}, {:from=>"ü", :to=>"u"}, {:from=>"â", :to=>"a"}, {:from=>"ä", :to=>"a"}, {:from=>"à", :to=>"a"}, {:from=>"å", :to=>"a"}, {:from=>"ç", :to=>"s"}, {:from=>"ê", :to=>"e"}, {:from=>"ë", :to=>"e"}, {:from=>"è", :to=>"e"}, {:from=>"ï", :to=>"i"}, {:from=>"î", :to=>"i"}, {:from=>"ì", :to=>"i"}, {:from=>"Ä", :to=>"a"}, {:from=>"Å", :to=>"a"}, {:from=>"É", :to=>"e"}, {:from=>"ô", :to=>"oh"}, {:from=>"ö", :to=>"oe"}, {:from=>"ò", :to=>"o"}, {:from=>"û", :to=>"uu"}, {:from=>"ù", :to=>"u"}, {:from=>"ÿ", :to=>"o"}, {:from=>"Ö", :to=>"o"}, {:from=>"Ü", :to=>"u"}, {:from=>"á", :to=>"ah"}, {:from=>"í", :to=>"ee"}, {:from=>"ó", :to=>"oh"}, {:from=>"ú", :to=>"uu"}, {:from=>"ñ", :to=>"ny"}, {:from=>"Ñ", :to=>"ny"}]
# Hammer the @replacements into one Hash like {"—"=>". ", "ffl"=>"ffl"}:
from_to = Hash[@replacements.map{|h| h.values}]
# Generate one Regular expression to catch all keys:
re = Regexp.union(from_to.keys)
# Let gsub do the work in one pass:
@text.gsub!(re, from_to)
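This is much the same as @tadman's code. The #encoding: utf-8 line should solve your problem; the remaining lines keep the text from being scanned 38 times.
Answer 3 (score: 0)
You can simplify even further (note that %w splits on whitespace, so here the em-dash maps to "." without the trailing space):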
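One detail worth checking when the loop is collapsed into a single regular expression: alternatives are tried in the order given, so when one key is a prefix of another, the longer key has to come first. In the question's table the three-letter entries already precede the two-letter ones. The keys below are hypothetical stand-ins used only to show the effect.

```ruby
# encoding: utf-8

# Hypothetical table in which "ab" is a prefix of "abc".
PREFIX_TABLE = { "ab" => "1", "abc" => "2" }

# Keys in insertion order: "ab" wins at every position, so "abc" never matches.
NAIVE_RE = Regexp.union(PREFIX_TABLE.keys)

# Longest keys first, so the most specific alternative is tried first.
SORTED_RE = Regexp.union(PREFIX_TABLE.keys.sort_by { |k| -k.length })

puts "abc ab".gsub(NAIVE_RE, PREFIX_TABLE)   # prints "1c 1"
puts "abc ab".gsub(SORTED_RE, PREFIX_TABLE)  # prints "2 1"
```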
@replacements=Hash[*%w[— . ffl ffl ffi ffi fi fi fl fl ff ff æ ae é e Ç s ü u â a ä a à a å a ç s ê e ë e è e ï i î i ì i Ä a Å a É e ô oh ö oe ò o û uu ù u ÿ o Ö o Ü u á ah í ee ó oh ú uu ñ ny Ñ ny]]
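A short sketch of how that %w form behaves, using a three-pair subset. The constants FLAT and TABLE are illustrative names. Because %w splits on whitespace, the ". " value from the question loses its trailing space and is patched afterwards here.

```ruby
# encoding: utf-8

# %w yields a flat array of whitespace-separated words: key, value, key, value...
FLAT = %w[— . æ ae ñ ny]

# Hash[*array] pairs the elements up: {"—"=>".", "æ"=>"ae", "ñ"=>"ny"}
TABLE = Hash[*FLAT]

# Restore the trailing space that %w cannot express without escaping.
TABLE["—"] = ". "

puts "a—b ñ".gsub(Regexp.union(TABLE.keys), TABLE)  # prints "a. b ny"
```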