Replacing accented characters in Ruby 1.9.3 without Rails

Asked: 2012-06-18 21:40:16

Tags: ruby character-encoding

I would like to use Ruby 1.9.3 to replace accented UTF-8 characters with their ASCII equivalents. For example,

Acsády  -->  Acsady

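The traditional approach is to use the Iconv package, which is part of the Ruby standard library. You can do this:

require 'iconv'

str = 'Acsády'
Iconv.iconv('ascii//TRANSLIT', 'utf8', str)

which produces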

Acsa'dy

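The apostrophe then has to be removed. Although this approach still works in Ruby 1.9.3, I get a warning that Iconv is deprecated and that String#encode should be used instead. However, String#encode does not offer quite the same functionality. By default, undefined characters raise an exception, but you can handle them in one of two ways: set undef: :replace (which replaces undefined characters with the default '?' character), or use the :fallback option, which takes a hash that maps characters undefined in the target encoding to replacements. I would like to know whether a standard fallback hash is available, either in the standard library or through some gem, so that I don't have to write my own hash to handle every possible accent.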
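For reference, the three String#encode behaviours described above can be sketched as follows. This is a minimal illustration, not a complete solution: the one-entry fallback hash is a made-up example, and a real table would need an entry for every accented character you expect.

```ruby
# encoding: utf-8
str = 'Acsády'

# Default behaviour: a character with no ASCII equivalent raises.
begin
  str.encode('ASCII')
rescue Encoding::UndefinedConversionError => e
  warn "conversion failed: #{e.message}"
end

# undef: :replace substitutes the default '?' for each undefined character.
replaced = str.encode('ASCII', undef: :replace)                # "Acs?dy"

# Adding replace: '' drops undefined characters instead.
dropped = str.encode('ASCII', undef: :replace, replace: '')    # "Acsdy"

# fallback: takes a hash (illustrative one-entry table here).
mapped = str.encode('ASCII', fallback: { 'á' => 'a' })         # "Acsady"
```

Note that a fallback hash without an entry for a given character still raises for that character, so it is only as good as its coverage.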

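@raina77ow: Thanks for your reply. That is exactly what I was looking for. However, after reading the thread you linked to, I realized that a better solution might be to simply match unaccented characters against their accented equivalents, the way databases do with character-set collations. Does Ruby have anything equivalent to a collation?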
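As far as I know, core Ruby has no built-in collation, but a rough approximation is to fold both strings to an accent-free key before comparing them. The sketch below assumes Ruby 2.2 or later, because String#unicode_normalize does not exist in 1.9.3; the helper names fold_accents and same_ignoring_accents? are hypothetical. It decomposes each character (NFD) and removes the combining marks, so it only handles accents that Unicode represents as combining marks (it leaves letters such as 'ø' or 'æ' untouched).

```ruby
# encoding: utf-8
# Assumption: Ruby >= 2.2 (String#unicode_normalize is unavailable in 1.9.3).

# Hypothetical helper: decompose to NFD, then strip non-spacing marks.
def fold_accents(s)
  s.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

# Hypothetical helper: accent-insensitive (but case-sensitive) equality.
def same_ignoring_accents?(a, b)
  fold_accents(a) == fold_accents(b)
end

same_ignoring_accents?('Acsády', 'Acsady')
```

This is not a true collation (it defines no locale-aware sort order); it only gives an accent-insensitive equality test.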

3 Answers:

Answer 0 (score: 3)

I use this:

def convert_to_ascii(s)
  undefined = ''
  fallback = { 'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A',
               'Å' => 'A', 'Æ' => 'AE', 'Ç' => 'C', 'È' => 'E', 'É' => 'E',
               'Ê' => 'E', 'Ë' => 'E', 'Ì' => 'I', 'Í' => 'I', 'Î' => 'I',
               'Ï' => 'I', 'Ñ' => 'N', 'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O',
               'Õ' => 'O', 'Ö' => 'O', 'Ø' => 'O', 'Ù' => 'U', 'Ú' => 'U',
               'Û' => 'U', 'Ü' => 'U', 'Ý' => 'Y', 'à' => 'a', 'á' => 'a',
               'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a', 'æ' => 'ae',
               'ç' => 'c', 'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
               'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i', 'ñ' => 'n',
               'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o',
               'ø' => 'o', 'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
               'ý' => 'y', 'ÿ' => 'y' }
  s.encode('ASCII',
           fallback: lambda { |c| fallback.key?(c) ? fallback[c] : undefined })
end

You can check here for other symbols you might want to provide a fallback for.
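One consequence of this design is worth seeing in a small case: any character missing from the table is silently replaced by the undefined string, which is empty above. The sketch below uses a shortened three-entry stand-in for the full table, and the names FALLBACK and to_ascii_sketch are hypothetical.

```ruby
# encoding: utf-8
# Shortened stand-in for the full fallback table (illustration only).
FALLBACK = { 'á' => 'a', 'é' => 'e', 'ó' => 'o' }.freeze

# Same technique as convert_to_ascii, with the replacement for
# unmapped characters exposed as a parameter.
def to_ascii_sketch(s, undefined = '')
  s.encode('ASCII', fallback: ->(c) { FALLBACK.fetch(c, undefined) })
end

to_ascii_sketch('Acsády')      # "Acsady": á is in the table
to_ascii_sketch('Łódź')        # "od": Ł and ź are not in the table and are dropped
to_ascii_sketch('Łódź', '?')   # "?od?": unmapped characters made visible
```

Passing a visible placeholder such as '?' while developing makes gaps in the table easier to spot than silent deletion.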

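Answer 1 (score: 0)

I think what you are looking for is similar to this question. If so, you can use a port of Text::Unidecode written for Ruby, for example this gem (or this fork of it, which looks like it is ready for use with 1.9).

Answer 2 (score: 0)

The following code works for a variety of European languages, including Greek, which is hard to get right and which the previous answers do not handle.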

# Code generated by code at https://stackoverflow.com/a/68338690/1142217
# See notes there on how to add characters to the list.
def remove_accents(s)
  return s.unicode_normalize(:nfc).tr("ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåæçèéêëìíîïñòóôõöøùúûüýÿΆΈΊΌΐάέήίΰϊϋόύώỏἀἁἂἃἄἅἆἈἉἊἌἍἎἐἑἒἓἔἕἘἙἜἝἠἡἢἣἤἥἦἧἨἩἫἬἭἮἯἰἱἲἳἴἵἶἷἸἹἼἽἾὀὁὂὃὄὅὈὉὊὋὌὍὐὑὓὔὕὖὗὙὝὠὡὢὣὤὥὦὧὨὩὫὬὭὮὯὰὲὴὶὸὺὼᾐᾑᾓᾔᾕᾖᾗᾠᾤᾦᾧᾰᾱᾳᾴᾶᾷᾸᾹῂῃῄῆῇῐῑῒῖῗῘῙῠῡῢῥῦῨῩῬῳῴῶῷῸ","AAAAAAÆCEEEEIIIINOOOOOOUUUUYaaaaaaæceeeeiiiinoooooouuuuyyΑΕΙΟιαεηιυιυουωoαααααααΑΑΑΑΑΑεεεεεεΕΕΕΕηηηηηηηηΗΗΗΗΗΗΗιιιιιιιιΙΙΙΙΙοοοοοοΟΟΟΟΟΟυυυυυυυΥΥωωωωωωωωΩΩΩΩΩΩΩαεηιουωηηηηηηηωωωωααααααΑΑηηηηηιιιιιΙΙυυυρυΥΥΡωωωωΟ")
end

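It was generated by the following long, slow program, which shells out to the Linux command-line utility "unicode". If you come across characters that are missing from this list, add them to the longer program and rerun it, and it will output code that handles those characters. For example, I think the list is missing some characters that occur in Czech, such as c with a wedge (caron), as well as the Latin vowels with macrons. If the accents on these new characters are not in the list below, the program will not strip them until you add the names of the new accents to names_of_accents.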

$stderr.print %q{
This program generates ruby code to strip accents from characters in Latin and Greek scripts.
Progress will be printed to stderr, the final result to stdout.
}

all_characters = %q{
         ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåæçèéêëìíîïñòóôõöøùúûüýÿ
         ΆΈΊΌΐάέήίϊόύώỏἀἁἃἄἅἈἐἑἒἔἕἘἙἜἡἢἣἤἥἦἨἩἫἬἮἰἱἲἴἵἶἸὀὁὂὃὄὅὊὍὐὑὓὔὕὖὗὝὡὢὣὤὥὧὨὩὰὲὴὶὸὺὼᾐᾗᾳᾴᾶῂῆῇῖῥῦῳῶῷῸᾤᾷἂἷ
         ὌᾖὉἧἷἂῃἌὬὉἷὉἷῃὦἌἠἳᾔἉᾦἠἳᾔὠᾓὫἝὈἭἼϋὯῴἆῒῄΰῢἆὙὮᾧὮᾕὋἍἹῬἽᾕἓἯἾᾠἎῗἾῗἯἊὭἍᾑᾰῐῠᾱῑῡᾸῘῨᾹῙῩ
}.gsub(/\s/,'')
# The first line is a list of accented Latin characters. The second and third lines are polytonic Greek.
# The Greek on this list includes every character occurring in the Project Gutenberg editions of Homer, except for some that seem to be
# mistakes (smooth rho, phi and theta in symbol font). Duplications and characters out of order in this list have no effect at run time.
# Also includes vowels with macron and vrachy, which occur in Project Perseus texts sometimes.

# The following code shells out to the linux command-line utility called "unicode," which is installed as the debian package
# of the same name.
# Documentation: https://github.com/garabik/unicode/blob/master/README

names_of_accents = %q{
  acute grave circ and rough smooth ypogegrammeni diar with macron vrachy tilde ring above diaeresis cedilla stroke
  tonos dialytika hook perispomeni dasia varia psili oxia
}.split(/\s+/).select { |x| x.length>0}.sort.uniq
# The longer "circumflex" will first be shortened to "circ" in later code.

def char_to_name(c)
  return `unicode --string "#{c}" --format "{name}"`.downcase
end

def name_to_char(name)
   list = `unicode "#{name}" --format "{pchar}" --max 0` # returns a string of possibilities, not just exact matches
   # Usually, but not always, the unaccented character is the first on the list.
   list.chars.each { |c|
     if char_to_name(c)==name then return c end
   }
   raise "Unable to convert name #{name} to a character, list=#{list}."
end

regex = "( (#{names_of_accents.join("|")}))+"
from = ''
to = ''
all_characters.chars.sort.uniq.each { |c|
  name = char_to_name(c).gsub(/circumflex/,'circ')
  name.gsub!(/#{regex}/,'')
  without_accent = name_to_char(name)
  from = from+c.unicode_normalize(:nfc)
  to = to+without_accent.unicode_normalize(:nfc)
  $stderr.print c
}
$stderr.print "\n"
print %Q{
# Code generated by code at https://stackoverflow.com/a/68338690/1142217
# See notes there on how to add characters to the list.
def remove_accents(s)
  return s.unicode_normalize(:nfc).tr("#{from}","#{to}")
end
}
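The name-stripping step of the generator can be tried in isolation, without the "unicode" utility installed. The sketch below hard-codes a few Unicode character names as input (the generator obtains them from the command-line tool) and uses a shortened accent list; the helper name strip_accent_names is hypothetical.

```ruby
# Shortened accent-name list (the generator above uses a longer one).
names_of_accents = %w[acute grave circ and with tonos psili oxia varia dasia perispomeni]
ACCENT_REGEX = /( (#{names_of_accents.join('|')}))+/

# Hypothetical helper: reduce a Unicode character name to the name of
# its unaccented base character, as the generator loop does.
def strip_accent_names(name)
  name.downcase.gsub(/circumflex/, 'circ').gsub(ACCENT_REGEX, '')
end

strip_accent_names('LATIN SMALL LETTER A WITH ACUTE')
# => "latin small letter a"
strip_accent_names('GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA')
# => "greek small letter alpha"
```

Because the regular expression has no word boundaries, an accent name that is missing from the list leaves the rest of the name in place, and the later lookup of the base character fails; that is why new accent names have to be added to names_of_accents.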