Using "sort" on utf8 strings in Perl

Asked: 2017-10-07 05:50:55

Tags: perl utf-8

I'm trying to work out how to sort an array alphabetically in Perl. Here is what I have, which works fine for English:

   # List of countries (kept like this to keep clean, as its re-used in other places)
    my $countries = {
        'AT' => "íAustria",
        'AU' => "Australia",
        'BE' => "Belgium",
        'BG' => "Bulgaria",
        'CA' => "Canada",
        'CY' => "Cyprus",
        'CZ' => "Czech Republic",
        'DK' => "Denmark",
        'EN' => "England",
        'EE' => "Estonia",
        'FI' => "Finland",
        'FR' => "France",
        'DE' => "Germany",
        'GB' => "Great Britain",
        'GR' => "Greece",
        'HU' => "Hungary",
        'IE' => "Ireland",
        'IT' => "Italy",
        'LV' => "Latvia",
        'LT' => "Lithuania",
        'LU' => "Luxembourg",
        'MT' => "Malta",
        'NZ' => "New Zealand",
        'NL' => "Netherlands",
        'PL' => "Poland",
        'PT' => "Portugal",
        'RO' => "Romania",
        'SK' => "Slovakia",
        'SI' => "Slovenia",
        'ES' => "Spain",
        'SE' => "Sweden",
        'CH' => "Switzerland",
        'SC' => "Scotland",
        'UK' => "United Kingdom",
        'US' => "USA",
        'TK' => "Turkey",
        'NO' => "Norway",
        'MX' => "Mexico",
        'IL' => "Israel",
        'IN' => "India",
        'IS' => "Iceland",
        'CN' => "China",
        'JP' => "Japan",
        'VN' => "áVietnamí"
    };
   # Populate the original loop with "name" and "code"
    my @country_loop_orig;
    print $IN->header;
    foreach (keys %{$countries}) {
      push @country_loop_orig, {
        name => $countries->{$_},
        code => $_
      }
    }

   # sort it alphabetically
   my @country_loop = sort { lc($a->{name}) cmp lc($b->{name})  } @country_loop_orig;

This works for the English version:

Australia
Austria
Belgium
Bulgaria
Canada
China
Cyprus
Czech Republic
Denmark
England
Estonia
Finland
France
Germany
Great Britain
Greece
Hungary
Iceland
India
Ireland
Israel
Italy
Japan
Latvia
Lithuania
Luxembourg
Malta
Mexico
Netherlands
New Zealand
Norway
Poland
Portugal
Romania
Scotland
Slovakia
Slovenia
Spain
Sweden
Switzerland
Turkey
United Kingdom
USA
Vietnam

...but when you try it with utf8 characters such as í and á, it doesn't work:

Australia
Belgium
Bulgaria
Canada
China
Cyprus
Czech Republic
Denmark
England
Estonia
Finland
France
Germany
Great Britain
Greece
Hungary
Iceland
India
Ireland
Israel
Italy
Japan
Latvia
Lithuania
Luxembourg
Malta
Mexico
Netherlands
New Zealand
Norway
Poland
Portugal
Romania
Scotland
Slovakia
Slovenia
Spain
Sweden
Switzerland
Turkey
United Kingdom
USA
áVietnam
íAustria

How do you achieve this? I found Sort::Naturally::XS, but couldn't get it to work.

1 Answer:

Answer 0 (score: 7)

Unicode::Collate should help.

A simple example that sorts your last list:

use warnings;
use strict;
use feature 'say';

use Unicode::Collate;

use open ":std", ":encoding(UTF-8)";

open my $fh, '<', "country_list.txt" or die "Can't open country_list.txt: $!";
my @list = <$fh>;
chomp @list;

my $uc  = Unicode::Collate->new();
my @sorted = $uc->sort(@list);

say for @sorted;

However, in some languages non-ASCII characters may have a very specific accepted position, and the question gives no details. In that case Unicode::Collate::Locale may help.
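As a rough illustration of the difference a locale can make, here is a minimal sketch that sorts the same list with the default collation and with a Swedish tailoring. The sample words and the choice of the `sv` locale are assumptions made for the example, not something taken from the question.

```perl
use strict;
use warnings;
use utf8;                              # string literals below are UTF-8 in the source
use feature 'say';
use open ':std', ':encoding(UTF-8)';   # encode output as UTF-8

use Unicode::Collate;
use Unicode::Collate::Locale;

# Illustrative sample data (hypothetical, not from the question)
my @words = ('zebra', 'öl', 'apple');

# Default Unicode collation: "ö" is treated as an accented "o"
my @default = Unicode::Collate->new->sort(@words);

# Swedish tailoring: "ö" is a separate letter that sorts after "z"
my @swedish = Unicode::Collate::Locale->new(locale => 'sv')->sort(@words);

say "default: @default";   # default: apple öl zebra
say "swedish: @swedish";   # swedish: apple zebra öl
```

Which locale is right depends on the language the list is shown in, so the locale would normally come from the same setting that selects the translated country names.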

See (study) this perl.com article, this post (by T. Christiansen), and this effectiveperler article.

If the data to sort is in a complex data structure, the cmp method is there for individual comparisons:

my @sorted = sort { $uc->cmp($a, $b) } @list;
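Applied to the question's array of hash references, this might look like the sketch below. It assumes the country names are string literals in a UTF-8 encoded source file (hence `use utf8`), and it uses a shortened version of the question's hash to keep the example small.

```perl
use strict;
use warnings;
use utf8;                              # the literals below are UTF-8 in the source
use feature 'say';
use open ':std', ':encoding(UTF-8)';   # encode output as UTF-8

use Unicode::Collate;

# Shortened version of the question's hash, with its accented test values
my $countries = {
    'AT' => "íAustria",
    'AU' => "Australia",
    'BE' => "Belgium",
    'VN' => "áVietnam",
};

# Build the list of { name, code } records, as in the question
my @country_loop_orig =
    map { +{ name => $countries->{$_}, code => $_ } } keys %$countries;

# Compare the "name" fields with the collator instead of the built-in cmp
my $uc = Unicode::Collate->new();
my @country_loop =
    sort { $uc->cmp($a->{name}, $b->{name}) } @country_loop_orig;

say "$_->{code}  $_->{name}" for @country_loop;
# AU  Australia
# VN  áVietnam
# BE  Belgium
# AT  íAustria
```

The `lc` calls from the question are not needed here: with the default settings the collator compares base letters first and only falls back to accents and case to break ties.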