Sorting the Macedonian alphabet with a Collator

Date: 2016-06-04 10:54:11

Tags: java collation

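I am trying to sort a collection of strings written in the Macedonian alphabet. I know how to do this in principle, but the result is not what I expect. Here is my test program: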

import java.text.Collator;
import java.util.LinkedList;
import java.util.List;
import java.util.Locale;

public class Main {

    // The 31 letters of the Macedonian alphabet, in the expected alphabetical order
    private static final char[] ALPHABET_ARRAY = {
        'а', 'б', 'в', 'г', 'д', 'ѓ', 'е', 'ж', 'з', 'ѕ', 'и', 'ј', 'к', 'л', 'љ', 'м', 'н', 'њ', 'о', 'п', 'р', 'с', 'т', 'ќ', 'у', 'ф', 'х', 'ц', 'ч', 'џ', 'ш' };

    public static void main(String[] args) {
        Collator collator = Collator.getInstance(new Locale("mk", "MK"));
        List<String> list = new LinkedList<>();
        for (char letter : ALPHABET_ARRAY) {
            list.add(String.valueOf(letter));
        }
        // Collator implements Comparator, so it can be passed to sort directly
        list.sort(collator);
        list.forEach(System.out::print);
    }
}

The letters in ALPHABET_ARRAY are in the correct alphabetical order, but the program prints:

абвгѓдежзѕијкќлљмнњопрстуфхцчџш

but it should be:

абвгдѓежзѕијклљмнњопрстќуфхцчџш

Is there a problem with the Macedonian collator in Java, or am I doing something wrong?

1 answer:

Answer 0 (score: 4)

The collator for the "mk_MK" locale is based on the sun.text.resources.mk.CollationData_mk resource (CollationData_mk.java source in the jdk8u repo, tagged jdk8u92-b14).

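The collation rules in CollationData_mk explicitly place 'ѓ' after 'г' and 'ќ' after 'к', listing each as a variant of that base letter rather than as a letter with its own position.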

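Since a RuleBasedCollator can be created with custom rules, the simplest way to get the desired sort order is to take the rules from CollationData_mk and modify them slightly: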

public static Collator createMacedonianCollator() throws ParseException {
    // the defaults are defined in non-public sun.util.locale.provider.CollationRules
    // they are used internally in sun.util.locale.provider.CollatorProviderImpl
    // we have no direct access to proper defaults, so we will simply comment entries which depend on them
    String DEFAULTRULES = "";
    // we will move the entries for ѓ and ќ only, leaving everything else as is
    return new RuleBasedCollator( DEFAULTRULES +
            //"& 9 < \u0482 " +       // thousand sign
            //"& Z " +                // Arabic script sorts after Z's
            "< \u0430 , \u0410" +   // a
            "< \u0431 , \u0411" +   // be
            "< \u0432 , \u0412" +   // ve
            "< \u0433 , \u0413" +   // ghe
            "; \u0491 , \u0490" +   // ghe-upturn
            "; \u0495 , \u0494" +   // ghe-mid-hook
            /*!!!moved after д/de!!!*/ //"; \u0453 , \u0403" +   // gje
            "; \u0493 , \u0492" +   // ghe-stroke
            "< \u0434 , \u0414" +   // de
            /*!!!moved AND relation strength changed!!!*/ "< \u0453 , \u0403" +   // gje
            "< \u0452 , \u0402" +   // dje
            "< \u0435 , \u0415" +   // ie
            "; \u04bd , \u04bc" +   // che
            "; \u0451 , \u0401" +   // io
            "; \u04bf , \u04be" +   // che-descender
            "< \u0454 , \u0404" +   // uk ie
            "< \u0436 , \u0416" +   // zhe
            "; \u0497 , \u0496" +   // zhe-descender
            "; \u04c2 , \u04c1" +   // zhe-breve
            "< \u0437 , \u0417" +   // ze
            "; \u0499 , \u0498" +   // zh-descender
            "< \u0455 , \u0405" +   // dze
            "< \u0438 , \u0418" +   // i
            "< \u0456 , \u0406" +   // uk/bg i
            "; \u04c0 " +           // palochka
            "< \u0457 , \u0407" +   // uk yi
            "< \u0439 , \u0419" +   // short i
            "< \u0458 , \u0408" +   // je
            "< \u043a , \u041a" +   // ka
            "; \u049f , \u049e" +   // ka-stroke
            "; \u04c4 , \u04c3" +   // ka-hook
            "; \u049d , \u049c" +   // ka-vt-stroke
            "; \u04a1 , \u04a0" +   // bashkir-ka
            /*!!!moved after т/te!!!*/ //"; \u045c , \u040c" +   // kje
            "; \u049b , \u049a" +   // ka-descender
            "< \u043b , \u041b" +   // el
            "< \u0459 , \u0409" +   // lje
            "< \u043c , \u041c" +   // em
            "< \u043d , \u041d" +   // en
            "; \u0463 " +           // yat
            "; \u04a3 , \u04a2" +   // en-descender
            "; \u04a5 , \u04a4" +   // en-ghe
            "; \u04bb , \u04ba" +   // shha
            "; \u04c8 , \u04c7" +   // en-hook
            "< \u045a , \u040a" +   // nje
            "< \u043e , \u041e" +   // o
            "; \u04a9 , \u04a8" +   // ha
            "< \u043f , \u041f" +   // pe
            "; \u04a7 , \u04a6" +   // pe-mid-hook
            "< \u0440 , \u0420" +   // er
            "< \u0441 , \u0421" +   // es
            "; \u04ab , \u04aa" +   // es-descender
            "< \u0442 , \u0422" +   // te
            "; \u04ad , \u04ac" +   // te-descender
            "< \u045b , \u040b" +   // tshe
            /*!!!moved AND relation strength changed!!!*/ "< \u045c , \u040c" +   // kje
            "< \u0443 , \u0423" +   // u
            "; \u04af , \u04ae" +   // straight u
            "< \u045e , \u040e" +   // short u
            "< \u04b1 , \u04b0" +   // straight u-stroke
            "< \u0444 , \u0424" +   // ef
            "< \u0445 , \u0425" +   // ha
            "; \u04b3 , \u04b2" +   // ha-descender
            "< \u0446 , \u0426" +   // tse
            "; \u04b5 , \u04b4" +   // te tse
            "< \u0447 , \u0427" +   // che
            "; \u04b7 ; \u04b6" +   // che-descender
            "; \u04b9 , \u04b8" +   // che-vt-stroke
            "; \u04cc , \u04cb" +   // che
            "< \u045f , \u040f" +   // dzhe
            "< \u0448 , \u0428" +   // sha
            "< \u0449 , \u0429" +   // shcha
            "< \u044a , \u042a" +   // hard sign
            "< \u044b , \u042b" +   // yeru
            "< \u044c , \u042c" +   // soft sign
            "< \u044d , \u042d" +   // e
            "< \u044e , \u042e" +   // yu
            "< \u044f , \u042f" +   // ya
            "< \u0461 , \u0460" +   // omega
            "< \u0462 " +           // yat
            "< \u0465 , \u0464" +   // iotified e
            "< \u0467 , \u0466" +   // little yus
            "< \u0469 , \u0468" +   // iotified little yus
            "< \u046b , \u046a" +   // big yus
            "< \u046d , \u046c" +   // iotified big yus
            "< \u046f , \u046e" +   // ksi
            "< \u0471 , \u0470" +   // psi
            "< \u0473 , \u0472" +   // fita
            "< \u0475 , \u0474" +   // izhitsa
            "; \u0477 , \u0476" +   // izhitsa-double-grave
            "< \u0479 , \u0478" +   // uk
            "< \u047b , \u047a" +   // round omega
            "< \u047d , \u047c" +   // omega-titlo
            "< \u047f , \u047e" +   // ot
            "< \u0481 , \u0480"     // koppa
    );
}
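
To see why the two marked entries had to change from ';' to '<' as well as move, here is a reduced sketch. The three-letter rule strings and the class name RelationStrengthDemo are made up for illustration and are not taken from the JDK. In RuleBasedCollator rules, ';' declares a secondary (accent-level) difference from the preceding entry, while '<' declares a primary difference, that is, a separate letter.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Arrays;
import java.util.List;

public class RelationStrengthDemo {

    // Illustrative three-letter rule sets, NOT the JDK's actual rules.
    // ghe ; gje < de : gje is only a secondary variant of ghe (the stock layout).
    static final String GJE_AS_VARIANT = "< \u0433 ; \u0453 < \u0434";
    // ghe < de < gje : gje is a letter of its own, placed after de (the modified layout).
    static final String GJE_AS_LETTER = "< \u0433 < \u0434 < \u0453";

    // Sorts the letters de, gje, ghe with a collator built from the given rules.
    static String sortWith(String rules) {
        try {
            List<String> letters = Arrays.asList("\u0434", "\u0453", "\u0433");
            letters.sort(new RuleBasedCollator(rules));
            return String.join("", letters);
        } catch (ParseException e) {
            throw new IllegalArgumentException("invalid collation rules", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sortWith(GJE_AS_VARIANT)); // order: ghe, gje, de
        System.out.println(sortWith(GJE_AS_LETTER));  // order: ghe, de, gje
    }
}
```

With the first rule string gje shares a primary weight with ghe, so it sorts before de. With the second it gets its own primary weight after de, which is what the modified rules above do for both gje and kje.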

The rules can be simplified further so that they contain only the 31 basic letters, without the accented variants.
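
A minimal sketch of that simplified version follows. It assumes only the 31 letters need to be ordered, and it generates the rule string from the alphabet instead of writing it out by hand. The class name SimpleMacedonianCollator and its helper methods are hypothetical, and deriving each uppercase form with Character.toUpperCase is my own shortcut rather than something taken from CollationData_mk.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.List;

public class SimpleMacedonianCollator {

    // The 31 lowercase letters in alphabetical order (a, be, ve, ghe, de, gje, ...),
    // written as Unicode escapes so the source file stays ASCII-only.
    static final String ALPHABET =
            "\u0430\u0431\u0432\u0433\u0434\u0453\u0435\u0436\u0437\u0455\u0438"
          + "\u0458\u043a\u043b\u0459\u043c\u043d\u045a\u043e\u043f\u0440\u0441"
          + "\u0442\u045c\u0443\u0444\u0445\u0446\u0447\u045f\u0448";

    // Builds rules of the form "< lower , UPPER" for every letter:
    // '<' gives each letter its own primary weight, ',' makes case a tertiary difference.
    static RuleBasedCollator create() {
        StringBuilder rules = new StringBuilder();
        for (char lower : ALPHABET.toCharArray()) {
            rules.append("< ").append(lower)
                 .append(" , ").append(Character.toUpperCase(lower))
                 .append(' ');
        }
        try {
            return new RuleBasedCollator(rules.toString());
        } catch (ParseException e) {
            throw new IllegalStateException("generated rules should always parse", e);
        }
    }

    // Splits the input into one-letter strings, sorts them, and joins them again.
    static String sortLetters(String letters) {
        List<String> list = new ArrayList<>();
        for (char c : letters.toCharArray()) {
            list.add(String.valueOf(c));
        }
        list.sort(create());
        return String.join("", list);
    }

    public static void main(String[] args) {
        String reversed = new StringBuilder(ALPHABET).reverse().toString();
        // True when sorting the reversed alphabet restores the expected order.
        System.out.println(sortLetters(reversed).equals(ALPHABET));
    }
}
```

Characters that do not appear in these rules (digits, Latin letters, punctuation, other Cyrillic letters) are not given an explicit position here, so check how they sort before using a collator like this on mixed text.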