How can I validate Japanese numerals with a regular expression in Perl?

Asked: 2013-01-30 12:42:29

Tags: regex perl unicode

My Perl script is attached below. I test it with a Japanese equivalent of the number 1234. (I copied it from Wikipedia, so it may not be 100% correct.)

Using any of

\p{decimal number}+
\p{Number}+
\d+

the code works for the ASCII version, but for Japanese the only example I have found is this:

[0-9\x{3041}-\x{3096}\x{30a1}-\x{30fc}\x{4e00}-\x{9faf}]

What am I doing wrong in this case?

use 5.016;

use utf8;
use charnames   qw< :full >;
use feature     qw< unicode_strings >;

use Test::More tests => 2;

sub is_valid {
  my $string = shift;

  $string ~~ /^[0-9\x{3041}-\x{3096}\x{30a1}-\x{30fc}\x{4e00}-\x{9faf}]+$/u

  #/\p{decimal number}+/msx
}

ok(is_valid("1234"), "ascii");
ok(is_valid("壱弐参四"), "japanese");

1 Answer:

Answer 0 (score: 0)

Your code passes for me on v5.14.

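Since you only have ASCII in your pattern, /u isn't doing what you think it is doing. You require v5.16, but /u arrived in v5.14. Unless you are trying to use some v5.16 enhancement, that is no big deal.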

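As many people have noted, there is a semantic difference between digits and numbers. I think you want to match just a group of digits. The problem is that the UCS does not mark the characters you want to match as digits.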
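To make that last point concrete, here is a minimal sketch. The helper names `all_nd` and `all_n` are made up for illustration. It compares fullwidth digits, which Unicode classifies as decimal digits, with the kanji numerals from the question, which are ordinary CJK ideographs and so should match neither property:

```perl
use strict;
use warnings;
use utf8;    # this source contains literal non-ASCII characters

# Hypothetical helpers: true if every character has the given property.
sub all_nd { my ($s) = @_; return $s =~ /\A\p{Nd}+\z/ ? 1 : 0 }
sub all_n  { my ($s) = @_; return $s =~ /\A\p{N}+\z/  ? 1 : 0 }

my $fullwidth = "１２３４";    # U+FF11..U+FF14, General_Category Nd
my $kanji     = "壱弐参四";    # CJK ideographs, General_Category Lo

for my $s ($fullwidth, $kanji) {
    # Print the first code point instead of the string itself to avoid
    # "Wide character" warnings on an unconfigured STDOUT.
    printf "starts with U+%04X: Nd=%d N=%d\n",
        ord($s), all_nd($s), all_n($s);
}
```

That is why `\d`, `\p{Decimal_Number}`, and `\p{Number}` all fail on the Japanese test string.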

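So you have created a really broad character class to do that. I think you are stuck with that approach, but you probably don't want to keep repeating it. You can hide it in a subroutine, but you can also define additional properties. You create a specially named subroutine that returns a string listing the character ranges as hexadecimal values. Here is the example from perlunicode: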

sub InKana {
    return <<END;
3040\t309F
30A0\t30FF
END
}
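As a rough sketch of how such a property is then used: once the subroutine exists in the current package, the pattern can refer to it as `\p{InKana}`. The `is_kana` wrapper below is a hypothetical helper, not part of perlunicode:

```perl
use strict;
use warnings;
use utf8;

# User-defined property from perlunicode: each line is a
# "start<TAB>end" range of hexadecimal code points.
sub InKana {
    return <<END;
3040\t309F
30A0\t30FF
END
}

# Hypothetical helper: true if the whole string consists of kana.
sub is_kana {
    my ($string) = @_;
    return $string =~ /\A\p{InKana}+\z/ ? 1 : 0;
}

print is_kana("ひらがな") ? "kana\n" : "not kana\n";
print is_kana("漢字")     ? "kana\n" : "not kana\n";
```

The subroutine name must begin with `In` or `Is` for Perl to treat it as a property.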

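You can work out which code points you want with the Unicode::Unihan module. You can do this with code, but all it does is look in the Unihan database file that has the same name as the method. Someone who actually knows Japanese will have to adjust this to select the right characters: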

use v5.10;

use Number::Range;
use Unicode::Unihan;

my $db = Unicode::Unihan->new;
my $range = Number::Range->new;

foreach my $u ( 0 .. 0x01dfff ) {
    my $char = chr $u;
    next unless $char =~ /\p{Script: Han}/;
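    # Note: || treats a numeric value of 0 as false, so a character
    # whose value is 0 is skipped here.
    my $value =
        $db->PrimaryNumeric( $char ) ||
        $db->AccountingNumeric( $char ) ||
        $db->OtherNumeric( $char )
        ;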
    next unless defined $value;
    my $hex = sprintf "%X", $u;
    say chr($u), " (U+$hex) has numeric value: ", $value;
    $range->addrange( $u );
    }

my $sub = 
q(sub InJapaneseDigit {
    return <<'HERE';
)

.

join( "\n", 
    map { 
        join "\t", 
            map { sprintf "%X", $_ } 
            split /\.\./;  
        } 
    split /,/, $range->range 
    )

.

qq(\nHERE\n});

say $sub;

The program outputs:

㐅 (U+3405) has numeric value: 5
㒃 (U+3483) has numeric value: 2
㠪 (U+382A) has numeric value: 5
㭍 (U+3B4D) has numeric value: 7
一 (U+4E00) has numeric value: 1
七 (U+4E03) has numeric value: 7
万 (U+4E07) has numeric value: 10000
三 (U+4E09) has numeric value: 3
九 (U+4E5D) has numeric value: 9
二 (U+4E8C) has numeric value: 2
五 (U+4E94) has numeric value: 5
亖 (U+4E96) has numeric value: 4
亿 (U+4EBF) has numeric value: 100000000
什 (U+4EC0) has numeric value: 10
仟 (U+4EDF) has numeric value: 1000
仨 (U+4EE8) has numeric value: 3
伍 (U+4F0D) has numeric value: 5
佰 (U+4F70) has numeric value: 100
億 (U+5104) has numeric value: 100000000
兆 (U+5146) has numeric value: 1000000000000
兩 (U+5169) has numeric value: 2
八 (U+516B) has numeric value: 8
六 (U+516D) has numeric value: 6
十 (U+5341) has numeric value: 10
千 (U+5343) has numeric value: 1000
卄 (U+5344) has numeric value: 20
卅 (U+5345) has numeric value: 30
卌 (U+534C) has numeric value: 40
叁 (U+53C1) has numeric value: 3
参 (U+53C2) has numeric value: 3
參 (U+53C3) has numeric value: 3
叄 (U+53C4) has numeric value: 3
四 (U+56DB) has numeric value: 4
壱 (U+58F1) has numeric value: 1
壹 (U+58F9) has numeric value: 1
幺 (U+5E7A) has numeric value: 1
廾 (U+5EFE) has numeric value: 9
廿 (U+5EFF) has numeric value: 20
弌 (U+5F0C) has numeric value: 1
弍 (U+5F0D) has numeric value: 2
弎 (U+5F0E) has numeric value: 3
弐 (U+5F10) has numeric value: 2
拾 (U+62FE) has numeric value: 10
捌 (U+634C) has numeric value: 8
柒 (U+67D2) has numeric value: 7
漆 (U+6F06) has numeric value: 7
玖 (U+7396) has numeric value: 9
百 (U+767E) has numeric value: 100
肆 (U+8086) has numeric value: 4
萬 (U+842C) has numeric value: 10000
貮 (U+8CAE) has numeric value: 2
貳 (U+8CB3) has numeric value: 2
贰 (U+8D30) has numeric value: 2
阡 (U+9621) has numeric value: 1000
陆 (U+9646) has numeric value: 6
陌 (U+964C) has numeric value: 100
陸 (U+9678) has numeric value: 6

sub InJapaneseDigit {
        return <<'HERE';
3405
3483
382A
3B4D
4E00
4E03
4E07
4E09
4E5D
4E8C
4E94
4E96
4EBF    4EC0
4EDF
4EE8
4F0D
4F70
5104
5146
5169
516B
516D
5341
5343    5345
534C
53C1    53C4
56DB
58F1
58F9
5E7A
5EFE    5EFF
5F0C    5F0E
5F10
62FE
634C
67D2
6F06
7396
767E
8086
842C
8CAE
8CB3
8D30
9621
9646
964C
9678
HERE
}
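With a property like that in place, the validator from the question could look roughly like the sketch below. To keep it short, this version of `InJapaneseDigit` is abbreviated to the four code points in the test string, which is an assumption made for illustration. In practice you would paste in the full generated list, trimmed by someone who knows which characters count as Japanese digits.

```perl
use strict;
use warnings;
use utf8;

# Abbreviated for illustration: only the code points for 参, 四, 壱 and 弐.
# The full generated list would go here instead.
sub InJapaneseDigit {
    return <<'HERE';
53C2
56DB
58F1
5F10
HERE
}

# True if the string consists only of ASCII digits and/or characters
# that have the user-defined property.
sub is_valid {
    my ($string) = @_;
    return $string =~ /\A[0-9\p{InJapaneseDigit}]+\z/ ? 1 : 0;
}

print is_valid("1234")     ? "ok\n" : "not ok\n";
print is_valid("壱弐参四") ? "ok\n" : "not ok\n";
print is_valid("abc")      ? "ok\n" : "not ok\n";
```

This keeps the pattern readable and moves the list of accepted characters into one place that can be regenerated or edited by hand.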