Question

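Problem statement: I am processing some data files. In this data dump I have strings that contain the Unicode values of characters. The characters can be uppercase or lowercase. I need to process these strings as follows.

1. If any of the characters - , _ ) ( } { ] [ ' " are present, remove them. All of these characters appear in the string in their escaped Unicode form ($ followed by 4 hex digits).

2. All uppercase letters need to be converted to lowercase (including the various Unicode characters: 'Φ' -> 'φ', 'Ω' -> 'ω', 'Ž' -> 'ž').

3. Later I will use this final string to match against different user inputs.

Detailed description: I have strings such as Buna$002C_Texas, Zamboanga_$0028province$0029 and so on.

Here $002C, $0028 and $0029 are Unicode values, and I convert them to their character representation with the following: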

$str =~s/\$(....)/chr(hex($1))/eg;

OR

$str =~s/\$(....)/pack 'U4', $1/eg;

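Now I replace all the characters as required. Then I decode the string as UTF-8 so that I can lowercase every character, including the Unicode ones, as shown below, because lc on its own does not handle Unicode characters.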

$str =~ s/(^\-|\-$|^\_|\_$)//g;
$str =~ s/[\-\_,]/ /g;
$str =~ s/[\(\)\"\'\.]|ʻ|’|‘//g;
$str =~ s/^\s+|\s+$//g;
$str =~ s/\s+/ /g;
$str = decode('utf-8',$str);
$str = lc($str);
$str = encode('utf-8',$str);

But when Perl tries to decode the string, I get this error:

Cannot decode string with wide characters at /usr/lib64/perl5/5.8.8/x86_64-linux-thread-multi/Encode.pm line 173

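The error is expected, as explained here: http://www.perlmonks.org/?node_id=569402

So I changed my logic based on that URL. I now convert the Unicode escapes to their character representation like this: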

$str =~s/\$(..)(..)/chr(hex($1)).chr(hex($2))/eg;

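But now I do not get the character representation; I get non-printable characters instead. So how do I handle this when I do not know how many different Unicode representations there are?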

Answer 1

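You want to decode your strings before you do your transformations, preferably with a PerlIO layer such as :utf8. Because you interpolate the escaped code points before decoding, your strings may already contain multi-byte characters. Remember that Perl (seemingly) operates on code points, not bytes.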
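
To make the ordering problem concrete, here is a minimal sketch. The sample byte string is made up for illustration, and it assumes the data is read as raw bytes, as in the question. Unescaping first puts a character above 0xFF into the string, which is exactly what decode() then refuses. Splitting the escape into two bytes only yields a NUL byte plus the low byte.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: raw bytes read without any decoding layer.
# "\xC5\xBD" is the UTF-8 encoding of U+017D; $03A6 escapes U+03A6.
my $bytes = "\xC5\xBDupa\$03A6";

# Wrong order: unescape first. chr(0x3A6) is above 0xFF, so the result is
# no longer a byte string and decode() dies on it.
(my $wrong = $bytes) =~ s/\$([0-9A-Fa-f]{4})/chr hex $1/eg;
my $decoded_ok = eval { decode('UTF-8', $wrong); 1 };

# Splitting the escape into two bytes does not help either: $002C becomes
# a NUL byte followed by a comma.
(my $split = 'Buna$002C') =~ s/\$(..)(..)/chr(hex($1)).chr(hex($2))/eg;

# Right order: decode the bytes to characters, then unescape.
my $right = decode('UTF-8', $bytes);
$right =~ s/\$([0-9A-Fa-f]{4})/chr hex $1/eg;

printf "decode after unescape succeeded: %s\n", $decoded_ok ? 'yes' : 'no';
printf "length after correct order: %d\n", length $right;
```

Opening the file with a decoding layer does the decode step for you, so only the unescape remains.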

So what we will do is the following: decode, unescape, normalize, remove, case fold:

 use strict; use warnings;
 use utf8;  # this source file contains literal Unicode characters and must be saved as UTF-8
 use feature 'unicode_strings'; # we want Unicode semantics everywhere
 use Unicode::CaseFold; # provides fc; or: use feature 'fc'
 use Unicode::Normalize;

 # implicit decode via PerlIO layer; $file is assumed to hold the path to the data file
 open my $fh, "<:utf8", $file or die "Cannot open $file: $!";
 while (<$fh>) {
   chomp;

   # interpolate the escaped code points
   s/\$(\p{AHex}{4})/chr hex $1/eg;

   # normalize the representation
   $_ = NFD $_;  # or NFC or whatever you like

   # remove unwanted characters. prefer transliterations where possible,
   # as they are more efficient:
   tr/.ʻ//d;
   s/[\p{Quotation_Mark}\p{Open_Punctuation}\p{Close_Punctuation}]//g;  # I suppose you want to remove *all* quotation marks?
   tr/-_,/   /;
   s/\A\s+//;
   s/\s+\z//;
   s/\s+/ /g;

   # finally normalize case
   $_ = fc $_;

   # store $_ somewhere.
 }
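
As a rough check, the same steps can be wrapped in a function and run against the sample strings from the question. This is only a sketch: the name normalize_name is made up, it assumes the input is already decoded, it uses the core fc feature instead of Unicode::CaseFold, and it recomposes with NFC at the end.

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);   # fc as a core feature (an assumption)
use Unicode::Normalize qw(NFC NFD);

# normalize_name is a hypothetical helper; it expects a decoded character string.
sub normalize_name {
    my ($s) = @_;
    $s =~ s/\$(\p{AHex}{4})/chr hex $1/eg;   # unescape $XXXX code points
    $s = NFD($s);                            # decompose
    $s =~ tr/.\x{2BB}//d;                    # drop '.' and U+02BB
    $s =~ s/[\p{Quotation_Mark}\p{Open_Punctuation}\p{Close_Punctuation}]//g;
    $s =~ tr/-_,/   /;                       # separators become spaces
    $s =~ s/\A\s+//;
    $s =~ s/\s+\z//;
    $s =~ s/\s+/ /g;
    return NFC(fc $s);                       # case fold, then recompose
}

print normalize_name($_), "\n"
    for 'Buna$002C_Texas', 'Zamboanga_$0028province$0029';
```

Apply the same function to the user input before comparing, so that both sides go through identical normalization.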

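You may be interested in perluniprops, a list of all available Unicode character properties, such as Quotation_Mark, Punct (punctuation), Dash (dashes such as - and –), Open_Punctuation (brackets such as ({[〈 and quotation marks such as „), etc.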
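
A small sketch of how these properties behave in a regex. The sample string is invented for illustration. Note that U+02BB (ʻ) is a modifier letter rather than a quotation mark, which is why the code above removes it separately with tr.

```perl
use strict;
use warnings;

# Invented sample: low and high double quotes, parentheses, an en dash,
# square brackets and ASCII apostrophes.
my $s = "\x{201E}Buna\x{201C} (Texas) \x{2013} [old] 'x'";

(my $clean = $s) =~ s/[\p{Quotation_Mark}\p{Open_Punctuation}\p{Close_Punctuation}]//g;
$clean =~ s/\p{Dash}/ /g;   # the en dash becomes a space

my $hyphen_is_dash     = "-" =~ /\p{Dash}/ ? 1 : 0;
my $underscore_is_dash = "_" =~ /\p{Dash}/ ? 1 : 0;
my $okina_is_quote     = "\x{2BB}" =~ /\p{Quotation_Mark}/ ? 1 : 0;

print "hyphen is Dash: $hyphen_is_dash\n";
print "underscore is Dash: $underscore_is_dash\n";
print "U+02BB is Quotation_Mark: $okina_is_quote\n";
```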

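Why do we perform Unicode normalization? Some graphemes (visual characters) can have several distinct representations. For example, á can be represented as "a with acute" or as "a" + "combining acute". NFC tries to combine the information into one code point, whereas NFD decomposes it into multiple code points. Note that these operations change the length of the string, because length is measured in code points.

Before you output data that you decomposed, it is a good idea to recompose it.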
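
A short sketch of the two forms, using á as the example. The variable names are only illustrative.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed   = "\x{E1}";     # á as a single code point
my $decomposed = "a\x{301}";   # "a" followed by COMBINING ACUTE ACCENT

# The two strings look the same on screen but do not compare equal.
my $raw_equal = $composed eq $decomposed ? 1 : 0;

# After normalizing both to the same form they compare equal.
my $nfc_equal = NFC($composed) eq NFC($decomposed) ? 1 : 0;

# Normalization changes the length, which is counted in code points.
my $nfd_length = length NFD($composed);
my $nfc_length = length NFC($decomposed);

print "equal before normalization: $raw_equal\n";
print "equal after NFC: $nfc_equal\n";
print "NFD length: $nfd_length, NFC length: $nfc_length\n";
```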

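Why do we use case folding with fc instead of lowercasing? Two lowercase characters may be equivalent but would not compare equal, e.g. the Greek lowercase sigma: σ and ς. Case folding normalizes this. The German ß is uppercased as the two-character sequence SS. Therefore, "ß" ne (lc uc "ß"). Case folding normalizes this and transforms the ß into ss: fc("ß") eq fc(uc "ß"). (But whatever you do, you will still have fun with Turkish data.)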
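
The claims above can be checked with a few lines. This sketch assumes the core fc feature rather than Unicode::CaseFold, and writes the characters as escapes so it does not depend on the source file encoding.

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);

my $sharp_s     = "\x{DF}";    # ß
my $sigma       = "\x{3C3}";   # σ
my $final_sigma = "\x{3C2}";   # ς

# Round-tripping through uc and lc does not give back ß.
my $lc_roundtrip_equal = $sharp_s eq lc(uc $sharp_s) ? 1 : 0;

# Case folding maps both ß and SS to "ss".
my $fc_roundtrip_equal = fc($sharp_s) eq fc(uc $sharp_s) ? 1 : 0;

# lc leaves the two sigmas distinct, fc maps them to the same character.
my $lc_sigma_equal = lc($sigma) eq lc($final_sigma) ? 1 : 0;
my $fc_sigma_equal = fc($sigma) eq fc($final_sigma) ? 1 : 0;

print "lc(uc) round trip equal: $lc_roundtrip_equal\n";
print "fc round trip equal: $fc_roundtrip_equal\n";
print "sigmas equal under lc: $lc_sigma_equal, under fc: $fc_sigma_equal\n";
```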

Convert a string with Unicode characters to lowercase
