Question

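I'm parsing some HTML pages and need to detect any Arabic characters in them. I've tried various regular expressions, but with no luck.

Does anyone know a working way to do this?

Thanks

My code is: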

#!/usr/bin/perl 
use LWP::UserAgent; 
@MyAgent::ISA = qw(LWP::UserAgent); 

# set inheritance 
$ua = LWP::UserAgent->new; 
$q = 'pastie.org/2509936';; 
$request = HTTP::Request->new('GET', $q); 
$response = $ua->request($request); 
if ($response->is_success) { 
    if ($response->content=~/[\p{Script=Arabic}]/g) { 
        print "found arabic"; 
    } else { 
        print "not found"; 
    } 
}

Answer 1

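If you are using Perl, you should be able to use the Unicode script property: /\p{Arabic}/

If that doesn't work, you will have to look up the Unicode character range for Arabic and test against something like /[\x{0600}\x{0601}...\x{06FF}]/.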
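To make the difference between the two patterns concrete, here is a minimal sketch, assuming the input is already a decoded character string. The helper names has_arabic_script and in_arabic_block are made up for illustration, and writing the range as U+0600 through U+06FF is my own reading of the elided list above (it covers the basic Arabic block only).

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string contains a character in the Arabic script.
sub has_arabic_script {
    my ($s) = @_;
    return $s =~ /\p{Arabic}/ ? 1 : 0;
}

# Hypothetical helper: true if a character falls in the basic Arabic block
# (assumed here to be the contiguous range U+0600..U+06FF).
sub in_arabic_block {
    my ($s) = @_;
    return $s =~ /[\x{0600}-\x{06FF}]/ ? 1 : 0;
}

my $salam = "\x{0633}\x{0644}\x{0627}\x{0645}";  # four Arabic letters, written as escapes
my $supp  = "\x{0750}";                          # a letter from the Arabic Supplement block

printf "%d %d\n", has_arabic_script($salam),  in_arabic_block($salam);   # prints "1 1"
printf "%d %d\n", has_arabic_script($supp),   in_arabic_block($supp);    # prints "1 0"
printf "%d %d\n", has_arabic_script("hello"), in_arabic_block("hello");  # prints "0 0"
```

The second line of output is the reason to prefer the property where it is available: a hand-written range only covers the code points you remembered to list.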

Answer 2

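EDIT (since I have apparently wandered into tchrist's area of expertise): skip $response->content, which always returns a raw byte string, and use $response->decoded_content, which applies any decoding hints taken from the response headers.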
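A small sketch of that difference, assuming the HTTP::Response module (installed alongside LWP) is available. The response is built by hand with made-up content so the example runs without touching the network; it is not the page from the question.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Response;

# Build a response by hand (illustrative body, not the real page),
# with UTF-8 bytes as they would arrive over the wire.
my $bytes = encode('UTF-8', "<p>\x{0633}\x{0644}\x{0627}\x{0645}</p>");
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=utf-8' ],
    $bytes,
);

# content returns the raw bytes; decoded_content honours the charset header.
my $raw_hit     = $response->content         =~ /\p{Arabic}/ ? 1 : 0;
my $decoded_hit = $response->decoded_content =~ /\p{Arabic}/ ? 1 : 0;

print "content: $raw_hit, decoded_content: $decoded_hit\n";
# prints "content: 0, decoded_content: 1"
```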

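The page you are downloading is UTF-8 encoded, but you are not reading it as UTF-8 (in fairness, nothing on the page itself hints at the encoding [UPDATE: the server does, however, return the header Content-Type: text/html; charset=utf-8]).

If you examine $response->content, you can see whether this is the case: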

use List::Util 'max';
my $max_ord = max map{ord}split //, $response->content;
print "max ord of response content is $max_ord\n";

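If you get a value below 256, you are reading the content as raw bytes, and your string will never match /\p{Arabic}/. You must decode the input as UTF-8 before applying the regular expression: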

use Encode;
my $content = decode('utf-8', $response->content);
# now check  $content =~ /\p{Arabic}/
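Putting the two snippets together, here is a self-contained sketch that uses only core modules. The sample string is my own stand-in for the downloaded page, encoded to UTF-8 bytes to simulate what comes over the wire.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use List::Util qw(max);

# Stand-in for the downloaded page: four Arabic letters as UTF-8 bytes.
my $bytes = encode('UTF-8', "\x{0633}\x{0644}\x{0627}\x{0645}");

# On raw bytes every ord is below 256, so the script property cannot match.
my $max_ord_bytes = max map { ord($_) } split //, $bytes;
my $bytes_match   = $bytes =~ /\p{Arabic}/ ? 1 : 0;

# After decoding, the string holds real code points and the match succeeds.
my $chars         = decode('UTF-8', $bytes);
my $max_ord_chars = max map { ord($_) } split //, $chars;
my $chars_match   = $chars =~ /\p{Arabic}/ ? 1 : 0;

printf "bytes: max ord %d, match %d\n", $max_ord_bytes, $bytes_match;
printf "chars: max ord %d, match %d\n", $max_ord_chars, $chars_match;
# prints:
#   bytes: max ord 217, match 0
#   chars: max ord 1605, match 1
```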

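Sometimes (and now I am wading outside my area of expertise) the page you load contains hints about how to decode it, and $response->content may already be correctly decoded. In that case the decode call above is unnecessary and possibly harmful. For detecting the encoding of an arbitrary string, see other SO posts.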

Answer 3

Just for the record: in .NET regular expressions, at least, you need to use \p{IsArabic}.

How do I detect Arabic characters with a Perl regular expression?
