Question

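I have a website I want to process, for example http://www.ru.wikipedia.org/wiki/perl. The site is in Russian, and I want to extract all the Russian words. Matching with \w+ does not work, and matching with \p{L}+ retrieves everything.

How can I do this?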

Answer 1

perl -MLWP::Simple -e 'getprint "http://ru.wikipedia.org/wiki/Perl"'
403 Forbidden <URL:http://ru.wikipedia.org/wiki/Perl>

Well, that didn't help!

Downloading a copy of the page first and reading it locally seems to work:

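use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Slurp the whole saved page (file named on the command line, or STDIN).
local $/ = undef;
my $text = decode_utf8(<>);

# Collect every run of characters from the Cyrillic block (U+0400..U+04FF).
my @words = ($text =~ /([\x{0400}-\x{04ff}]+)/g);

foreach my $word (@words) {
  print encode_utf8($word) . "\n";
}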
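The explicit decode/encode calls above can also be replaced by I/O layers. The following is a minimal sketch of that variant, not the answerer's code; the helper name cyrillic_words_from_file is made up for illustration, and it assumes the saved page is UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper: return every run of Cyrillic-block characters
# (U+0400..U+04FF) found in a UTF-8 encoded file.
sub cyrillic_words_from_file {
    my ($path) = @_;

    # The :encoding(UTF-8) layer decodes bytes to characters while reading.
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    local $/ = undef;          # slurp mode
    my $text = <$fh>;
    close $fh;

    my @words = $text =~ /([\x{0400}-\x{04FF}]+)/g;
    return @words;
}

# Only runs when a file name is given on the command line.
if (@ARGV) {
    binmode STDOUT, ':encoding(UTF-8)';   # encode characters on output
    print "$_\n" for cyrillic_words_from_file($ARGV[0]);
}
```

With the layers in place, the regex always sees decoded characters, so no per-word encode_utf8 call is needed when printing.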

Answer 2

All of these answers are overcomplicated. Use this:

$text =~/\p{cyrillic}/

BAM。
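Note that the pattern as written matches a single Cyrillic character, so it only tests whether the text contains any Cyrillic at all. To collect whole words, the property needs a quantifier, a capture group, and the /g flag. A minimal sketch, assuming $text already holds decoded characters (the helper name cyrillic_runs is made up for illustration):

```perl
use strict;
use warnings;
use utf8;   # the sample string below is written in UTF-8 in this source file

# Hypothetical helper: return every maximal run of Cyrillic-script characters.
sub cyrillic_runs {
    my ($text) = @_;
    my @runs = $text =~ /(\p{Cyrillic}+)/g;
    return @runs;
}

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for cyrillic_runs("Perl это язык программирования, version 5");
```

Because \p{Cyrillic} is a script property, it covers Cyrillic letters outside the basic block as well, but hyphenated words are still split at the hyphen.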

Answer 3

OK, then try this:

#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

my $response = $ua->get("http://ru.wikipedia.org/wiki/Perl");

die $response->status_line unless $response->is_success;

my $content = $response->decoded_content;

my @russian = $content =~ /\s([\x{0400}-\x{052F}]+)\s/g;

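# decoded_content returns characters, so encode them again on output.
binmode STDOUT, ':encoding(UTF-8)';
print map { "$_\n" } @russian;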

I believe the Cyrillic block starts at 0x0400 and the Cyrillic Supplement block ends at 0x052F, so this should pick up a lot of the words.
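One caveat about the \s...\s pattern: each match consumes the whitespace that follows the word, so the next word no longer has a leading whitespace character left to match, and words separated by a single space are picked up only every other time. Words next to punctuation or at the very start or end of the text are skipped as well. The sketch below compares that pattern with a plain run match over the same range; both function names are made up for illustration.

```perl
use strict;
use warnings;
use utf8;   # the sample string below is written in UTF-8 in this source file

# Same code point range as above: Cyrillic plus Cyrillic Supplement.
my $range = qr/[\x{0400}-\x{052F}]+/;

# Pattern from the answer: the word must sit between two whitespace characters.
sub words_between_whitespace {
    my ($text) = @_;
    my @w = $text =~ /\s($range)\s/g;
    return @w;
}

# Alternative: take every run of characters in the range, wherever it occurs.
sub words_anywhere {
    my ($text) = @_;
    my @w = $text =~ /($range)/g;
    return @w;
}

my $sample = " один два три четыре ";
printf "between whitespace: %d words\n", scalar words_between_whitespace($sample);
printf "anywhere:           %d words\n", scalar words_anywhere($sample);
```

On this sample the first function returns two of the four words and the second returns all four.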

Answer 4

Just leaving this here. To match a specific Russian word:

use utf8;
...
utf8::decode($text);
$text =~ /привет/;
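Both steps matter here: use utf8 makes Perl read the literal in the pattern as characters, and utf8::decode turns the raw input bytes into characters so the two sides can be compared. A small sketch of the difference, using an encoded string to stand in for input read from a file or socket (the variable names are illustrative only):

```perl
use strict;
use warnings;
use utf8;                        # literals in this file are UTF-8
use Encode qw(encode_utf8);

# Simulate raw input: a string of UTF-8 bytes, not characters.
my $text = encode_utf8("всем привет");

# Bytes compared against a character pattern: no match.
my $before = ($text =~ /привет/) ? 1 : 0;

# Decode in place; utf8::decode returns true on success.
utf8::decode($text) or die "input was not valid UTF-8";

# Characters compared against characters: match.
my $after = ($text =~ /привет/) ? 1 : 0;

print "before decode: $before, after decode: $after\n";
```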

How do I match Russian words in Unicode text using Perl?
