Question

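I am using the Moses toolkit for my translation system, trained on an Assamese-English parallel corpus. Some proper nouns are not translated, because my parallel corpus is very small. So I want to add a transliteration step to the translation system.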

I am using this command for translation:

echo 'কানাদা এখন বিশাল দেশ।' | ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini

This gives me the output "কানাদা is a vast country".

This is because the word "কানাদা" is not in my parallel corpus.

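So I took some parallel word lists in Assamese and English and split every word character by character. Each line in the two files therefore holds a single word, with a space between its characters (or syllables). I used these two files to train a system as a normal translation task.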

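Then I used the following command:

echo 'কানাদা এখন বিশাল দেশ।' | ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./space.pl

This gives me the output "ক া ন া দ া is a vast country".

I had to split the word because the transliteration system was trained on individual characters.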

Then I ran the transliteration system I had trained, using this command:

echo 'কানাদা এখন বিশাল দেশ।' | ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./space.pl | ~/mymoses/bin/moses -f ~/work1/train/model/moses.ini

This gives me the output "c a n a d a a vast country".

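The characters are transliterated, but the remaining problem is the spaces between the characters of the word. So I want to use a Perl script that joins them back into a word. My final command is:

echo 'কানাদা এখন বিশাল দেশ।' | ~/mymoses/bin/moses -f ~/work/mert-work/moses.ini | ./space.pl | ~/mymoses/bin/moses -f ~/work1/train/model/moses.ini | ./join.pl

Please help me write this "join.pl" file.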

Answer 1

How about:

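use utf8;                    # the source below contains UTF-8 literals
use feature 'say';           # say() is not enabled by default
binmode(STDOUT, ':utf8');    # encode the output as UTF-8
my $str = "ভাৰত is a famous country. দিল্লী is the capital of ভাৰত";
# Insert a space after each Bengali-block character that is followed by another one
$str =~ s/([\x{0980}-\x{09FF}])(?=[\x{0980}-\x{09FF}])/$1 /g;
say $str;

Output: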

ভ া ৰ ত is a famous country. দ ি ল ্ ল ী is the capital of ভ া ৰ ত

You can use it in your program by changing the while loop to:

while(<>) {
    s/([\x{0980}-\x{09FF}])(?=[\x{0980}-\x{09FF}])/$1 /g;
    print $_;
}
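
A minimal sketch of how that substitution could sit in a complete script such as the space.pl from the question. The sub name space_chars and the use of the open pragma to decode and encode UTF-8 are assumptions, not part of the original answer, and the demo runs on a literal string rather than on STDIN.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                              # this file contains UTF-8 literals
use open qw(:std :encoding(UTF-8));    # UTF-8 layers on STDIN, STDOUT and STDERR

# Put a space between adjacent code points of the Bengali block
# (U+0980..U+09FF), which covers the Assamese letters; all other
# text is returned unchanged.
sub space_chars {
    my ($line) = @_;
    $line =~ s/([\x{0980}-\x{09FF}])(?=[\x{0980}-\x{09FF}])/$1 /g;
    return $line;
}

# Demo on a literal string.
print space_chars("ভাৰত is a famous country"), "\n";
# prints: ভ া ৰ ত is a famous country
```

To run it as a filter in the pipeline, the demo line could be replaced by a while (<>) loop that prints space_chars($_) for each input line.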

But I guess what you actually want to do is:

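use utf8;
use feature 'say';
binmode(STDOUT, ':utf8');
# Map each Assamese character to its Latin transliteration
my %corresp = (
    'ভ' => 'Bh',
    'া' => 'a',
    'ৰ' => 'ra',
    'ত' => 't',
);
my $str = "ভাৰত is a famous country. দিল্লী is the capital of ভাৰত";
# Replace every character that has an entry in the hash; leave the others as they are
$str =~ s/([\x{0980}-\x{09FF}])/exists($corresp{$1}) ? $corresp{$1} : $1/eg;
say $str;

Output: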

Bharat is a famous country. দিল্লী is the capital of Bharat

Note: it is up to you to build the real correspondence hash. I don't know anything about Assamese characters.

Answer 2

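You can use \p{...} and \P{...} to match, or not match, the specific character classes listed in perluniprops.

I'm using \P{Latin}, which selects non-Latin characters, together with \s so that whitespace is not matched: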

#! /usr/bin/env perl
#
use strict;
use warnings;
use feature qw(say);

use utf8;
binmode(STDOUT, ':utf8');  # Needed because "use utf8;" only covers the source text, not output encoding

my $string = "ভাৰত is a famous country";
$string =~ s/([^\p{Latin}\s])/$1 /g;  # Put a space after all non-latin chars
say $string;

This prints:

ভ া ৰ ত  is a famous country

The only problem is the double space after ত.
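
One possible way around that double space, sketched under the assumption that a space should be added only when the next character is not already whitespace, is a lookahead for \S. The sub name space_non_latin is hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;                              # this file contains UTF-8 literals
use open qw(:std :encoding(UTF-8));    # encode the output as UTF-8

# Add a space after each character that is neither Latin nor whitespace,
# but only when another non-whitespace character follows it.
sub space_non_latin {
    my ($string) = @_;
    $string =~ s/([^\p{Latin}\s])(?=\S)/$1 /g;
    return $string;
}

print space_non_latin("ভাৰত is a famous country"), "\n";
# prints: ভ া ৰ ত is a famous country
```

Like the original pattern, this still counts digits and punctuation as non-Latin characters, so they are spaced out too whenever another character follows them directly.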

Answer 3

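It does exactly what you told it to do. @a=split('') splits the entire line; you never tell it to split only the first word. You first need to identify the substring you want to split, and then split just that: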

#!/usr/bin/perl
use utf8;
use Getopt::Std;
use IO::Handle;

binmode(STDIN,  ':utf8');
binmode(STDOUT, ':utf8');
binmode(STDERR, ':utf8');

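while(<>)
{
    chomp;
    ## find the first word, capture it and delete it from the line;
    ## fall back to an empty string so a stale $1 is never used
    my $first = s/^(\S+)\s*// ? $1 : '';
    my @a = split('', $first);
    ## Print the space-separated characters and the rest of the line
    print join(" ", @a) . " $_\n";
}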

Answer 4

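Add something like:

$str =~ s/(\w) (?=[\w.,;:!?])/$1/g;

This is meant to remove the spaces between Latin characters, using a lookahead. It is not 100% reliable: \w also matches non-Latin letters, and the space between two ordinary words is removed as well.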
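
A minimal sketch of a joining step that leaves ordinary word boundaries alone, assuming the transliterated word arrives as a run of single Latin characters separated by spaces. The sub name join_single_chars is hypothetical, and the demo runs on a literal string rather than on STDIN.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use open qw(:std :encoding(UTF-8));    # UTF-8 layers on STDIN, STDOUT and STDERR

# Merge each run of consecutive one-character Latin tokens into one word;
# longer tokens are copied through unchanged.
sub join_single_chars {
    my ($line) = @_;
    my @out;
    my $run = '';
    for my $tok (split ' ', $line) {
        if (length($tok) == 1 && $tok =~ /\p{Latin}/) {
            $run .= $tok;               # still inside a spelled-out word
            next;
        }
        push @out, $run if length $run; # a longer token ends the current run
        $run = '';
        push @out, $tok;
    }
    push @out, $run if length $run;     # flush a run at the end of the line
    return join ' ', @out;
}

print join_single_chars("c a n a d a is a vast country"), "\n";
# prints: canada is a vast country
```

A genuine one-letter word that directly follows the spelled-out word is absorbed into it: the question's output "c a n a d a a vast country" becomes "canadaa vast country". Avoiding that would need some word-boundary marker to survive the transliteration step, which this sketch does not attempt.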
