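I have the following list of words in a file (words.txt), written in IPA (International Phonetic Alphabet) characters.
Below, in a separate file (sounds.txt), I have assigned a binary code to each IPA character. I want to compare the words in words.txt with one another, using the value that sounds.txt gives for each "character" (for example "b" or "ŋ" below).
I want to print the words and their numeric results to a separate file.
First example of the desired output: the output value for bʀɥi and fʀɥi would be 5, because the two binary strings for the characters "b" and "f" differ in 5 positions.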
"b":[10000100000000010000]
"f":[00100010000000000000]
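The count in this first example can be checked with a short Perl sketch. It assumes the two codes are held as plain strings of '0' and '1' characters; `hamming` is a hypothetical helper name, not something from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: count the positions at which two equal-length
# bit strings differ.
sub hamming {
    my ($x, $y) = @_;
    die "Codes differ in length\n" if length($x) != length($y);
    # String XOR: '0' ^ '0' and '1' ^ '1' give "\0", '0' ^ '1' gives "\x01".
    my $diff = $x ^ $y;
    # tr with an empty replacement list only counts the "\x01" characters.
    return $diff =~ tr/\x01//;
}

my %code = (
    b => '10000100000000010000',
    f => '00100010000000000000',
);
print "b vs f: ", hamming($code{b}, $code{f}), "\n";   # prints "b vs f: 5"
```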
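Second example: the output value for bʀɥi and plɥi would be 6, because the characters "b" and "p" differ in 1 position and the characters "ʀ" and "l" differ in 5 positions. The final value for each pair of words is the sum of the differences between the binary codes of their characters.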
"b":[10000100000000010000]
"p":[10000100000000000000]
"ʁ":[00100000000001010000]
"l":[00011000100000010000]
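The word-level sum in this second example can be sketched as follows. To keep the illustration small, each word is passed in already split into its symbols, and only the six codes needed here are copied from sounds.txt; `symbol_distance` and `word_distance` are hypothetical names.

```perl
use strict;
use warnings;
use utf8;                               # this source file contains IPA literals
binmode STDOUT, ':encoding(UTF-8)';

# Only the symbols needed for this example.
my %code = (
    'b' => '10000100000000010000',
    'p' => '10000100000000000000',
    'ʀ' => '00100000000001010000',
    'l' => '00011000100000010000',
    'ɥ' => '00010000000100010000',
    'i' => '00000000001000010001',
);

# Number of differing positions between the codes of two symbols.
sub symbol_distance {
    my ($x, $y) = @_;
    die "Unknown symbol\n" unless exists $code{$x} && exists $code{$y};
    my $diff = $code{$x} ^ $code{$y};   # "\x01" wherever the bits differ
    return $diff =~ tr/\x01//;
}

# Sum of the symbol distances; both words are array refs of symbols.
sub word_distance {
    my ($w1, $w2) = @_;
    die "Words differ in length\n" unless @$w1 == @$w2;
    my $total = 0;
    $total += symbol_distance($w1->[$_], $w2->[$_]) for 0 .. $#$w1;
    return $total;
}

my @first  = qw(b ʀ ɥ i);
my @second = qw(p l ɥ i);
print join('', @first), ' - ', join('', @second), ' : ',
      word_distance(\@first, \@second), "\n";           # ... : 6
```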
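I know that the calculation for each letter's code would look something like the following, but I don't know how to bring in the values from the sounds.txt file and then get a comparison value for two whole words. I've been reading a lot of Perl tutorials, but nothing I've seen seems similar to what I'm trying to accomplish. Any suggestions would be great.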
open(my $f1, "words.txt");
string1 [$f1]
string2 [$f1]
for (i=0,i<string.length,i++)
    if(string1[i]!=string2[i])
        sum = sum+1
words.txt:
bʀɥi
kʀwa
dʀwa
fʀwa
fʀɥi
ɡʀwɛ̃
plɥi
pʀwa
tʀɥi
sounds.txt:
"p":[10000100000000000000]
"b":[10000100000000010000]
"f":[00100010000000000000]
"v":[00100010000000010000]
"t":[10000001000000000000]
"d":[10000001000000010000]
"k":[10000000000010000000]
"g":[10000000000010010000]
"s":[00100000100000000000]
"z":[00100000100000010000]
"m":[01000100000000010000]
"n":[01000001000000010000]
"ɲ":[01000000001000010000]
"ŋ":[01000000000010010000]
"ʃ":[00100000010000000000]
"ʒ":[00100000010000010000]
"ʀ":[00100000000001010000]
"w":[00010000000000110000]
"j":[00010000001000010000]
"ɥ":[00010000000100010000]
"l":[00011000100000010000]
"a":[00000000001000011000]
"ɑ":[00000000000010011000]
"ɑ̃":[01000000000010011000]
"e":[00000000001000010010]
"ɛ":[00000000001000010100]
"ɛ̃":[01000000001000010100]
"ə":[00000000000000000000]
"i":[00000000001000010001]
"o":[00000000000000110010]
"ɔ":[00000000000000110100]
"ɔ̃":[01000000000000110100]
"œ":[00000000000100010100]
"œ̃":[01000000000100010100]
"ø":[00000000000100010010]
"u":[00000000000000110001]
"y":[00000000000100010001]
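Answer 0 (score: 1)
Store the mapping from IPA characters to binary codes in a hash. You can't simply split each word into characters and look them up in the hash, because some "characters" are not represented by a single code point in Unicode (for example, "ɛ̃" is "ɛ" followed by a combining tilde). So I replaced each known combination with its code, trying longer combinations first, and then used string XOR on the two encoded words: positions that agree become null bytes, which are deleted, and the length of what remains is the number of differing positions.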
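To illustrate why the longest symbols have to be tried first, here is a minimal sketch using three entries from sounds.txt, written with escapes so the code points are explicit. `tokenize` is a hypothetical helper, and the sketch assumes the table keys contain no regex metacharacters.

```perl
use strict;
use warnings;

my %sound = (
    "w"              => '00010000000000110000',
    "\x{25B}"        => '00000000001000010100',   # plain open e
    "\x{25B}\x{303}" => '01000000001000010100',   # open e + U+0303 COMBINING TILDE
);

# Split a word into known symbols, preferring the longest alternative.
sub tokenize {
    my ($word, $table) = @_;
    my $alternation = join '|',
        sort { length $b <=> length $a } keys %$table;
    my @symbols = $word =~ /($alternation)/g;
    return @symbols;
}

my $word    = "w\x{25B}\x{303}";
my @symbols = tokenize($word, \%sound);
printf "%d code points, %d symbols\n", length($word), scalar(@symbols);
# prints "3 code points, 2 symbols"
```

Without the length-based sort, the plain vowel could match first and leave the combining tilde behind as an unmatched code point.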
Some characters were missing from your sample, so I had to add them (ʀ and ɡ).
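One way to spot such gaps is to delete every known symbol from a word and report whatever is left. The sketch below assumes, purely for illustration, a table that lists ASCII "g" while the word uses the IPA letter U+0261; `unmapped` is a hypothetical helper and the table values are placeholders.

```perl
use strict;
use warnings;

# Return the code points of a word that no table entry covers.
sub unmapped {
    my ($word, $table) = @_;
    my $alternation = join '|',
        sort { length $b <=> length $a } keys %$table;
    (my $rest = $word) =~ s/$alternation//g;    # strip every known symbol
    return map { sprintf('U+%04X', ord $_) } split //, $rest;
}

# Only the keys matter here; the values are placeholders.
my %sound = map { $_ => 1 } ("g", "\x{280}", "w", "\x{25B}\x{303}");

my $word = "\x{261}\x{280}w\x{25B}\x{303}";     # starts with U+0261, not ASCII g
print join(' ', unmapped($word, \%sound)), "\n";   # prints "U+0261"
```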
#!/usr/bin/perl
use warnings;
use strict;
use open IO => ':encoding(UTF-8)', ':std';

# Read the word list, one word per line.
open my $WORDS, '<:encoding(UTF-8)', 'words.txt' or die $!;
chomp(my @words = <$WORDS>);

# Build the table: IPA symbol => feature string.
my %sound;
open my $SOUNDS, '<:encoding(UTF-8)', 'sounds.txt' or die $!;
while (<$SOUNDS>) {
    # Skip lines that do not look like "x":[0101...].
    my ($ipa, $features) = /"(.*?)":\[([01]+)\]/ or next;
    $sound{$ipa} = $features;
}

# Try longer symbols first so that a vowel with a combining tilde
# is matched as a whole rather than as the plain vowel.
my $chars = join '|', sort { length $b <=> length $a } keys %sound;
my $regex = qr/($chars)/;

# Replace every symbol in each word with its feature string.
my @sounds;
for my $word (@words) {
    (my $wsound = $word) =~ s/$regex/$sound{$1},/g;
    push @sounds, $wsound;
}

# Compare every pair of words.
for my $i1 (0 .. $#words - 1) {
    for my $i2 ($i1 + 1 .. $#words) {
        warn "Different length: $words[$i1] - $words[$i2]"
            if length $sounds[$i1] != length $sounds[$i2];
        # String XOR yields "\0" wherever the two strings agree.
        my $hamming = $sounds[$i1] ^ $sounds[$i2];
        $hamming =~ tr/\0//d;
        $hamming = length $hamming;
        print "$words[$i1] - $words[$i2] : $hamming\n";
    }
}
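The script above prints to standard output, while the question asks for the results in a separate file. A minimal sketch of one way to do that is to pass the output handle into the comparison loop. `print_distances` is a hypothetical helper, the output path is taken from the first command-line argument as an assumption, and three ASCII symbols stand in for already encoded words.

```perl
use strict;
use warnings;

# Print every pairwise distance to the given handle.
# $labels and $encoded are parallel array refs.
sub print_distances {
    my ($out, $labels, $encoded) = @_;
    for my $i (0 .. $#{$labels} - 1) {
        for my $j ($i + 1 .. $#{$labels}) {
            my $diff     = $encoded->[$i] ^ $encoded->[$j];
            my $distance = ($diff =~ tr/\x01//);   # count differing bits
            print {$out} "$labels->[$i] - $labels->[$j] : $distance\n";
        }
    }
}

my @labels  = qw(b f p);
my @encoded = qw(
    10000100000000010000
    00100010000000000000
    10000100000000000000
);

# Write to the file named on the command line, or to STDOUT if none is given.
my $out = \*STDOUT;
if (@ARGV) {
    open my $fh, '>:encoding(UTF-8)', $ARGV[0]
        or die "Cannot open $ARGV[0]: $!\n";
    $out = $fh;
}
print_distances($out, \@labels, \@encoded);
```

Run without arguments, this prints the three pairs with distances 5, 1 and 4; with a file name as the first argument, the same lines go to that file instead.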