
Question

Is there a specific library, algorithm, or technique (other than regular expressions) that you would use to convert the following lines?

"Acme Corporation Inc., John, Doe, F."
"Smith, Allen, Smith,Susan"
"Marshall, J., L., Johnson, H., Caruso, D., Jones, J."
"Stein, Harry, Joan, and Mike"

These lines should be converted into text containing the following:

Acme {TAB} Corporation
Doe {TAB} John
Smith {TAB} Allen
Smith {TAB} Susan
Marshall {TAB} J.
Johnson {TAB} H.
Caruso {TAB} D.
Jones {TAB} J.
Stein {TAB} Harry
Stein {TAB} Joan
Stein {TAB} Mike

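The original text contains only proper names and middle initials (D. or J.), apart from an occasional "and" separating siblings who share a surname, as in the last line of the original text above.

Also, would this be considered "named entity recognition", or is there some other name for this process?

Ideally, I'd like code or an algorithm for this conversion in a language such as Ruby, Python, Perl, or PHP.

Any ideas? Thanks in advance.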

Answer 1

This works for all but one of the sample entries:

#!/usr/bin/env perl
use strict;
use warnings;

my $tok = undef;
my @pairs = ();
my $looking_for = 'surname';

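# Split a line into comma-separated tokens, dropping the surrounding
# quotation marks and any leading spaces in each token.
sub parse_line_to_words($){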
    my $l = shift;
    my @words;
    my $word = '';
    my $start = 1;

    # remove trailing newlines
    chomp $l;
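    # index() with a negative position searches from the start, so test the last character directly
    if(length($l) && substr($l, -1) eq '"'){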
            # remove trailing quotation mark.
            chop $l;
    }
    foreach my $c (split//,$l){
            if($c eq '"'){
                    if($#words == -1){
                            # skip leading quotation marks
                            next;
                    }
            }

            if($c eq ','){
                    push(@words, $word);
                    $word = '';
                    $start = 1;
            } else{
                    if($start && $c eq ' '){
                            next;
                    } else{
                            $start = 0;
                    }
                    $word .= $c;
            }
    }
    if($word ne ''){
            push(@words, $word);
    }
    return @words;
}
# Return true if any token is the bare word 'and'.
sub peek_and(@){
    foreach my $word (@_){
            return 1 if $word eq 'and'
    }
    return 0;
}
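# Split a token such as 'and Mike' into the two tokens 'and' and 'Mike'.
sub split_and(@){
    my @copy;
    foreach my $word (@_){
            # only treat 'and ' as a separator when it starts the token
            if(index($word, 'and ', 0) == 0){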
                    my $i = index($word, 'and ', 0) + 4;
                    push(@copy, substr($word, 0, $i - 1));
                    push(@copy, substr($word, $i));
            } else{
                    push(@copy, $word);
            }
    }
    return @copy;
}
# Count the space characters in a token.
sub count_spaces($){
    my $w = shift;
    my $s=0;
    for(my $p = index($w, ' ', 0); $p != -1; $p=index($w, ' ', $p+1), $s++) {}
    return $s;
}
# Append a (surname, firstname) pair to the list referenced by the first argument.
sub found($$$){
    my $pairs = shift;
    push(@{$pairs}, {'surname' => shift, 'firstname' => shift});
}
while(<>){
    chomp;
    my $line = $_;
    my @words = parse_line_to_words($line);
    @words = split_and(@words);
    my $line_has_and = peek_and(@words);
    foreach my $word (@words){
            my $spaces = count_spaces($word);

            if($looking_for eq 'surname'){
                    # a single word ending in '.' (index() with -1 would match a '.' anywhere)
                    if(length($word) && substr($word, -1) eq '.' && $spaces == 0){
                            # looks like an initial to me, skip it
                    } else{
                            if($spaces > 0){
                                    # multi-word token; must be corporation name
                                    my($f, $l) = split(/ /, $word);
                                    found(\@pairs, $f, $l);
                            } else{
                                    $tok = $word;
                                    $looking_for = 'firstname';
                            }
                    }
            } elsif ($looking_for eq 'firstname'){
                    if($line_has_and){
                            # lastname, first1, ..., firstn and firstn+1
                            if($word ne 'and'){
                                    found(\@pairs, $tok, $word);
                            }
                    } else{
                            # lastname, f. or lastname, firstname
                            found(\@pairs, $tok, $word);
                            $looking_for = 'surname';
                    }
            }
    }
    $looking_for = 'surname'; # reset for new line
}

foreach my $p (@pairs){
    printf("%s\t%s\n", $p->{'surname'}, $p->{'firstname'});
}

Actual output for the given sample input

Acme    Corporation
John    Doe
Smith   Allen
Smith   Susan
Marshall        J.
Johnson H.
Caruso  D.
Jones   J.
Stein   Harry
Stein   Joan
Stein   Mike

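Discussion

I used the following heuristics:

Leading and trailing quotation marks on a line are ignored.
Each line can be tokenized into words as a series of comma-separated values.
If a word begins with space characters, those characters are ignored.
In any pair of words, the first word is the surname and the second is the first name (special cases aside).
If a word on a line begins with 'and', the whole line is handled specially: the first word is the surname and all remaining words are first names for that surname.
If a surname contains more than 0 spaces, it is a company name.
A company name is always two space-separated words, which are treated as the surname and first name respectively (any further word, such as "Inc.", is dropped).
Non-company names contain no spaces.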
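
For readers who prefer Python (one of the languages the question mentions), the heuristics above can be sketched roughly as follows. This is a loose port under stated assumptions, not a line-for-line translation of the Perl script: the names tokenize, is_initial, and pair_names are made up for illustration, and unlike the Perl it trims whitespace on both sides of each token and skips empty tokens.

```python
def tokenize(line):
    """Split one input line into comma-separated tokens."""
    line = line.strip()
    # Ignore a leading and a trailing quotation mark.
    if line.startswith('"'):
        line = line[1:]
    if line.endswith('"'):
        line = line[:-1]
    # Assumption: empty tokens (e.g. from ",,") are simply dropped.
    return [tok.strip() for tok in line.split(",") if tok.strip()]


def is_initial(token):
    """A single word ending in '.' is treated as a middle initial."""
    return " " not in token and token.endswith(".")


def pair_names(line):
    """Return a list of (surname, firstname) tuples for one line."""
    has_and = False
    words = []
    for tok in tokenize(line):
        if tok.startswith("and "):
            # "and Mike" marks a line of the form: surname, first1, ..., and firstN
            has_and = True
            words.append(tok[4:].strip())
        else:
            words.append(tok)

    pairs = []
    surname = None
    for word in words:
        if surname is None:
            if is_initial(word):
                continue  # stray middle initial, skip it
            if " " in word:
                # Multi-word token: assumed to be a company name.
                parts = word.split()
                pairs.append((parts[0], parts[1]))
            else:
                surname = word
        else:
            pairs.append((surname, word))
            if not has_and:
                surname = None  # next word starts a new pair
    return pairs


if __name__ == "__main__":
    for surname, first in pair_names('"Stein, Harry, Joan, and Mike"'):
        print(f"{surname}\t{first}")
```

Like the Perl version, this sketch still pairs "John" and "Doe" the wrong way round, because nothing in these heuristics can tell which of the two is the surname.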

In the end I used a "regular expression" only to split the company name on spaces; this could easily be replaced with a non-regex approach.
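
To illustrate that point, here is a minimal Python sketch of a regex-free split that uses only str.find and slicing, much as Perl's index and substr would. The function name first_two_words is hypothetical, and returning an empty second word for single-word input is an assumption of this sketch.

```python
def first_two_words(name):
    """Return the first two space-separated words of name, without a regex."""
    first_end = name.find(" ")
    if first_end == -1:
        # Only one word: there is no second part to return.
        return name, ""
    # Skip any extra spaces between the first and second word.
    rest = name[first_end + 1:].lstrip(" ")
    second_end = rest.find(" ")
    second = rest if second_end == -1 else rest[:second_end]
    return name[:first_end], second


if __name__ == "__main__":
    print(first_two_words("Acme Corporation Inc."))
```

In Python, name.split(" ") would do the same job without a regex; the explicit version is shown only because it maps directly onto index and substr in the Perl script.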

Even with all this, I still get "John Doe" wrong, because that name is reversed in the input. I couldn't devise a reliable way to detect it.

Splitting a list of names into "FirstName {TAB} LastName" pairs
