Question

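Background

I'd like to automate the creation of Domains in JasperServer. Domains are "views" of data used for creating ad hoc reports. The column names must be presented to the user in a human-readable form.

Problem

There are over 2,000 possible pieces of data that the organization could theoretically want to include in a report. The data are sourced from names that are not human-friendly, such as: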

payperiodmatchcode labordistributioncodedesc dependentrelationship actionendoption actionendoptiondesc addresstype addresstypedesc historytype psaddresstype rolename bankaccountstatus bankaccountstatusdesc bankaccounttype bankaccounttypedesc beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass beneficiaryclass beneficiaryclassdesc benefitactioncode benefitactioncodedesc benefitagecontrol benefitagecontroldesc ageconrolagelimit ageconrolnoticeperiod

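Question

How would you automatically change such names to:

pay period match code
labor distribution code desc
dependent relationship

Ideas

Use Google's "Did you mean" engine, although I think that violates their terms of service: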

lynx -dump «url» | grep "Did you mean" | awk ...

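Languages

Any language is fine, but a text-processing language such as Perl would probably be well suited. (The column names are English-only.)

Unnecessary Perfection

The goal is not 100% perfection in breaking words apart; the following outcome is acceptable:

enrollmenteffectivedate -> Enrollment Effective Date
enrollmentenddate -> Enroll Men Tend Date
enrollmentrequirementset -> Enrollment Requirement Set

No matter what, a human will need to double-check the results and correct many of them. Whittling a set of 2,000 results down to 600 edits would be a dramatic time savings. Fixating on the few cases that have multiple possible splits (e.g., therapistname) misses the point altogether.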

Answer 1

Sometimes, brute-forcing is acceptable:

#!/usr/bin/perl

use strict; use warnings;
use File::Slurp;

my $dict_file = '/usr/share/dict/words';

my @identifiers = qw(
    payperiodmatchcode labordistributioncodedesc dependentrelationship
    actionendoption actionendoptiondesc addresstype addresstypedesc
    historytype psaddresstype rolename bankaccountstatus
    bankaccountstatusdesc bankaccounttype bankaccounttypedesc
    beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass
    beneficiaryclass beneficiaryclassdesc benefitactioncode
    benefitactioncodedesc benefitagecontrol benefitagecontroldesc
    ageconrolagelimit ageconrolnoticeperiod
);

my @mydict = qw( desc );

my $pat = join('|',
    map quotemeta,
    sort { length $b <=> length $a || $a cmp $b }
    grep { 2 < length }
    (@mydict, map { chomp; $_ } read_file $dict_file)
);

my $re = qr/$pat/;

for my $identifier ( @identifiers ) {
    my @stack;
    print "$identifier : ";
    while ( $identifier =~ s/($re)\z// ) {
        unshift @stack, $1;
    }
    # mark suspicious cases
    unshift @stack, '*', $identifier if length $identifier;
    print "@stack\n";
}

Output:

payperiodmatchcode : pay period match code
labordistributioncodedesc : labor distribution code desc
dependentrelationship : dependent relationship
actionendoption : action end option
actionendoptiondesc : action end option desc
addresstype : address type
addresstypedesc : address type desc
historytype : history type
psaddresstype : * ps address type
rolename : role name
bankaccountstatus : bank account status
bankaccountstatusdesc : bank account status desc
bankaccounttype : bank account type
bankaccounttypedesc : bank account type desc
beneficiaryamount : beneficiary amount
beneficiaryclass : beneficiary class
beneficiarypercent : beneficiary percent
benefitsubclass : benefit subclass
beneficiaryclass : beneficiary class
beneficiaryclassdesc : beneficiary class desc
benefitactioncode : benefit action code
benefitactioncodedesc : benefit action code desc
benefitagecontrol : benefit age control
benefitagecontroldesc : benefit age control desc
ageconrolagelimit : * ageconrol age limit
ageconrolnoticeperiod : * ageconrol notice period

See also "A Spellchecker Used to Be a Major Feat of Software Engineering".

Answer 2

I reduced your list to the 32 atomic terms I was concerned with and put them into a regex, longest first:

use strict;
use warnings;

my $qr 
    = qr/ \G # right after last match
          ( distribution 
          | relationship 
          | beneficiary 
          | dependent 
          | subclass 
          | account
          | benefit 
          | address 
          | control 
          | history
          | percent 
          | action 
          | amount
          | conrol 
          | option 
          | period 
          | status 
          | class 
          | labor 
          | limit 
          | match 
          | notice
          | bank
          | code 
          | desc 
          | name 
          | role 
          | type 
          | age 
          | end 
          | pay
          | ps 
          )
    /x;

while ( <DATA> ) { 
    chomp;
    print;
    print ' -> ', join( ' ', m/$qr/g ), "\n";
}

__DATA__
payperiodmatchcode
labordistributioncodedesc
dependentrelationship
actionendoption
actionendoptiondesc
addresstype
addresstypedesc
historytype
psaddresstype
rolename
bankaccountstatus
bankaccountstatusdesc
bankaccounttype
bankaccounttypedesc
beneficiaryamount
beneficiaryclass
beneficiarypercent
benefitsubclass
beneficiaryclass
beneficiaryclassdesc
benefitactioncode
benefitactioncodedesc
benefitagecontrol
benefitagecontroldesc
ageconrolagelimit
ageconrolnoticeperiod
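
One thing to watch for with a fixed term list is that scanning simply stops at the first stretch of text that no term covers, so the output can be silently truncated. As a rough sketch of the same longest-first, anchored scan in Python, the version below also reports whatever was left unmatched. The `TERMS` list is a hypothetical subset of the terms above, and `split_known_terms` and the `payrollcode` input are names made up for this illustration.

```python
import re

# Hypothetical vocabulary: a subset of the terms listed in the answer above.
TERMS = ["distribution", "period", "labor", "match", "code", "desc", "pay", "ps"]

# Longest alternatives first, so a longer term wins over a shorter one
# that happens to be its prefix.
PATTERN = re.compile("|".join(sorted(map(re.escape, TERMS), key=len, reverse=True)))


def split_known_terms(identifier):
    """Return (words, leftover); leftover is non-empty when scanning got stuck."""
    words, pos = [], 0
    while pos < len(identifier):
        m = PATTERN.match(identifier, pos)  # anchored at pos, like \G in Perl
        if m is None:
            break
        words.append(m.group())
        pos = m.end()
    return words, identifier[pos:]


for name in ["payperiodmatchcode", "labordistributioncodedesc", "payrollcode"]:
    words, leftover = split_known_terms(name)
    flag = f"  (unmatched: {leftover})" if leftover else ""
    print(name, "->", " ".join(words) + flag)
```

Any identifier that comes back with a non-empty leftover is a candidate for adding a new term to the list or for manual review.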

Answer 3

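Two things occur to me:

This isn't a task you can confidently attack programmatically, because English words don't work like that: they are often made up of other words. So is a given string "reportage" or "report age"? "Timepiece" or "time piece"?
One way to attack the problem would be to use anag, which finds anagrams. After all, "time piece" is an anagram of "timepiece"... now you just have to weed out the false positives.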
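
To make the ambiguity concrete, here is a small Python sketch that enumerates every way a string can be written as a sequence of dictionary words. The `WORDS` set is a hypothetical mini-dictionary chosen only to reproduce the two examples above, and `all_splits` is a name invented for this illustration.

```python
from functools import lru_cache

# Hypothetical mini-dictionary, just large enough to show the ambiguity.
WORDS = frozenset({"report", "age", "reportage", "time", "piece", "timepiece"})


@lru_cache(maxsize=None)
def all_splits(s):
    """Return every way to write s as a sequence of dictionary words."""
    if not s:
        return ((),)  # one way to split the empty string: no words
    results = []
    for i in range(1, len(s) + 1):
        head = s[:i]
        if head in WORDS:
            for rest in all_splits(s[i:]):
                results.append((head,) + rest)
    return tuple(results)


for s in ["reportage", "timepiece"]:
    print(s, "->", [" ".join(split) for split in all_splits(s)])
```

Any string with more than one result needs either a human decision or some scoring rule to pick between the candidates.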

Answer 4

Here is a Lua program that tries the longest matches from a dictionary:

local W={}
for w in io.lines("/usr/share/dict/words") do
    W[w]=true
end

function split(s)
    for n=#s,3,-1 do
        local w=s:sub(1,n)
        if W[w] then return w,split(s:sub(n+1)) end
    end
end

for s in io.lines() do
    print(s,"-->",split(s))
end
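
A purely greedy longest-prefix split can paint itself into a corner: if the dictionary contains "benefits", then "benefitsubclass" starts with "benefits" and the remaining "ubclass" cannot be split. The following Python sketch is a variation that falls back to a shorter prefix when the rest of the string cannot be covered. The `WORDS` set is a hypothetical stand-in for the system dictionary, and `split_backtracking` is a name made up for this example.

```python
# Hypothetical mini-dictionary; a real run would load /usr/share/dict/words.
WORDS = {"benefit", "benefits", "sub", "class", "subclass"}
MIN_LEN = 3  # same lower bound on word length as the Lua loop above


def split_backtracking(s):
    """Return a list of dictionary words covering all of s, or None."""
    if not s:
        return []
    for n in range(len(s), MIN_LEN - 1, -1):  # longest prefix first
        head = s[:n]
        if head in WORDS:
            rest = split_backtracking(s[n:])
            if rest is not None:  # otherwise fall through to a shorter prefix
                return [head] + rest
    return None


print(split_backtracking("benefitsubclass"))
print(split_backtracking("benefitxyz"))
```

A `None` result marks an identifier that no combination of dictionary words covers, which makes it easy to collect those names for manual editing.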

Answer 5

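Given that some words could be substrings of others, especially when multiple words are run together, I think simple solutions like regexes are out. I'd go with a full-on parser; my experience is with ANTLR. If you want to stick with Perl, I've had good luck using ANTLR parsers generated as Java and called through Inline::Java.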

Answer 6

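Peter Norvig has a great Python script with a word segmentation function that uses unigram/bigram statistics. You want to look at the logic of the function segment2 in ngrams.py. Details are in the chapter "Natural Language Corpus Data" of the book Beautiful Data (Segaran and Hammerbacher, 2009). http://norvig.com/ngrams/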
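
To give a feel for the statistical approach, here is a minimal unigram-only sketch in Python. It is not Norvig's code: the `COUNTS` table is a hypothetical toy corpus, the length-based penalty for unknown words is an assumption of this sketch, and the names `log_prob` and `segment` are chosen for illustration. The idea is to score every possible split by the product of its word probabilities and keep the best one.

```python
import math
from functools import lru_cache

# Hypothetical toy counts; a real run would load a large unigram count file.
COUNTS = {"enrollment": 50, "enroll": 20, "men": 80, "tend": 30,
          "end": 200, "date": 150, "effective": 40}
TOTAL = sum(COUNTS.values())
MAX_WORD_LEN = 20


def log_prob(word):
    """Log-probability of a word; unknown words are penalized by length."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


@lru_cache(maxsize=None)
def segment(text):
    """Return (score, words) for the most probable split of text."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, min(len(text), MAX_WORD_LEN) + 1):
        head, tail = text[:i], text[i:]
        tail_score, tail_words = segment(tail)
        candidates.append((log_prob(head) + tail_score, (head,) + tail_words))
    return max(candidates)


print(" ".join(segment("enrollmentenddate")[1]))
```

With these toy counts, the three-word split "enrollment end date" scores higher than a four-word split made of rarer words, which is the kind of preference a plain dictionary lookup cannot express.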

How to separate words in a "sentence" with spaces?