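I would like to automate the creation of Domains in JasperServer. Domains are a "view" of data for creating ad hoc reports. The names of the columns must be presented to the user in a human-readable fashion.
There are over 2,000 possible pieces of data that the organization could theoretically include in a report. The data come from non-human-friendly names such as: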
payperiodmatchcode labordistributioncodedesc dependentrelationship actionendoption actionendoptiondesc addresstype addresstypedesc historytype psaddresstype rolename bankaccountstatus bankaccountstatusdesc bankaccounttype bankaccounttypedesc beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass beneficiaryclass beneficiaryclassdesc benefitactioncode benefitactioncodedesc benefitagecontrol benefitagecontroldesc ageconrolagelimit ageconrolnoticeperiod
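How would you automatically split such names into separate words (for example, payperiodmatchcode into "pay period match code")?
One idea is to use Google's "Did you mean" engine, although I think that violates their terms of service: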
lynx -dump «url» | grep "Did you mean" | awk ...
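Any language is fine, but a text-processing language such as Perl would probably be well suited. (The column names are English only.)
The goal is not 100% perfection in breaking the words apart; imperfect results are acceptable.
No matter what, a human will need to double-check the results and correct many of them. Whittling a set of 2,000 results down to 600 edits would be a dramatic time saving. To fixate on the cases that have multiple possible splits (e.g., therapistname) is to miss the point altogether.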
Answer 0 (score: 14)
Sometimes brute-forcing is acceptable:
#!/usr/bin/perl
use strict; use warnings;
use File::Slurp;
my $dict_file = '/usr/share/dict/words';
my @identifiers = qw(
    payperiodmatchcode labordistributioncodedesc dependentrelationship
    actionendoption actionendoptiondesc addresstype addresstypedesc
    historytype psaddresstype rolename bankaccountstatus
    bankaccountstatusdesc bankaccounttype bankaccounttypedesc
    beneficiaryamount beneficiaryclass beneficiarypercent benefitsubclass
    beneficiaryclass beneficiaryclassdesc benefitactioncode
    benefitactioncodedesc benefitagecontrol benefitagecontroldesc
    ageconrolagelimit ageconrolnoticeperiod
);
# extra terms that are not in the system dictionary
my @mydict = qw( desc );
# one alternation of every word longer than two characters, longest first
my $pat = join('|',
    map quotemeta,
    sort { length $b <=> length $a || $a cmp $b }
    grep { 2 < length }
    (@mydict, map { chomp; $_ } read_file $dict_file)
);
my $re = qr/$pat/;
for my $identifier ( @identifiers ) {
    my @stack;
    print "$identifier : ";
    # repeatedly strip a dictionary word off the end of the identifier
    while ( $identifier =~ s/($re)\z// ) {
        unshift @stack, $1;
    }
    # mark suspicious cases
    unshift @stack, '*', $identifier if length $identifier;
    print "@stack\n";
}
Output:
payperiodmatchcode : pay period match code
labordistributioncodedesc : labor distribution code desc
dependentrelationship : dependent relationship
actionendoption : action end option
actionendoptiondesc : action end option desc
addresstype : address type
addresstypedesc : address type desc
historytype : history type
psaddresstype : * ps address type
rolename : role name
bankaccountstatus : bank account status
bankaccountstatusdesc : bank account status desc
bankaccounttype : bank account type
bankaccounttypedesc : bank account type desc
beneficiaryamount : beneficiary amount
beneficiaryclass : beneficiary class
beneficiarypercent : beneficiary percent
benefitsubclass : benefit subclass
beneficiaryclass : beneficiary class
beneficiaryclassdesc : beneficiary class desc
benefitactioncode : benefit action code
benefitactioncodedesc : benefit action code desc
benefitagecontrol : benefit age control
benefitagecontroldesc : benefit age control desc
ageconrolagelimit : * ageconrol age limit
ageconrolnoticeperiod : * ageconrol notice period
See also "A Spellchecker Used to Be a Major Feat of Software Engineering".
Answer 1 (score: 1)
I whittled your list down to the 32 atomic terms I was concerned with and put them in a regex, ordered longest first:
use strict;
use warnings;
my $qr
= qr/ \G # right after last match
( distribution
| relationship
| beneficiary
| dependent
| subclass
| account
| benefit
| address
| control
| history
| percent
| action
| amount
| conrol
| option
| period
| status
| class
| labor
| limit
| match
| notice
| bank
| code
| desc
| name
| role
| type
| age
| end
| pay
| ps
)
/x;
while ( <DATA> ) {
    chomp;
    print;
    print ' -> ', join( ' ', m/$qr/g ), "\n";
}
__DATA__
payperiodmatchcode
labordistributioncodedesc
dependentrelationship
actionendoption
actionendoptiondesc
addresstype
addresstypedesc
historytype
psaddresstype
rolename
bankaccountstatus
bankaccountstatusdesc
bankaccounttype
bankaccounttypedesc
beneficiaryamount
beneficiaryclass
beneficiarypercent
benefitsubclass
beneficiaryclass
beneficiaryclassdesc
benefitactioncode
benefitactioncodedesc
benefitagecontrol
benefitagecontroldesc
ageconrolagelimit
ageconrolnoticeperiod
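One caveat with the \G-anchored loop above is that it stops quietly at the first position where no term matches, so an identifier containing an unlisted term is printed only in part. The sketch below is a rough Python rendering of the same idea that also returns the unmatched remainder so such cases can be flagged. It is an illustration rather than a port of the script: the shortened term list and the name split_terms are assumptions made for the example.

```python
import re

# A shortened term list for illustration (a subset of the 32 terms above).
# Sorted longest first, as in the answer, so that a longer term is tried
# before any shorter term that is its prefix.
TERMS = ["distribution", "beneficiary", "benefit", "account", "address",
         "labor", "bank", "code", "desc", "type", "ps"]
TERM_RE = re.compile(
    "|".join(map(re.escape, sorted(TERMS, key=len, reverse=True)))
)


def split_terms(identifier):
    """Return (words, leftover); leftover is non-empty if matching stalled."""
    words, pos = [], 0
    while pos < len(identifier):
        # Pattern.match(string, pos) only matches starting exactly at pos,
        # which plays the role of Perl's \G here.
        m = TERM_RE.match(identifier, pos)
        if m is None:
            break
        words.append(m.group())
        pos = m.end()
    return words, identifier[pos:]


print(split_terms("labordistributioncodedesc"))
print(split_terms("bankaccountstatus"))
```

With this shortened list the second call stalls at "status", which is returned as the leftover instead of being dropped.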
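Answer 2 (score: 1)
Two things occur to me:
anag
After all, "time piece" is an anagram of "timepiece" ... Now you just need to weed out the false positives.
Answer 3 (score: 1)
Here is a Lua program that tries the longest match from a dictionary: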
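-- load the dictionary into a set for fast lookups
local W={}
for w in io.lines("/usr/share/dict/words") do
  W[w]=true
end
-- return the longest dictionary prefix of s (at least 3 characters),
-- followed by the split of the remainder; return nothing if no prefix matches
function split(s)
  for n=#s,3,-1 do
    local w=s:sub(1,n)
    if W[w] then return w,split(s:sub(n+1)) end
  end
end
-- read identifiers from standard input, one per line
for s in io.lines() do
  print(s,"-->",split(s))
end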
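The program above commits to the longest prefix and never reconsiders it, so when the remainder cannot be split, the rest of the identifier is simply dropped from the output. A possible refinement, sketched here in Python, is to back off to a shorter prefix whenever the remainder has no complete split. The function name split_words and the mini-dictionary are assumptions made for the example; a real run would load /usr/share/dict/words as the Lua program does.

```python
def split_words(text, words, min_len=3):
    """Split text into dictionary words, trying the longest prefix first and
    backtracking when the remainder cannot be split.

    Returns a list of words, or None if no complete split exists.
    """
    if not text:
        return []
    for n in range(len(text), min_len - 1, -1):
        head = text[:n]
        if head in words:
            rest = split_words(text[n:], words, min_len)
            if rest is not None:
                return [head] + rest
    return None


# Hypothetical mini-dictionary, for illustration only.
WORDS = {"benefit", "benefits", "sub", "subclass", "class", "age", "control"}

# Greedy matching would take "benefits" and get stuck on "ubclass";
# backtracking falls back to "benefit" + "subclass".
print(split_words("benefitsubclass", WORDS))
```

Without memoization the search can take exponential time in the worst case, which should not matter for identifiers of this length.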
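Answer 4 (score: 0)
Given that some words may be substrings of others, especially when several words are mashed together, I think simple solutions such as regexes are out. I would go with a full parser; my experience is with ANTLR. If you want to stick with Perl, I have had good luck using ANTLR parsers generated as Java and called through Inline::Java.
Answer 5 (score: 0)
Peter Norvig has a great Python script with a word segmentation function that uses unigram/bigram statistics. You want to look at the logic of the function segment2 in ngrams.py. The details are in the chapter "Natural Language Corpus Data" of the book Beautiful Data (Segaran and Hammerbacher, 2009). http://norvig.com/ngrams/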
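To give a feel for the statistical approach, here is a minimal unigram-only sketch in Python. It is not Norvig's code and does not reproduce segment2: the word counts are made-up toy values, the unknown-word penalty is an assumed heuristic, and the names word_logprob and segment are chosen for this example only.

```python
import math
from functools import lru_cache

# Hypothetical toy counts; a real run would load a large unigram table.
COUNTS = {
    "pay": 50, "period": 40, "match": 30, "code": 60,
    "bank": 45, "account": 55, "status": 35, "type": 70,
    "address": 50, "desc": 20, "role": 25, "name": 80,
}
TOTAL = sum(COUNTS.values())


def word_logprob(word):
    """Log-probability of a word under the toy unigram model."""
    if word in COUNTS:
        return math.log(COUNTS[word] / TOTAL)
    # Assumed heuristic: an unknown word is treated as rare, and every
    # character in it makes it ten times less likely.
    return math.log(1.0 / TOTAL) - len(word) * math.log(10)


@lru_cache(maxsize=None)
def segment(text):
    """Return (log-probability, words) for the most probable split of text."""
    if not text:
        return 0.0, ()
    candidates = []
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        tail_score, tail_words = segment(tail)
        candidates.append((word_logprob(head) + tail_score, (head,) + tail_words))
    return max(candidates)


print(" ".join(segment("bankaccountstatusdesc")[1]))
```

With these toy counts the call prints "bank account status desc". Unknown fragments such as the "ps" in psaddresstype are kept as a single chunk rather than being broken into letters, because every additional unknown piece pays the rare-word penalty again.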