我在数据库中有很多供应商,他们在数据的某些方面都有所不同。我想制作基于以前数据的数据验证规则。
示例:
A: XZ-4, XZ-23, XZ-217
B: 1276, 1899, 22711
C: 12-4, 12-75, 12
目标:如果用户为供应商B输入字符串'XZ-217',算法应比较先前的数据并说:此字符串与供应商B之前的数据不相似。
有没有一些好方法/工具来实现这种比较?答案可能是一些通用的算法或Perl模块。
编辑: 我同意,“相似性”很难界定。但是我想抓住算法,它可以分析之前的100个样本,然后将分析结果与新数据进行比较。相似性可能基于长度,字符/数字的使用,字符串创建模式,类似的开头/结尾/中间,有一些分隔符。
我觉得这不是一件容易的事,但另一方面,我觉得它有很广泛的用途。我希望,已经有了一些提示。
答案 0 :(得分:2)
您可能想要仔细阅读: http://en.wikipedia.org/wiki/String_metric和http://search.cpan.org/dist/Text-Levenshtein/Levenshtein.pm(例如)
答案 1 :(得分:1)
这是我的实现和测试用例的循环。基本上你给函数列出了一个好的值,它试图为它构建一个正则表达式。
输出:
A: (?^:\w{2,2}(?:\-){1}\d{1,3})
B: (?^:\d{4,5})
C: (?^:\d{2,2}(?:\-)?\d{0,2})
代码:
#!/usr/bin/env perl
use strict;
use warnings;
use List::MoreUtils qw'uniq each_arrayref';
my %examples = (
A => [qw/ XZ-4 XZ-23 XZ-217 /],
B => [qw/ 1276 1899 22711 /],
C => [qw/ 12-4 12-75 12 /],
);
foreach my $example (sort keys %examples) {
print "$example: ", gen_regex(@{ $examples{$example} }) || "Generate failed!", "\n";
}
sub gen_regex {
my @cases = @_;
my %exploded;
# ex. $case may be XZ-217
foreach my $case (@cases) {
my @parts =
grep { defined and length }
split( /(\d+|\w+)/, $case );
# @parts are ( XZ, -, 217 )
foreach (@parts) {
if (/\d/) {
# 217 becomes ['\d' => 3]
push @{ $exploded{$case} }, ['\d' => length];
} elsif (/\w/) {
#XZ becomes ['\w' => 2]
push @{ $exploded{$case} }, ['\w' => length];
} else {
# - becomes ['lit' => '-']
push @{ $exploded{$case} }, ['lit' => $_ ];
}
}
}
my $pattern = '';
# iterate over nth element (part) of each case
my $ea = each_arrayref(values %exploded);
while (my @parts = $ea->()) {
# remove undefined (i.e. optional) parts
my @def_parts = grep { defined } @parts;
# check that all (defined) parts are the same type
my @part_types = uniq map {$_->[0]} @def_parts;
if (@part_types > 1) {
warn "Parts not aligned\n";
return;
}
my $type = $part_types[0]; #same so make scalar
# were there optional parts?
my $required = (@parts == @def_parts);
# keep the values of each part
# these are either a repitition or lit strings
my @values = sort uniq map { $_->[1] } @def_parts;
# these are for non-literal quantifiers
my $min = $required ? $values[0] : 0;
my $max = $values[-1];
# write the specific pattern for each type
if ($type eq '\d') {
$pattern .= '\d' . "{$min,$max}";
} elsif ($type eq '\w') {
$pattern .= '\w' . "{$min,$max}";
} elsif ($type eq 'lit') {
# quote special characters, - becomes \-
my @uniq = map { quotemeta } uniq @values;
# join with alternations, surround by non-capture grouup, add quantifier
$pattern .= '(?:' . join('|', @uniq) . ')' . ($required ? '{1}' : '?');
}
}
# build the qr regex from pattern
my $regex = qr/$pattern/;
# test that all original patterns match (@fail should be empty)
my @fail = grep { $_ !~ $regex } @cases;
if (@fail) {
warn "Some cases fail for generated pattern $regex: (@fail)\n";
return '';
} else {
return $regex;
}
}
为了简化找到图案的工作,可选部件可能会在最后,但在可选部件之后不会出现任何必要部件。这可能会被克服,但可能很难。
答案 2 :(得分:1)
它创建字符串的配置文件和匹配输入的正则表达式。此外,它还包含扩展现有配置文件的逻辑。最后,在任务子中,它包含一些伪逻辑,指示如何将其集成到更大的应用程序中。
use strict;
use warnings;
use List::Util qw<max min>;
sub compile_search_expr {
shift;
@_ = @{ shift() } if @_ == 1;
my $str
= join( '|'
, map { join( ''
, grep { defined; }
map {
$_ eq 'P' ? quotemeta;
: $_ eq 'W' ? "\\w{$_->[1],$_->[2]}"
: $_ eq 'D' ? "\\d{$_->[1],$_->[2]}"
: undef
;
} @$_
)
} @_ == 1 ? @{ shift } : @_
);
return qr/^(?:$str)$/;
}
sub merge_profiles {
shift;
my ( $profile_list, $new_profile ) = @_;
my $found = 0;
PROFILE:
for my $profile ( @$profile_list ) {
my $profile_length = @$profile;
# it's not the same profile.
next PROFILE unless $profile_length == @$new_profile;
my @merged;
for ( my $i = 0; $i < $profile_length; $i++ ) {
my $old = $profile->[$i];
my $new = $new_profile->[$i];
next PROFILE unless $old->[0] eq $new->[0];
push( @merged
, [ $old->[0]
, min( $old->[1], $new->[1] )
, max( $old->[2], $new->[2] )
]);
}
@$profile = @merged;
$found = 1;
last PROFILE;
}
push @$profile_list, $new_profile unless $found;
return;
}
sub compute_info_profile {
shift;
my @profile_chunks
= map {
/\W/ ? [ P => $_ ]
: /\D/ ? [ W => length, length ]
: [ D => length, length ]
}
grep { length; } split /(\W+)/, shift
;
}
# Psuedo-Perl
sub process_input_task {
my ( $application, $input ) = @_;
my $patterns = $application->get_patterns_for_current_customer;
my $regex = $application->compile_search_expr( $patterns );
if ( $input =~ /$regex/ ) {}
elsif ( $application->approve_divergeance( $input )) {
$application->merge_profiles( $patterns, compute_info_profile( $input ));
}
else {
$application->escalate(
Incident->new( issue => INVALID_FORMAT
, input => $input
, customer => $customer
));
}
return $application->process_approved_input( $input );
}
答案 3 :(得分:0)
如果有一个Tie::StringApproxHash
模块,那么它就适合这个账单。
我认为你正在寻找一种结合String::Approx
的模糊逻辑功能和Tie::RegexpHash
的哈希接口的东西。
前者更重要;后者可以轻松编写代码。