Question

我在数据库中有很多供应商，他们在数据的某些方面都有所不同。我想制作基于以前数据的数据验证规则。

示例：

A: XZ-4, XZ-23, XZ-217
B: 1276, 1899, 22711
C: 12-4, 12-75, 12

目标：如果用户为供应商B输入字符串'XZ-217'，算法应比较先前的数据并说：此字符串与供应商B之前的数据不相似。

有没有一些好方法/工具来实现这种比较？答案可能是一些通用的算法或Perl模块。

编辑：我同意，“相似性”很难界定。但是我想抓住算法，它可以分析之前的100个样本，然后将分析结果与新数据进行比较。相似性可能基于长度，字符/数字的使用，字符串创建模式，类似的开头/结尾/中间，有一些分隔符。

我觉得这不是一件容易的事，但另一方面，我觉得它有很广泛的用途。我希望，已经有了一些提示。

Answer 1

您可能想要仔细阅读： http://en.wikipedia.org/wiki/String_metric和http://search.cpan.org/dist/Text-Levenshtein/Levenshtein.pm（例如）

Answer 2

这是我的实现和测试用例的循环。基本上你给函数列出了一个好的值，它试图为它构建一个正则表达式。

输出：

A: (?^:\w{2,2}(?:\-){1}\d{1,3})
B: (?^:\d{4,5})
C: (?^:\d{2,2}(?:\-)?\d{0,2})

代码：

#!/usr/bin/env perl

use strict;
use warnings;

use List::MoreUtils qw'uniq each_arrayref';

my %examples = (
  A => [qw/ XZ-4 XZ-23 XZ-217 /],
  B => [qw/ 1276 1899 22711 /],
  C => [qw/ 12-4 12-75 12 /],
);

foreach my $example (sort keys %examples) {
  print "$example: ", gen_regex(@{ $examples{$example} }) || "Generate failed!", "\n";
}

sub gen_regex {
  my @cases = @_;

  my %exploded;

  # ex. $case may be XZ-217
  foreach my $case (@cases) {
    my @parts = 
      grep { defined and length } 
      split( /(\d+|\w+)/, $case );

    # @parts are ( XZ, -, 217 )

    foreach (@parts) {
      if (/\d/) {
        # 217 becomes ['\d' => 3]
        push @{ $exploded{$case} }, ['\d' => length];

      } elsif (/\w/) {
        #XZ becomes ['\w' => 2]
        push @{ $exploded{$case} }, ['\w' => length];

      } else {
        # - becomes ['lit' => '-']
        push @{ $exploded{$case} }, ['lit' => $_ ];

      }
    }
  }

  my $pattern = '';

  # iterate over nth element (part) of each case
  my $ea = each_arrayref(values %exploded);
  while (my @parts = $ea->()) {

    # remove undefined (i.e. optional) parts
    my @def_parts = grep { defined } @parts;

    # check that all (defined) parts are the same type
    my @part_types = uniq map {$_->[0]} @def_parts;
    if (@part_types > 1) {
      warn "Parts not aligned\n";
      return;
    }
    my $type = $part_types[0]; #same so make scalar

    # were there optional parts?
    my $required = (@parts == @def_parts);

    # keep the values of each part
    # these are either a repitition or lit strings
    my @values = sort uniq map { $_->[1] } @def_parts;

    # these are for non-literal quantifiers
    my $min = $required ? $values[0] : 0;
    my $max = $values[-1];

    # write the specific pattern for each type
    if ($type eq '\d') {
      $pattern .= '\d' . "{$min,$max}";

    } elsif ($type eq '\w') {
      $pattern .= '\w' . "{$min,$max}";

    } elsif ($type eq 'lit') {
      # quote special characters, - becomes \-
      my @uniq = map { quotemeta } uniq @values;
      # join with alternations, surround by non-capture grouup, add quantifier
      $pattern .= '(?:' . join('|', @uniq) . ')' . ($required ? '{1}' : '?');
    }
  }


  # build the qr regex from pattern
  my $regex = qr/$pattern/;
  # test that all original patterns match (@fail should be empty)
  my @fail = grep { $_ !~ $regex } @cases;

  if (@fail) {
    warn "Some cases fail for generated pattern $regex: (@fail)\n";
    return '';
  } else {
    return $regex;
  }
}

为了简化找到图案的工作，可选部件可能会在最后，但在可选部件之后不会出现任何必要部件。这可能会被克服，但可能很难。

Answer 3

乔尔和我提出了类似的想法。以下代码区分了3种类型的区域。

一个或多个非单词字符
字母数字群集
一组数字

它创建字符串的配置文件和匹配输入的正则表达式。此外，它还包含扩展现有配置文件的逻辑。最后，在任务子中，它包含一些伪逻辑，指示如何将其集成到更大的应用程序中。

use strict;
use warnings;
use List::Util qw<max min>;

sub compile_search_expr { 
    shift;
    @_ = @{ shift() } if @_ == 1;
    my $str 
        = join( '|'
              , map { join( ''
                           , grep { defined; } 
                             map  {
                                 $_ eq 'P' ? quotemeta;
                               : $_ eq 'W' ? "\\w{$_->[1],$_->[2]}"
                               : $_ eq 'D' ? "\\d{$_->[1],$_->[2]}"
                               :             undef
                               ;
                            } @$_ 
                          )
                } @_ == 1 ? @{ shift } : @_
        );
    return qr/^(?:$str)$/;
}

sub merge_profiles {
    shift;
    my ( $profile_list, $new_profile ) = @_;
    my $found = 0;
    PROFILE:
    for my $profile ( @$profile_list ) { 
        my $profile_length = @$profile;

        # it's not the same profile.
        next PROFILE unless $profile_length == @$new_profile;
        my @merged;
        for ( my $i = 0; $i < $profile_length; $i++ ) { 
            my $old = $profile->[$i];
            my $new = $new_profile->[$i];
            next PROFILE unless $old->[0] eq $new->[0];
            push( @merged
                , [ $old->[0]
                  , min( $old->[1], $new->[1] )
                  , max( $old->[2], $new->[2] ) 
                  ]);
        }
        @$profile = @merged;
        $found = 1;
        last PROFILE;
    }
    push @$profile_list, $new_profile unless $found;
    return;
}

sub compute_info_profile { 
    shift;
    my @profile_chunks
        = map { 
              /\W/ ? [ P => $_ ]
            : /\D/ ? [ W => length, length ]
            :        [ D => length, length ]
        }
        grep { length; } split /(\W+)/, shift
        ;
}

# Psuedo-Perl
sub process_input_task { 
    my ( $application, $input ) = @_;

    my $patterns = $application->get_patterns_for_current_customer;
    my $regex    = $application->compile_search_expr( $patterns );

    if    ( $input =~ /$regex/ ) {}
    elsif ( $application->approve_divergeance( $input )) {
        $application->merge_profiles( $patterns, compute_info_profile( $input ));
    }
    else { 
        $application->escalate( 
           Incident->new( issue    => INVALID_FORMAT
                        , input    => $input
                        , customer => $customer 
                        ));
    }

    return $application->process_approved_input( $input );
}

Answer 4

如果有一个Tie::StringApproxHash模块，那么它就适合这个账单。

我认为你正在寻找一种结合String::Approx的模糊逻辑功能和Tie::RegexpHash的哈希接口的东西。

前者更重要;后者可以轻松编写代码。

如何根据实际数据自动创建模式？

4 个答案: