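A way to group numbers by common prefix using a script

Time: 2015-09-06 19:38:47

Tags: python algorithm shell awk

I have a large list of numbers (about 800K entries, unique and sorted). For example: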

    1002230091         => 1002230091 <- not a complete set of digits
    ...
    1112223000   --
    1112223001     |
    1112223002     |  
    ...            |   => 1112223
    1112223009     |
    ...            |
    1112223999     |
    ...            |
    1112223999   --
    ...

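The numbers above can be grouped under a common prefix:

 111222300[0..9] <-- called a complete set of digits

Note that a prefix can itself be part of a complete set of digits; if so, it should be grouped as well.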
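
To make the definition concrete, here is a minimal Python sketch of the completeness check. The names `DIGITS` and `is_complete_set` are made up for illustration and do not appear in the script further down; the sketch assumes the numbers are held as strings in a set.

```python
DIGITS = "0123456789"

def is_complete_set(prefix, numbers):
    """Return True if all ten one-digit extensions of prefix are in numbers."""
    return all(prefix + digit in numbers for digit in DIGITS)

# Ten numbers sharing the prefix 111222300, plus one unrelated number.
numbers = {"111222300" + d for d in DIGITS} | {"1002230091"}

print(is_complete_set("111222300", numbers))  # True
print(is_complete_set("100223009", numbers))  # False
```

The same check applies to a prefix one level up: once `111222300` through `111222309` have each been found complete, `11122230` is complete too.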

Expected result (assuming the analysis finds that every complete set of digits is present):

1112223
1002230091
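
One way to sanity-check a result like this is to expand each prefix back into the full-width numbers it stands for and compare them with the input. A rough Python sketch, assuming every original number is exactly 10 digits wide; `WIDTH` and `expand` are hypothetical names used only here.

```python
from itertools import product

WIDTH = 10  # assumption: every original number has exactly this many digits

def expand(prefix, width=WIDTH):
    """Yield every width-digit number (as a string) that starts with prefix."""
    missing = width - len(prefix)
    for tail in product("0123456789", repeat=missing):
        yield prefix + "".join(tail)

covered = list(expand("1112223"))
print(len(covered), covered[0], covered[-1])
# 1000 1112223000 1112223999

print(list(expand("1002230091")))
# ['1002230091']
```

If the expanded prefixes, taken together, equal the original list, the grouping has neither lost nor invented a number.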

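I tried writing a script that uses Tree::Trie (for faster lookups) and a plain old hash (for iterating over the keys).

The logic I put together does not reach the root prefix; it performs only one round of grouping: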

1000  --
1001    |
1002    | => 100
...     |
1009  --
1010      => 1010 
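
For illustration, one way to keep grouping until the root prefix is reached is to work through the keys from the longest length to the shortest, so that a prefix created in one round is reconsidered in the next. The following is a Python sketch of that idea, not the Perl logic shown below; `collapse_prefixes` is a hypothetical name, and the input is assumed to be digit-only strings.

```python
from collections import defaultdict

DIGITS = set("0123456789")

def collapse_prefixes(numbers):
    """Repeatedly replace each complete set of ten siblings with its prefix."""
    keys = set(numbers)
    if not keys:
        return []
    longest = max(len(key) for key in keys)
    # Longest keys first: prefixes added at length n-1 are examined
    # on the next iteration, which is what gives the extra rounds.
    for length in range(longest, 1, -1):
        last_digits = defaultdict(set)
        for key in keys:
            if len(key) == length:
                last_digits[key[:-1]].add(key[-1])
        for prefix, digits in last_digits.items():
            if digits == DIGITS:
                keys -= {prefix + d for d in DIGITS}
                keys.add(prefix)
    return sorted(keys)

print(collapse_prefixes(str(n) for n in range(1000, 1011)))
# ['100', '1010']
print(collapse_prefixes(str(n) for n in range(1000, 1100)))
# ['10']
```

Each length is scanned once, so the work is roughly the number of keys times the number of digits, and no trie lookups are needed.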

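In addition, iterating over this volume of data is very slow.

I'm sure there is a better alternative that both handles this data quickly and meets this requirement.

I would greatly appreciate any suggestions or help with this. I'm most comfortable with shell or Perl scripting, but a solution in any scripting language is fine.

Here is the logic I put together. It performs one round of grouping, but it never performs a second round.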

#!/usr/bin/perl -w

use Tree::Trie;
use strict;
use Getopt::Long;
use Pod::Usage;

my %w_mk;
my $csv = "./test.csv";
my $debug = 1;
my $trie = Tree::Trie->new;
my $help = 0;
my $man  = 0;
my $cycle = 1;
my $max_key_length = 1;
my $min_key_length = 1;

GetOptions("debug=i"             => \$debug,
           "source_file|s=s"     => \$csv,
           "cycle|c=i"           => \$cycle,
           "help|?"              => \$help,
           "man!"                => \$man
           ) or pod2usage("Try '$0 --help' for more information." );

pod2usage(-verbose => 99, -section => "NAME") if $help;
pod2usage(-verbose => 2) if $man;

sub clean_ds
{
  my ($key, @keys) = @_;
  my $key_len = scalar @keys;

  if ($key_len == 10) {
    foreach my $k (@keys) {
      $trie->remove($k);
    }

    print "\t\tRoot key $key found!!\n" if ($debug > 1);

    ## Add this working key as a new key
    $w_mk{$key} = 2;

    ## remove all of the related complete keys
    delete @w_mk{@keys};

    print "\t\tRemoved keys: [@keys]\n\n" if ($debug > 1);
  }
}

sub is_complete_key
{
  my ($key) = @_;
  my $len = length $key;
  my (@key_list) = $trie->lookup($key, $len + 1);
  my ($key_list_len) = scalar @key_list;

  ## When a key has been processed once,
  ## let's mark it that it has been processed
  $w_mk{$key} = 2;

  print "\t\tSearch for key: '$key'\n\t\tNo. of items found: $key_list_len\n\t\titems : [@key_list]\n" if ($debug >= 3);

  # Complete DNIS found
  if ($key_list_len == 10) {
    #because trie lookup when prefix length is supplied returns only the suffix portion
    #e.g. 1000, 1001, 1002, 1003
    #when lookup('100', 4) returns 0, 1, 2, 3
    #update the returned key list by prepending it with the original key

    my @t_key_list =  @key_list;
    for my $elem (@t_key_list) {
      $elem = $key.$elem;
    }

    clean_ds($key, @t_key_list);

    return (1, @t_key_list);
  }
  else {
    print "\t\tRoot key $key not adding!!\n\n" if ($debug > 1);
  }

  return (0, @key_list);
}

open (my $handle, '<', $csv) or die "Could not open file '$csv' $!";

while (my $row = <$handle>) {
  chomp($row);

  my $k_len = length($row);
  $max_key_length = $k_len if ($k_len > $max_key_length);

  $trie->add($row);
  $w_mk{$row} = 1;

  print "data: '$row'\n" if ($debug >= 4);
}

close ($handle);

sub group_keys
{
  my ($key, $iteration) = @_;

  my $value = 0;
  if (exists $w_mk{$key}) {
    $value = $w_mk{$key};
    chomp($value);
  }

  while ($value >= $iteration && length $key > 1) {
    chop($key); # Remove last character of the key

    if (exists $w_mk{$key}) {
      $value = $w_mk{$key};
      chomp($value);
    }

    print "\t(w_key => w_value): '$key' => '$value'\n" if ($debug >= 2);

    ## If the working key has already been processed once,
    ## no need to reprocess it
    if ($value < 2) {
      my ($st, @w_key_list) = is_complete_key($key);

      ##
      ## if number of keys found is less than 10
      ## no need to continue to chop the key
      ## go to the next key
      ##
      #if ($st == 0) {
        last;
      #}
    }
  }
}

sub go_through_keys
{
  my ($lcycle) = @_;

  print "Reduction Cycle: '$lcycle'\n\n" if ($debug >= 3);

  foreach my $key (sort keys %w_mk) {
    my $w_key = $key;
    my $w_value = 0;

    if (exists $w_mk{$w_key}) {
      $w_value = $w_mk{$w_key};
      chomp($w_value);
    }

    print "(Key => Value): '$key' => '$w_value'\n" if ($debug >= 2);
    if ($debug >= 3) {
      my (@keys) = $trie->lookup($key);
      my $key_len = scalar @keys;
      print "\t\tNo. of items found: $key_len\n\t\titems : [@keys]\n" if ($debug >= 3);
    }

    group_keys($w_key, $lcycle);
  }
}

sub reset_key_values
{
  foreach my $key (keys %w_mk) {
    $w_mk{$key} = 1;
  }
}

for (my $i=$min_key_length; $i < $max_key_length; $i++) {
  go_through_keys($i);
  # reset values for each key
  #reset_key_values();
}
print "$_\n" for sort keys %w_mk;

__END__

=head1 NAME

  group_dnis.pl - A script to group and reduce a list of numbers

=head1 SYNOPSIS

  group_dnis.pl - A script to group and reduce a list of numbers

              ------------------------------
                 dnis(s)  => common root
              ------------------------------
                 1000   --
                 1001    |
                 1002    | ==> 100
                 1003    |
                 ...     |
                 1009   --
                 1010      ==> 1010


group_dnis.pl [options]
  Options:
    -help     brief help message
    -man      full documentation

=head1 OPTIONS

=over 4

=item B<-source_file>

  Source file containing the list of numbers to be grouped.

=item B<-help>

  Prints usage with some examples of how to use this script.

  group_dnis.pl -s <file name>

=back

  Documentation ends here.

=cut

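1 answer:

Answer 0: (score: 0)

Here is something linear in JavaScript (assuming the list is sorted). Converting it to AWK shouldn't be too bad. I'm not sure it is completely foolproof... you may want to debug it against real data.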

function f(list){         
  var i = 0, j = 9, k = 0, tempList = [list[i]];
  function group(){
    while (list[i + 1] && list[i].substr(0,j) == list[i + 1].substr(0,j) 
        && Number(list[i].substr(j - 10)) + 1 == list[i + 1].substr(j - 10)){
      tempList.push(list[i + 1]);
      i++;
    }
  }
  function isComplete(){
    return Number(tempList[0].substr(j-10)) + Math.pow(10,10 - j) - 1 
        == tempList[tempList.length - 1].substr(j-10);
  }
  while (i < list.length - 2){
    group();
    if (isComplete()){
      if (list[i + 1] && list[i].substr(0,j - 1) == list[i + 1].substr(0,j - 1) 
        && Number(list[i].substr(j - 1 - 10)) + 1 == list[i + 1].substr(j - 1 - 10)){
        j--;
        k++;
      } else {
        console.log(tempList[0].substr(0,j)); // output
        tempList = [list[++i]];
        j = 9; k = 0;
      }
    } else {
      console.log(tempList[0].substr(0,j + 1)) // output
      for (var l = Math.pow(10,k); l < tempList.length; l++)
        console.log(tempList[l]); // output
      tempList = [list[++i]];
      j = 9; k = 0;
    }
  }
}

Output:

// f logs its results itself and returns nothing, so call it directly
f(['1002230091','1112223000','1112223001','1112223002','1112223003'
  ,'1112223004','1112223005','1112223006','1112223007','1112223008'
  ,'1112223009']);
/*
  1002230091
  111222300
*/

f(['1002230091','1112223000','1112223001','1112223002','1112223003'
  ,'1112223004','1112223005','1112223006','1112223007','1112223008'
  ,'1112223009','1112223010','1112223011','1112223012']);

/*
1002230091
111222300
1112223010
1112223011
1112223012
*/
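
As a cross-check on the sample runs above, here is a single-pass sketch in Python. It is not a translation of the JavaScript; it uses a stack instead, and `collapse_sorted` is a hypothetical name. It assumes the input is unique, sorted, and made of digit strings of equal length, so that the ten siblings of any complete set arrive next to each other and in order.

```python
DIGITS = "0123456789"

def collapse_sorted(numbers):
    """Collapse complete sets of ten in one pass over a sorted list."""
    stack = []
    for number in numbers:
        stack.append(number)
        # After each push, check whether the top ten entries are exactly
        # prefix+0 .. prefix+9; if so, replace them with the prefix and
        # check again, since the new prefix may complete a higher set.
        while len(stack) >= 10:
            prefix = stack[-10][:-1]
            if not prefix or stack[-10:] != [prefix + d for d in DIGITS]:
                break
            del stack[-10:]
            stack.append(prefix)
    return stack

sample = ["1002230091"] + [str(n) for n in range(1112223000, 1112223013)]
print(collapse_sorted(sample))
# ['1002230091', '111222300', '1112223010', '1112223011', '1112223012']

full = ["1002230091"] + [str(n) for n in range(1112223000, 1112224000)]
print(collapse_sorted(full))
# ['1002230091', '1112223']
```

Every number is pushed once and each successful collapse shrinks the stack by nine entries, so the total work stays proportional to the length of the list.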