I have a very large list of numbers (800K entries, unique and sorted). For example:
1002230091 => 1002230091 <- not a complete set of digits
...
1112223000 --
1112223001 |
1112223002 |
... | => 1112223
1112223009 |
... |
1112223999 |
... |
1112223999 --
...
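The numbers above can be grouped under a common prefix:
111222300[0..9] <-- a.k.a. a complete set of digits
Note that a prefix can itself belong to a complete set of digits; if so, it should be grouped as well.
Expected result (assuming the analysis finds that all complete sets of digits are present):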
1112223
1002230091
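The repeated grouping described above can be sketched as follows. This is only a minimal illustration, not the asker's script: the function name `collapsePrefixes` is hypothetical, the numbers are assumed to be digit strings, and a plain Map/Set is used instead of a trie. Keys are processed from the longest length to the shortest, so a prefix produced in one round is itself a candidate for grouping in the next.

```javascript
// Hypothetical helper: repeatedly replaces ten sibling keys (prefix + 0..9)
// with their common prefix until no complete set of digits remains.
function collapsePrefixes(numbers) {
  // Bucket the keys by length so that longer keys are examined first.
  const byLength = new Map();
  let maxLen = 0;
  for (const n of numbers) {
    if (!byLength.has(n.length)) byLength.set(n.length, new Set());
    byLength.get(n.length).add(n);
    if (n.length > maxLen) maxLen = n.length;
  }

  for (let len = maxLen; len > 1; len--) {
    const keys = byLength.get(len);
    if (!keys) continue;

    // Count how many distinct keys of this length share each parent prefix.
    const childCount = new Map();
    for (const k of keys) {
      const parent = k.slice(0, -1);
      childCount.set(parent, (childCount.get(parent) || 0) + 1);
    }

    // A parent with all ten children is a complete set: swap the children
    // for the parent, which is then examined in the next (shorter) round.
    for (const [parent, count] of childCount) {
      if (count !== 10) continue;
      for (let d = 0; d <= 9; d++) keys.delete(parent + d);
      if (!byLength.has(len - 1)) byLength.set(len - 1, new Set());
      byLength.get(len - 1).add(parent);
    }
  }

  const result = [];
  for (const keys of byLength.values()) {
    for (const k of keys) result.push(k);
  }
  return result.sort();
}

// Usage: one stray number plus the full block 1112223000..1112223999.
const demo = ['1002230091'];
for (let i = 0; i < 1000; i++) demo.push('1112223' + String(i).padStart(3, '0'));
console.log(collapsePrefixes(demo)); // [ '1002230091', '1112223' ]
```

Each key is visited at most once per digit position, so the work is roughly proportional to the number of keys times the key length; whether that is fast enough for 800K entries would need to be measured on the real data.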
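I tried to build a script using Tree::Trie (for faster lookups) and a plain old hash (for iterating over the keys).
The logic I put together never reaches the root prefix; it only performs a single round of grouping: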
1000 --
1001 |
1002 | => 100
... |
1009 --
1010 => 1010
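In addition, iterating over this volume of data is very slow.
I am sure there are better alternatives, both for processing this data quickly and for meeting this requirement.
I would greatly appreciate any suggestions or help with this. I am most familiar with shell or Perl scripting, but a solution in any scripting language is fine.
Here is the logic I put together. It performs one round of grouping, but not a second.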
#!/usr/bin/perl -w
use Tree::Trie;
use strict;
use Getopt::Long;
use Pod::Usage;
my %w_mk;
my $csv = "./test.csv";
my $debug = 1;
my($trie) = new Tree::Trie;
my $help = 0;
my $man = 0;
my $cycle = 1;
my $max_key_length = 1;
my $min_key_length = 1;
GetOptions("debug=i" => \$debug,
"source_file|s=s" => \$csv,
"cycle|c=i" => \$cycle,
"help|?" => \$help,
"man!" => \$man
) or pod2usage("Try '$0 --help' for more information." );
pod2usage(-verbose => 99, -section => "NAME") if $help;
pod2usage(-verbose => 2) if $man;
sub clean_ds
{
my ($key, @keys) = @_;
my $key_len = scalar @keys;
if ($key_len == 10) {
foreach my $k (@keys) {
$trie->remove($k);
}
print "\t\tRoot key $key found!!\n" if ($debug > 1);
## Add this working key as a new key
$w_mk{$key} = 2;
## remove all of the related complete keys
delete @w_mk{@keys};
print "\t\tRemoved keys: [@keys]\n\n" if ($debug > 1);
}
}
sub is_complete_key
{
my ($key) = @_;
my $len = length $key;
my (@key_list) = $trie->lookup($key, $len + 1);
my ($key_list_len) = scalar @key_list;
## When a key has been processed once,
## let's mark it that it has been processed
$w_mk{$key} = 2;
print "\t\tSearch for key: '$key'\n\t\tNo. of items found: $key_list_len\n\t\titems : [@key_list]\n" if ($debug >= 3);
# Complete DNIS found
if ($key_list_len == 10) {
#because trie lookup when prefix length is supplied returns only the suffix portion
#e.g. 1000, 1001, 1002, 1003
#when lookup('100', 4) returns 0, 1, 2, 3
#update the returned key list by prepending it with the original key
my @t_key_list = @key_list;
for my $elem (@t_key_list) {
$elem = $key.$elem;
}
clean_ds($key, @t_key_list);
return (1, @t_key_list);
}
else {
print "\t\tRoot key $key not adding!!\n\n" if ($debug > 1);
}
return (0, @key_list);
}
open (my $handle, '<', $csv) or die "Could not open file '$csv' $!";
while (my $row = <$handle>) {
chomp($row);
my $k_len = length($row);
$max_key_length = $k_len if ($k_len > $max_key_length);
$trie->add($row);
$w_mk{$row} = 1;
print "data: '$row'\n" if ($debug >= 4);
}
close ($handle);
sub group_keys
{
my ($key, $iteration) = @_;
my $value = 0;
if (exists $w_mk{$key}) {
$value = $w_mk{$key};
chomp($value);
}
while ($value >= $iteration && length $key > 1) {
chop($key); # Remove last character of the key
if (exists $w_mk{$key}) {
$value = $w_mk{$key};
chomp($value);
}
print "\t(w_key => w_value): '$key' => '$value'\n" if ($debug >= 2);
## If the working key has already been processed once,
## no need to reprocess it
if ($value < 2) {
my ($st, @w_key_list) = is_complete_key($key);
##
## if number of keys found is less than 10
## no need to continue to chop the key
## go to the next key
##
#if ($st == 0) {
last;
#}
}
}
}
sub go_through_keys
{
my ($lcycle) = @_;
print "Reduction Cycle: '$lcycle'\n\n" if ($debug >= 3);
foreach my $key (sort keys %w_mk) {
my $w_key = $key;
my $w_value = 0;
if (exists $w_mk{$w_key}) {
$w_value = $w_mk{$w_key};
chomp($w_value);
}
print "(Key => Value): '$key' => '$w_value'\n" if ($debug >= 2);
if ($debug >= 3) {
my (@keys) = $trie->lookup($key);
my $key_len = scalar @keys;
print "\t\tNo. of items found: $key_len\n\t\titems : [@keys]\n" if ($debug >= 3);
}
group_keys($w_key, $lcycle);
}
}
sub reset_key_values
{
foreach my $key (keys %w_mk) {
$w_mk{$key} = 1;
}
}
for (my $i=$min_key_length; $i < $max_key_length; $i++) {
go_through_keys($i);
# reset values for each key
#reset_key_values();
}
print "$_\n" for sort keys %w_mk;
__END__
=head1 NAME
group_dnis.pl - A script to group and reduce a list of numbers
=head1 SYNOPSIS
group_dnis.pl - A script to group and reduce a list of numbers
------------------------------
dnis(s) => common root
------------------------------
1000 --
1001 |
1002 | ==> 100
1003 |
... |
1009 --
1010 ==> 1010
group_dnis.pl [options]
Options:
-help brief help message
-man full documentation
=head1 OPTIONS
=over 4
=item B<-source_file>
Source file containing the list of numbers to be grouped.
=item B<-help>
Prints usage with some examples of how to use this script.
group_dnis.pl -s <file name>
=back
Documentation ends here.
=cut
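Answer 0 (score: 0)
Here is something linear in JavaScript (assuming list is sorted). Converting it to AWK shouldn't be too bad. I'm not sure it is completely bulletproof... you may want to debug it against real data.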
function f(list){
var i = 0, j = 9, k = 0, tempList = [list[i]];
function group(){
while (list[i + 1] && list[i].substr(0,j) == list[i + 1].substr(0,j)
&& Number(list[i].substr(j - 10)) + 1 == list[i + 1].substr(j - 10)){
tempList.push(list[i + 1]);
i++;
}
}
function isComplete(){
return Number(tempList[0].substr(j-10)) + Math.pow(10,10 - j) - 1
== tempList[tempList.length - 1].substr(j-10);
}
while (i < list.length - 2){
group();
if (isComplete()){
if (list[i + 1] && list[i].substr(0,j - 1) == list[i + 1].substr(0,j - 1)
&& Number(list[i].substr(j - 1 - 10)) + 1 == list[i + 1].substr(j - 1 - 10)){
j--;
k++;
} else {
console.log(tempList[0].substr(0,j)); // output
tempList = [list[++i]];
j = 9; k = 0;
}
} else {
console.log(tempList[0].substr(0,j + 1)) // output
for (var l = Math.pow(10,k); l < tempList.length; l++) // declare l locally instead of leaking a global
console.log(tempList[l]); // output
tempList = [list[++i]];
j = 9; k = 0;
}
}
}
Output:
// f logs its results itself and returns nothing, so call it directly
f(['1002230091','1112223000','1112223001','1112223002','1112223003'
,'1112223004','1112223005','1112223006','1112223007','1112223008'
,'1112223009']);
/*
1002230091
111222300
*/
f(['1002230091','1112223000','1112223001','1112223002','1112223003'
,'1112223004','1112223005','1112223006','1112223007','1112223008'
,'1112223009','1112223010','1112223011','1112223012']);
/*
1002230091
111222300
1112223010
1112223011
1112223012
*/
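Since the answer suggests debugging against real data, one way to do that is to check a produced grouping mechanically. The sketch below is an assumption-laden illustration, not part of the answer: `verifyGrouping` is a hypothetical name, and every input number is assumed to be a digit string of the same width (10 here). It confirms that the prefixes cover exactly the original numbers and that no ten sibling prefixes were left ungrouped.

```javascript
// Hypothetical checker: returns true when `prefixes` is a lossless and fully
// reduced grouping of `numbers`. Assumes all numbers have exactly `width` digits.
function verifyGrouping(numbers, prefixes, width) {
  const prefixSet = new Set(prefixes);
  const numberSet = new Set(numbers);

  // A prefix of length p stands for 10^(width - p) numbers, so the totals
  // must match the count of distinct input numbers.
  let expected = 0;
  for (const p of prefixSet) {
    if (p.length === 0 || p.length > width) return false;
    expected += Math.pow(10, width - p.length);
  }
  if (expected !== numberSet.size) return false;

  // Every input number must start with at least one of the prefixes.
  // Combined with the count check, this rules out overlaps and extras.
  for (const n of numberSet) {
    if (n.length !== width) return false;
    let covered = false;
    for (let len = 1; len <= width; len++) {
      if (prefixSet.has(n.slice(0, len))) { covered = true; break; }
    }
    if (!covered) return false;
  }

  // Fully reduced: no parent may have all ten of its children in the output.
  for (const p of prefixSet) {
    if (p.length < 2) continue;
    const parent = p.slice(0, -1);
    let siblings = 0;
    for (let d = 0; d <= 9; d++) {
      if (prefixSet.has(parent + d)) siblings++;
    }
    if (siblings === 10) return false;
  }
  return true;
}

// Usage: check the second sample output shown above against its input.
const input = ['1002230091'];
for (let i = 0; i <= 12; i++) input.push('11122230' + String(i).padStart(2, '0'));
const output = ['1002230091', '111222300', '1112223010', '1112223011', '1112223012'];
console.log(verifyGrouping(input, output, 10)); // true
```

The check is linear in the size of the input times the key width, so it should be cheap enough to run alongside either script while experimenting.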