Question

I have files with multiple columns, and I want to read the values in one particular column. I can read a column using awk '{print $column_number}'.

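Each file has a different column length: some may be 1000 entries long, others only 2, and so on. The entries themselves range from 1 digit to at most 5 digits. This is the same for all the files.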

I want to work out the ranges of the most frequently repeated values. For example, if the column is:

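then I would like 31,400 stored as the most repeated value, then 20,000 and 52,000 as the second and third most repeated values, and so on. If it makes sense, you could say I am binning the values to see which numbers recur most often. These values (most repeated, second most repeated, and so on) can be treated as multiples of 100. So basically the code should look something like this: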

for f in path-to-the-files/*
do
    while read -r i
    do
        # <do the operation to sort and store the values>
        :
    done < "$f"
done

I would appreciate your help!

Answer 1

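You appear to want to count the number of values in each range of 100 (0..99, 100..199, 200..299, and so on), and then find the ranges with the largest counts.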
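To make the grouping concrete, here is a minimal sketch of the arithmetic in plain shell: integer-divide by 100 and multiply back, so each value maps to the lower bound of its range. The function name `bucket_of` is made up for illustration, and the sketch assumes non-negative integers written without leading zeros (the shell would read a leading zero as octal).

```shell
# bucket_of is a hypothetical helper, not part of any tool.
# It prints the lower bound of the 100-wide range that contains $1.
# Assumes $1 is a non-negative integer without leading zeros.
bucket_of() {
    # Shell arithmetic divides integers, so the remainder is dropped
    echo $(( $1 / 100 * 100 ))
}

bucket_of 31456   # prints 31400
bucket_of 99      # prints 0
```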

You could probably do this in awk (and certainly in Python), but I'm going to use Perl.

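I'm going to hard-code the column number into the program; it can be made variable (for example, an option on the command line) if needed. I chose column 3, counting from 0.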

#!/usr/bin/env perl
use strict;
use warnings;
use constant colno => 3;

my %ranges;

while (<>)
{
    my(@fields) = split /\s+/;
    my($key) = int($fields[colno] / 100);
    $ranges{$key}++;    # must match the declared %ranges hash (use strict rejects %range)
}

# The hash now contains the number of entries for each range that's present in the
# data.  Now we need to hack the data so that we can easily find the range(s) with
# the largest counts.
# Apply the Schwartzian Transform: http://en.wikipedia.org/wiki/Schwartzian_transform

my @results = map  { [$_->[0], $_->[1]]  }
              sort { $a->[1] <=> $b->[1] }
              map  { [$_, $ranges{$_}]   }
                   keys %ranges;

# And print the results
foreach my $ref (reverse @results)
{
    printf "%5d = %d\n", $ref->[0] * 100, $ref->[1];
}

For the sample data (padded out with the first three columns), the output is:

The Schwartzian Transform is deep black magic. It probably isn't necessary here, but it works. (And yes, this is the first time I've used it.)

The Perl code is fun (and probably reasonably fast), but if you don't have Perl on your machine, you need an alternative.

awk '{value = int($3/100); print value*100;}' files |
sort |
uniq -c |
sort -nr

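The awk code picks out column 3 (counting from 1, not 0!), divides the value by 100 and converts it to an integer, then prints that value multiplied by 100; this gives you the grouping you want. The remaining sort | uniq -c | sort -nr pipeline is the standard idiom for counting occurrences and ordering them so that the most frequent comes first. In practice, it is often better to leave the r out of the final sort so that the last few lines of output are the most interesting ones.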
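If you want the column number as a parameter and a separate ranking for each file (as in the loop from the question), something along these lines might work. It is only a sketch: the function names `count_ranges` and `report_each` are invented here, the counting is done inside awk with an array instead of uniq -c, and lines that lack the requested column are silently skipped, which is an assumption about what you want.

```shell
# count_ranges is a hypothetical helper. Usage: count_ranges COLUMN [FILE...]
# Prints "count range-start" lines, most frequent range first; ties are
# ordered by range start. Reads standard input when no file is given.
count_ranges() {
    col=$1
    shift
    awk -v col="$col" '
        # Skip lines that do not have the requested column
        NF >= col { count[int($col / 100) * 100]++ }
        END { for (r in count) print count[r], r }
    ' "$@" | sort -k1,1nr -k2,2n
}

# report_each is also hypothetical: one ranking per file, using column 3
# (counting from 1) as in the pipeline above.
report_each() {
    for f in "$@"; do
        printf '== %s ==\n' "$f"
        count_ranges 3 "$f"
    done
}
```

Calling `count_ranges 3 path-to-the-files/*` pools all the files into one ranking, while `report_each path-to-the-files/*` prints a ranking per file. Changing the first sort key from `-k1,1nr` to `-k1,1n` puts the most frequent ranges on the last lines instead.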
