Question

I have a huge .csv file (about 1.7M columns by 2 rows) that looks something like this:

Position 1 2 3 4 ... 1.6M
Coverage 1 1 1 2 ... 1

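I need to extract the 1st column, then the 1000th, the 2000th, and so on until the end of the file. I'm new to programming and the like. Is this doable with perl or awk, and if so, how? I have access to both Windows and Linux systems. Thanks in advance!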

Michael

Answer 1

Try this one-liner:

awk -v n=1000 '{printf "%s%s", $1, FS;
                for(i=n;i<=NF;i+=n)printf "%s%s", $i, (i+n>NF?RS:FS)}' file
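To try the idea on a small input before running it on the real file, here is a minimal sketch of the same loop wrapped in a shell function. The name every_nth is made up for illustration, and the sample data and the step of 3 are assumptions. Unlike the one-liner above, it prints the separator before each later field, so the output line is still terminated when a row has fewer than n fields.

```shell
# every_nth N [FILE...]: print field 1, then fields N, 2N, 3N, ... of each line.
# every_nth is a hypothetical helper name; N must be a positive integer.
every_nth() {
    n=$1
    shift
    awk -v n="$n" '{
        printf "%s", $1
        # start at field n; skip i == 1 so that n=1 does not repeat field 1
        for (i = n; i <= NF; i += n)
            if (i > 1) printf "%s%s", OFS, $i
        print ""    # terminate the output line
    }' "$@"
}

# Small sample in the shape of the question, with a step of 3:
printf 'Position 1 2 3 4 5 6 7 8 9\nCoverage 1 1 1 2 2 2 3 3 3\n' | every_nth 3
# Position 2 5 8
# Coverage 1 2 3
```

Note that the row label counts as field 1 here, so "field 3" is the value at position 2.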

Answer 2

This can be reduced to a one-liner in Perl:

perl -lane ' for (@F) { print if !($a++ % 1000) } ' yourfile.csv

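This uses the modulus operator % to check whether the zero-based column index is a multiple of 1000 (0 included), and prints the value if it is. The -a switch splits the line on whitespace. If you want to specify a delimiter, for example \t, you can do so with -F"\t".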

If loading the whole line into memory slows the program down, you can change the input record separator. In this example I set it to a space:

perl -l -0040 -ane '!(($.-1) % 1000) and print ' yourfile.csv

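This treats the space as the input record separator and reads one column at a time. The -l option chomps the "line", which removes the space, and supplies a newline for print. $. is the current record number, which here counts columns.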
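The same trick of reading one field per record can be sketched in awk by setting RS to a space. This is only a sketch under assumptions: the name stream_every_nth is made up, fields are assumed to be separated by spaces (not tabs), and the selection follows the question (field 1, then fields n, 2n, ...), which differs from the zero-based Perl version above. The counter restarts at every newline so each row is sampled independently.

```shell
# stream_every_nth N [FILE...]: read one space-separated field per record
# instead of a whole row. stream_every_nth is a hypothetical helper name.
stream_every_nth() {
    n=$1
    shift
    awk -v RS=' ' -v n="$n" '
    {
        # A record is usually one field. A record that spans a row boundary
        # contains a newline (e.g. "last\nfirst"), so split on it.
        m = split($0, part, "\n")
        for (k = 1; k <= m; k++) {
            if (k > 1 && c > 0) { print ""; c = 0 }   # newline seen: finish the row
            if (part[k] == "") continue               # nothing between separators
            c++                                       # field number within the row
            if (c == 1) printf "%s", part[k]
            else if (c % n == 0) printf " %s", part[k]
        }
    }
    END { if (c > 0) print "" }   # last row had no trailing newline
    ' "$@"
}

printf 'Position 1 2 3 4 5 6 7 8 9\nCoverage 1 1 1 2 2 2 3 3 3\n' | stream_every_nth 3
# Position 2 5 8
# Coverage 1 2 3
```

For a file of this size (a few megabytes per row) reading whole rows is unlikely to be a problem, so this is mainly useful when rows are much longer.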

Answer 3

The following awk program should do it. Here I ran it on a file with 10000 fields per line; the same works for any number of fields.

$ awk '{for(i=0;i<=NF; i+=1000){printf("%s ", $(i==0?1:i))} print "" }' file

Output:

1 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 
1 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 
1 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 
1 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 
1 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 
1 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 
1 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 
1 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 
1 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000
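The input file is not shown in this answer. The output is consistent with every line holding the numbers 1 to 10000, so a test file of that shape can be generated as sketched below. Both the helper name make_sample and the assumed file contents are guesses made for illustration.

```shell
# make_sample ROWS COLS: write ROWS lines, each holding the numbers 1..COLS
# separated by single spaces. make_sample is a hypothetical helper name.
make_sample() {
    awk -v rows="$1" -v cols="$2" 'BEGIN {
        for (r = 1; r <= rows; r++)
            for (i = 1; i <= cols; i++)
                printf "%d%s", i, (i < cols ? " " : "\n")
    }'
}

# Feed 9 generated rows of 10000 fields to the one-liner from this answer:
make_sample 9 10000 |
    awk '{for(i=0;i<=NF; i+=1000){printf("%s ", $(i==0?1:i))} print "" }'
```

Because every value equals its own column number, a wrong index shows up immediately in the output.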

Answer 4

Try this almost-readable perl:

$ cat foo.pl
use strict;
use warnings;

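my $every = 1000;                      # step between printed columns
while (my $line = <STDIN>) {
    chomp $line;
    my @columns = split(' ', $line);   # split on runs of whitespace
    print "$columns[0]";               # always print the first column
    # Perl arrays are zero-based, so $columns[$every] is column $every + 1
    my $i = $every;
    while ($i < @columns) {
        print " $columns[$i]";
        $i += $every;
    }
    print "\n";
}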

$ perl foo.pl < input.csv

Answer 5

I would try:

cat > ex.txt
1 2 3 4 5 6 7 8 9 10 11 12 13
1 1 1 1 1 1 1 2 1 2  1  1  3

and as a one-liner:

perl -e 'open FH, "ex.txt"; $line1=<FH>; $line2=<FH>; @tab1=split(/\s+/, $line1); @tab2=split(/\s+/, $line2); for ($i=0; $i<14; $i+=4) { print $tab1[$i]."/".$tab2[$i]."\n"; } close FH;'

Result:

1/1
5/1
9/1
13/3

Spread over several lines:

# open file
open FH, "ex.txt";
# extract the two lines
$line1=<FH>;
$line2=<FH>;
# extract the elements for each 
@tab1=split(/\s+/, $line1);
@tab2=split(/\s+/, $line2);
# and print, here step 4
for ($i=0; $i<14; $i+=4) { 
  print $tab1[$i]."/".$tab2[$i]."\n";
}
close FH;

With 1.6 million items, this consumes a lot of memory!
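One way to lower the memory use and drop the hard-coded bound of 14 is to keep only the sampled values from the first row, then pair them with the second row as it is read. The awk sketch below does that. The name paired_every_nth is made up for illustration, and awk still holds one full row at a time while it splits it.

```shell
# paired_every_nth N [FILE]: print "first-row/second-row" pairs for columns
# 1, 1+N, 1+2N, ... paired_every_nth is a hypothetical helper name.
paired_every_nth() {
    n=$1
    shift
    awk -v n="$n" '
        NR == 1 {                       # first row: remember only every nth value
            for (i = 1; i <= NF; i += n) pos[i] = $i
            next
        }
        NR == 2 {                       # second row: print the matching pairs
            for (i = 1; i <= NF; i += n)
                if (i in pos) print pos[i] "/" $i
        }
    ' "$@"
}

# Same data as ex.txt above, with a step of 4:
printf '1 2 3 4 5 6 7 8 9 10 11 12 13\n1 1 1 1 1 1 1 2 1 2  1  1  3\n' |
    paired_every_nth 4
# 1/1
# 5/1
# 9/1
# 13/3
```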

Print only every nth column from a CSV file
