Question

I have a large data set (12,000 rows x 14 columns); the first 4 rows are as follows:

x1  y1  0.02    NAN NAN NAN NAN NAN NAN 0.004   NAN NAN NAN NAN
x2  y2  NAN 0.003   NAN 10  NAN 0.03    NAN 0.004   NAN NAN NAN NAN
x3  y3  NAN NAN NAN NAN NAN NAN NAN NAN NAN NAN NAN NAN
x4  y4  NAN 0.004   NAN NAN NAN NAN 10  NAN NAN 30  NAN 0.004

I need to delete any row that has "NAN" throughout columns 3-14, and then output the rest of the data set. I wrote the following code:

#!usr/bin/perl

use warnings;
use strict;
use diagnostics;

open(IN, "<", "file1.txt") or die "Can't open file for reading:$!";

open(OUT, ">", "file2.txt") or die "Can't open file for writing:$!";

my $header = <IN>;
print OUT $header;

my $at_line = 0;

my $col3;
my $col4;
my $col5;
my $col6;
my $col7;
my $col8;
my $col9;
my $col10;
my $col11;
my $col13;
my $col14;
my $col15;

while (<IN>){
chomp;
my @sections = split(/\t/);

$col3 = $sections[2];
$col4 = $sections[3];;
$col5 = $sections[4];
$col6 = $sections[5];
$col7 = $sections[6];
$col8 = $sections[7];
$col9 = $sections[8];
$col10 = $sections[9];
$col11 = $sections[10];
$col13 = $sections[11];
$col14 = $sections[12];
$col15 = $sections[13];

if ($col3 eq "NAN" && $col4 eq "NAN" && $col5 eq "NAN" && $col6 eq "NAN" && $col7 eq "NAN" && $col8 eq "NAN" && $col9 eq "NAN" && $col10 eq "NAN" 
&& $col11 eq "NAN" && $col12 eq "NAN" && $col13 eq "NAN" && $col14 eq "NAN" && $col5 eq "NAN"){
    $at_line = $.;
    }   
    else {
        print OUT "$_\n";
    }
}

close(IN);
close(OUT);

Running this code gives the following error:

Use of uninitialized value $col3 in string eq at filter.pl
    line 46, <IN> line 2 (#1)

How can I get this program to work? Thanks.

Answer 1

One-liner:

$ perl -lane 'print if join("", @F[2..13]) ne "NAN" x 12' <file1.txt >file2.txt
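For readers unfamiliar with the switches: -n wraps the code in a loop over the input lines, -a splits each line on whitespace into @F, and -l strips the newline on input and adds it back on print. Below is a rough spelled-out sketch of the same test. The helper name keep_row is made up for illustration, and the two sample rows are copied from the question.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): true when the row should be kept,
# i.e. when columns 3-14 (indices 2..13) are not all "NAN".
sub keep_row {
    my ($line) = @_;
    my @F = split ' ', $line;    # same whitespace split that -a performs
    return join("", @F[2..13]) ne "NAN" x 12;
}

# Sample rows taken from the question.
my @rows = (
    "x1 y1 0.02 NAN NAN NAN NAN NAN NAN 0.004 NAN NAN NAN NAN",
    "x3 y3 NAN NAN NAN NAN NAN NAN NAN NAN NAN NAN NAN NAN",
);

my @kept = grep { keep_row($_) } @rows;
print "$_\n" for @kept;    # prints only the x1 row
```

This assumes every row really has 14 fields; a shorter row would leave undefined entries in the slice and trigger warnings under "use warnings".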

Answer 2

Zaid's one-liner is the best solution for your particular case. In general, though, instead of defining so many scalars, your pattern should be

my @required_columns = (split /\s+/)[2..13];

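The error you are getting appears to come from splitting on tabs when your data set is space-separated. With no tabs in the line, split /\t/ returns the whole line as a single field, so $sections[2] onward are undefined. Remember that split takes a regular expression, not a string.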
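A quick way to check this diagnosis is to split one line both ways and count the fields. The sketch below assumes the file is space-separated, as the sample in the question suggests; the variable names are only illustrative.

```perl
use strict;
use warnings;

# Second sample row from the question, separated by spaces (no tabs).
my $line = "x2  y2  NAN 0.003   NAN 10  NAN 0.03    NAN 0.004   NAN NAN NAN NAN";

my @by_tab        = split /\t/,  $line;   # no tab present: one field, the whole line
my @by_whitespace = split /\s+/, $line;   # runs of whitespace: 14 fields

printf "split on tab:        %d field(s)\n", scalar @by_tab;          # 1
printf "split on whitespace: %d field(s)\n", scalar @by_whitespace;   # 14

# With the tab split, index 2 does not exist, which is what makes
# "$col3 eq ..." warn about an uninitialized value.
print "third field after tab split: ",
      (defined $by_tab[2] ? $by_tab[2] : "undef"), "\n";
```

If the tab split also reports 14 fields on your real file, the data is tab-separated after all and the cause lies elsewhere.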

Answer 3

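while (<IN>) {
    # Take columns 3-14 (indices 2..13) of the whitespace-separated line.
    my @values = (split /\s+/)[2..13];
    # In scalar context, grep returns how many values equal 'NAN'.
    my $nan_count = grep { $_ eq 'NAN' } @values;
    # Keep the line unless all 12 columns are NAN ($_ still ends in a newline).
    print $_ unless $nan_count == 12;
}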

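Joseph R. has the right way to split the lines.

grep returns the number of matches when called in scalar context, so this gives another way to check whether all of the columns equal NAN.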
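To make the context difference concrete, here is a small sketch that uses columns 3-14 of the x4 row from the question. The variable names are only illustrative.

```perl
use strict;
use warnings;

# Columns 3-14 of the x4 row from the question.
my @values = qw(NAN 0.004 NAN NAN NAN NAN 10 NAN NAN 30 NAN 0.004);

my @matches   = grep { $_ eq 'NAN' } @values;   # list context: the matching elements
my $nan_count = grep { $_ eq 'NAN' } @values;   # scalar context: how many matched

print "NAN columns: $nan_count of ", scalar(@values), "\n";   # 8 of 12
print "row is ", ($nan_count == @values ? "dropped" : "kept"), "\n";   # kept
```

Comparing the count against the number of columns, rather than a hard-coded 12, keeps the test correct if the slice is later widened or narrowed.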

Removing rows that contain only NAN from a large data set