I have a large dataset (12,000 rows x 14 columns); the first 4 rows are as follows:
x1 y1 0.02 NAN NAN NAN NAN NAN NAN 0.004 NAN NAN NAN NAN
x2 y2 NAN 0.003 NAN 10 NAN 0.03 NAN 0.004 NAN NAN NAN NAN
x3 y3 NAN NAN NAN NAN NAN NAN NAN NAN NAN NAN NAN NAN
x4 y4 NAN 0.004 NAN NAN NAN NAN 10 NAN NAN 30 NAN 0.004
I need to remove any row that has "NAN" throughout columns 3-14, then output the rest of the dataset. I wrote the following code:
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
open(IN, "<", "file1.txt") or die "Can't open file for reading:$!";
open(OUT, ">", "file2.txt") or die "Can't open file for writing:$!";
my $header = <IN>;
print OUT $header;
my $at_line = 0;
my $col3;
my $col4;
my $col5;
my $col6;
my $col7;
my $col8;
my $col9;
my $col10;
my $col11;
my $col13;
my $col14;
my $col15;
while (<IN>){
chomp;
my @sections = split(/\t/);
$col3 = $sections[2];
$col4 = $sections[3];
$col5 = $sections[4];
$col6 = $sections[5];
$col7 = $sections[6];
$col8 = $sections[7];
$col9 = $sections[8];
$col10 = $sections[9];
$col11 = $sections[10];
$col13 = $sections[11];
$col14 = $sections[12];
$col15 = $sections[13];
if ($col3 eq "NAN" && $col4 eq "NAN" && $col5 eq "NAN" && $col6 eq "NAN" && $col7 eq "NAN" && $col8 eq "NAN" && $col9 eq "NAN" && $col10 eq "NAN"
&& $col11 eq "NAN" && $col13 eq "NAN" && $col14 eq "NAN" && $col15 eq "NAN"){
$at_line = $.;
}
else {
print OUT "$_\n";
}
}
close(IN);
close(OUT);
Running this code produces the following error:
Use of uninitialized value $col3 in string eq at filter.pl
line 46, <IN> line 2 (#1)
How can I get this program to work? Thanks.
Answer 0 (score: 4)
A one-liner:
$ perl -lane 'print if join("", @F[2..13]) ne "NAN" x 12' <file1.txt >file2.txt
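For readers unfamiliar with the switches: -n wraps the code in a loop over the input lines, -a splits each line on whitespace into @F, and -l takes care of the line endings. Below is a rough long-hand sketch of the same test, run on two sample rows copied from the question. The helper name is_all_nan is hypothetical, and the guard for short lines is an addition that the one-liner does not have.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the one-liner's test: true when
# fields 3-14 (indices 2..13) are all the string "NAN".
sub is_all_nan {
    my ($line) = @_;
    my @F = split ' ', $line;                # the same whitespace split that -a performs
    my @cols = grep { defined } @F[2..13];   # drop slots that are missing on short lines
    return join("", @cols) eq "NAN" x 12 ? 1 : 0;
}

# Sample rows copied from the question.
my @rows = (
    "x1 y1 0.02 NAN NAN NAN NAN NAN NAN 0.004 NAN NAN NAN NAN",
    "x3 y3 NAN NAN NAN NAN NAN NAN NAN NAN NAN NAN NAN NAN",
);

# What -n and -l do implicitly: loop over the lines and print the survivors.
for my $row (@rows) {
    print "$row\n" unless is_all_nan($row);
}
```

Only the x1 row is printed, since the x3 row is NAN in every one of columns 3-14. Note that the one-liner, unlike the script in the question, does not treat the first line as a header; a header simply passes through because it is not all NAN.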
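Answer 1 (score: 4)
Zaid's one-liner is the best solution for your particular case. In general, rather than defining so many scalars, your pattern should be
my @required_columns = (split /\s+/)[2..13];
The error you're getting appears to be caused by splitting on tabs when your dataset is space-separated: each line then comes back as a single field, so $sections[2] and the rest are undefined. Keep in mind that split takes a regex rather than a string.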
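As a minimal sketch of how that pattern could slot into the original script, the version below splits on whitespace and keeps the header-then-filter structure. The helper name filter_rows, the in-memory filehandles, and the header text are assumptions made so the example runs on its own; in the real script the handles would be opened on file1.txt and file2.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: copy the header, then every row whose columns 3-14
# are not all "NAN", from one filehandle to another.
sub filter_rows {
    my ($in, $out) = @_;
    my $header = <$in>;
    print {$out} $header if defined $header;
    while (my $line = <$in>) {
        # Split on runs of whitespace (spaces or tabs) and keep columns 3-14.
        my @required_columns = (split /\s+/, $line)[2..13];
        my $all_nan = 1;
        for my $value (@required_columns) {
            # A short line leaves undefined slots; treat those as "not NAN".
            $all_nan = 0 unless defined $value && $value eq 'NAN';
        }
        print {$out} $line unless $all_nan;
    }
    return;
}

# Demonstration on in-memory data; the header line is made up for illustration.
my $input = join "\n",
    "id name c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14",
    "x2 y2 NAN 0.003 NAN 10 NAN 0.03 NAN 0.004 NAN NAN NAN NAN",
    "x3 y3 NAN NAN NAN NAN NAN NAN NAN NAN NAN NAN NAN NAN",
    "";
my $output = '';
open my $in,  '<', \$input  or die "Can't open in-memory input: $!";
open my $out, '>', \$output or die "Can't open in-memory output: $!";
filter_rows($in, $out);
close $in;
close $out;
print $output;   # the header and the x2 row; the all-NAN x3 row is dropped
```

Keeping the twelve values in one array means the test no longer depends on twelve separately named scalars, which is where the original condition is easy to get wrong.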
Answer 2 (score: 1)
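while (<IN>) {
    # Split on whitespace and keep columns 3-14 (indices 2..13).
    my @values = (split /\s+/)[2..13];
    # In scalar context, grep returns the number of matching elements.
    my $nan_count = grep { $_ eq 'NAN' } @values;
    # Print the row unless all 12 columns are NAN.
    print $_ unless $nan_count == 12;
}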
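Joseph R. has the right way to split the lines.
grep returns the number of matches when called in scalar context, so this gives another way to check whether all of the columns equal NAN.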
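To illustrate the context-dependent behaviour of grep, here is a small sketch using columns 3-14 of the x4 row from the question. The helper names nan_count and non_nan_values are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: assigning grep to a scalar yields the match count.
sub nan_count {
    my @values = @_;
    my $count = grep { $_ eq 'NAN' } @values;
    return $count;
}

# Hypothetical helper: assigning grep to an array yields the matching elements.
sub non_nan_values {
    my @values = @_;
    my @kept = grep { $_ ne 'NAN' } @values;
    return @kept;
}

# Columns 3-14 of the x4 row from the question.
my @x4 = qw(NAN 0.004 NAN NAN NAN NAN 10 NAN NAN 30 NAN 0.004);

my $count = nan_count(@x4);
my @kept  = non_nan_values(@x4);

print "NAN count: $count\n";        # 8 of the 12 columns are NAN
print "Other values: @kept\n";      # 0.004 10 30 0.004
print "Row is kept\n" unless $count == 12;
```

Comparing the count against 12 drops only rows that are entirely NAN; comparing it against 0 instead would drop rows containing any NAN at all, if that is what is wanted.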