Question

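I'm trying to work with a list (about 50 columns), and I basically want to select some of the columns (about 7 to 10). However, some of these columns have empty entries. I guess something like this is a minimal working example: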

A    B    C    D    E#note these are 5 tab separated columns
this    that    semething    something more     the end 
this.line    is    very    incomplete    #column E empty
but    this    is    v.very    complete
whereas        this    is    not #column B empty

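As you can see, the 3rd line is empty in the last position.

I want to find an efficient way to replace all the empty fields I'm interested in with a string, say "NA".

Of course, I could do it the following way, but doing this for all 10 columns I have in the real data is not very elegant: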

#!/usr/local/bin/perl
use strict;
use warnings;

open my $data, "<", "$path\\file.txt" or die "Cannot open file: $!"; # $path must hold the correct path

my @selecteddata; my $blankE; my $blankB;
while (<$data>) {
    chomp $_;
    my @line = split "\t";
    # split drops trailing empty fields (undef here); inner empty fields are ''
    if (not defined $line[4] or $line[4] eq '') {
        $blankE = "NA";
    } else {
        $blankE = $line[4];
    }
    if (not defined $line[1] or $line[1] eq '') {
        $blankB = "NA";
    } else {
        $blankB = $line[1];
    }
    push @selecteddata, "$line[0]\t$blankB\t$line[2]\t$line[3]\t$blankE\n";
}
close $data;

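Alternatively, I could preprocess the file and replace all undefined entries with "NA", but I would like to avoid that.

So the main question is: is there a more elegant way to replace the blank entries in the columns I'm interested in?

Thanks!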

Answer 1

The trick to not ignoring trailing tabs is to specify a negative LIMIT as the third argument to split (kudos to ikegami).

map makes it easy to set the "NA" values:

while ( <$data> ) {
    chomp;

    my @fields = split /\t/, $_, -1;

    @fields = map { length($_) ? $_ : 'NA' } @fields;  # Transform @fields

    my $updated = join("\t", @fields) . "\n";

    push @selected_data, $updated ;
}

As a one-liner:

$ perl -lne 'print join "\t", map { length ? $_ : "NA" } split /\t/, $_, -1' input > output
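
Since the question asks for only some of the roughly 50 columns, the same idea can be combined with an array slice. The following is a minimal sketch, assuming 0-based column indices and a few in-memory sample lines; select_columns is a hypothetical helper name, not something from the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: return the chosen columns of one tab-separated line,
# with empty or missing fields replaced by a filler string.
sub select_columns {
    my ($line, $wanted, $filler) = @_;
    chomp $line;
    my @fields = split /\t/, $line, -1;    # -1 keeps trailing empty fields
    # Slicing past the end of @fields yields undef, so check defined() too.
    return map { defined($_) && length($_) ? $_ : $filler } @fields[@$wanted];
}

my @wanted = (1, 4);                       # columns B and E, counted from 0
my @lines  = (
    "this\tthat\tsemething\tsomething more\tthe end\n",
    "this.line\tis\tvery\tincomplete\t\n", # column E empty
    "whereas\t\tthis\tis\tnot\n",          # column B empty
);

print join("\t", select_columns($_, \@wanted, 'NA')), "\n" for @lines;
```

Only the selected columns are tested and filled, so the other columns never need to be touched.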

Answer 2

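I would say that using split and join is undoubtedly the clearest approach, since you will probably need to do that for other parsing anyway. However, this can also be solved using lookaround assertions.

Basically, the boundary between elements is either a tab or the start or end of the string, so if that condition holds in both directions, we have an empty field: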

use strict;
use warnings;

while (<DATA>) {
    s/(?:^|(?<=\t))(?=\t|$)/NA/g;
    print;
}

__DATA__
a   b   c   d   e
a   b   c   d   e
a   b       d   e
    b   c   d   e
a   b           
a   b       d   
a               e

Output:

a       b       c       d       e
a       b       c       d       e
a       b       NA      d       e
NA      b       c       d       e
a       b       NA      NA      NA
a       b       NA      d       NA
a       NA      NA      NA      e

Converting this to a one-liner is trivial, but I'll point out that it can also be done with \K, which saves a few characters: s/(?:\t|^)\K(?=\t|$)/NA/g;
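
The two substitutions are close, but they may not be fully interchangeable. The sketch below runs both on a typical line and on a line whose first two fields are empty; fill_lookbehind and fill_keep are hypothetical wrapper names added here for illustration, and tabs are printed as <TAB> so the result is visible.

```perl
use strict;
use warnings;

# Lookbehind form of the substitution, applied to a copy of the line.
sub fill_lookbehind {
    my ($line) = @_;
    $line =~ s/(?:^|(?<=\t))(?=\t|$)/NA/g;
    return $line;
}

# \K form of the substitution, applied to a copy of the line.
sub fill_keep {
    my ($line) = @_;
    $line =~ s/(?:\t|^)\K(?=\t|$)/NA/g;
    return $line;
}

for my $line ("a\t\tc\t\n", "\t\tc\n") {
    for my $result (fill_lookbehind($line), fill_keep($line)) {
        (my $visible = $result) =~ s/\t/<TAB>/g;   # make tabs visible
        print $visible;
    }
}
```

On the first line both forms agree. On the second, the \K form consumes the leading tab before ^ gets a chance to match, so the very first field stays empty, whereas the lookbehind form fills both.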

Answer 3

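I'm not sure whether simply using a series of substitutions that look for tabs preceded or followed by whitespace would catch everything, but if you have a lazy brain it's quick and easy ;-)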

 perl -pne 's/\t\t/\tNA\t/;s/\t\s/\tNA/;s/^\t/NA\t/' col_data-undef.txt

I'm not sure it looks any less nasty in a tidier script format :-)

#!/usr/bin/env perl
# read_cols.pl - munge tab separated data with empty "cells"
use strict; 
use warnings;

while (<>){
 s/\t\t/\tNA\t/;
 s/\t\s/\tNA/;
 s/^\t/NA\t/;
 print ;
}

Here is the output:

Here are vim buffers of the input and output, with tabs shown as ^I in red ;-)

./read_cols.pl col_data-undef.txt > col_data-NA.txt

[Image: vim buffers showing tabs]

Is everything in the right order? Will it work on 50 columns?!?

Sometimes lazy is good, but sometimes you need @ikegami ... :-)
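
On the question of whether this holds up with many columns, a quick check suggests some caution. The sketch below applies the same three substitutions to a line with several adjacent empty fields and compares the field count before and after; fill_simple and count_fields are hypothetical helper names, and Unix line endings are assumed.

```perl
use strict;
use warnings;

# The three substitutions from the script above, applied to a copy of the line.
sub fill_simple {
    my ($line) = @_;
    $line =~ s/\t\t/\tNA\t/;
    $line =~ s/\t\s/\tNA/;
    $line =~ s/^\t/NA\t/;
    return $line;
}

# Count tab-separated fields, keeping trailing empty ones.
sub count_fields {
    my ($line) = @_;
    chomp $line;
    my @fields = split /\t/, $line, -1;
    return scalar @fields;
}

my $in  = "a\t\t\t\te\n";                  # five fields, three of them empty
my $out = fill_simple($in);
printf "fields before: %d, after: %d\n", count_fields($in), count_fields($out);
```

On this input the count drops from 5 to 4. Without /g each substitution fires only once, and \t\s consumes a second tab (or, on a line with a trailing empty field, the newline), so the split with a negative LIMIT or the lookaround substitution is likely the safer choice for wide data.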

Handling undefined entries in a list
