Question

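I have two tab-delimited files. File 1 contains identifiers, and file 2 holds the values associated with those identifiers (in other words, it is a very large dictionary).

File 1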

Ronny
Rubby
Suzie
Paul

File 1 has only one column.

File 2

Alistar  Barm  Cathy  Paul  Ronny  Rubby  Suzie  Tom  Uma  Vai   Zai
12       13    14     12    11     11     12     23   30   0.34  0.65
1        4     56     23    12     8.9    5.1    1    4    25    3

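File 2 contains n rows.

What I want is this: if an identifier from file 1 is present in file 2, all the values associated with it should be written to another tab-delimited file.

Something like this: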

Paul  Ronny  Rubby  Suzie
12    11     11     12
23    12     8.9    5.1

Thanks in advance.

Answer 1

Note

Your sample output is incorrect: it contains "Ruby", but your file1 example has "Rubby", and Ruby != Rubby.

kent$ awk 'NR==FNR{t[$0]++;next}
{if(FNR==1){                       # header row of file2
    for(i=1;i<=NF;i++)
        if($i in t){ v[i]++; printf "%s\t", $i; }
    print "";
 }else{                            # data rows of file2
    for(x in v) printf "%s\t", $x;
    print "";
 }
}' file1 file2

Output

Paul Ronny Suzie
12 11 12
23 12 5.1
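
A spelling mismatch like the one described in the note is easy to miss by eye. As a minimal sketch in Python (the function name `missing_identifiers` and the sample data are hypothetical, not part of this answer), the identifiers that have no matching column in the header can be listed before anything is extracted. It assumes the header is tab-delimited, as the question states.

```python
def missing_identifiers(identifiers, header_line):
    """Return the identifiers that do not appear as a column name in the header.

    Assumes a tab-delimited header line, as described in the question.
    """
    columns = set(header_line.rstrip("\n").split("\t"))
    return [name for name in identifiers if name not in columns]


if __name__ == "__main__":
    # Sample data only: the header spells the name "Ruby",
    # while the identifier list spells it "Rubby".
    header = "Alistar\tBarm\tCathy\tPaul\tRonny\tRuby\tSuzie\tTom\n"
    wanted = ["Ronny", "Rubby", "Suzie", "Paul"]
    print(missing_identifiers(wanted, header))  # prints ['Rubby']
```

An empty list means every identifier has a column, so the extracted output should have as many columns as file 1 has lines.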

Answer 2

$ awk 'FILENAME~1{a[$0];next};FNR==1{for(i=1;i<=NF;i++)if($i in a)b[i]};{for(j in b)printf("%s\t",$j);print ""}' file{1,2}.txt
Paul    Ronny   Suzie
12      11      12
23      12      5.1

Broken out onto multiple lines, with whitespace added:

$ awk '
> FILENAME~1 { a[$0]; next }
> FNR==1 { for(i=1; i<=NF; i++) if($i in a) b[i] }
> { for(j in b) printf("%s\t",$j); print ""}
> ' file{1,2}.txt

Paul    Ronny   Suzie
12      11      12
23      12      5.1

Answer 3

You can do this using only bash:

# -w and -F make grep match whole names literally instead of regex substrings
FIELDS=$(head -1 f2.txt | tr "\t" "\n" | nl -ba | grep -wFf f1.txt | cut -f1 | tr -d " " | tr "\n" ","); FIELDS=${FIELDS%,}
cut -f"$FIELDS" f2.txt
Paul    Ronny   Ruby    Suzie
12  11  11  12
23  12  8.9 5.1

Answer 4

A Python example that does the job as a stream (i.e., it does not need to load the whole file before it starts producing output):

# read keys
with open('file1', 'r') as fd:
    keys = fd.read().splitlines()

# read data file, with header line and content
with open('file2', 'r') as fd:
    headers = fd.readline().split()
    # column index of every key that actually appears in the header
    columns = [headers.index(x) for x in keys if x in headers]

    # output the matched keys
    print('\t'.join(headers[i] for i in columns))

    for row in fd:
        line = row.split()
        if len(line) == 0:
            break
        print('\t'.join(line[i] for i in columns))

Output:

$ python test.py 
Ronny   Ruby    Suzie   Paul
11      11      12      12
12      8.9     5.1     23
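
One possible variant of the same streaming idea, sketched here with the standard `csv` module: splitting strictly on tabs keeps empty fields in place, whereas `split()` with no argument collapses runs of whitespace. The function name `select_columns` and the in-memory sample data are assumptions made for illustration and are not part of the answer above.

```python
import csv
import io


def select_columns(keys, rows):
    """Yield the header, then each data row, restricted to the columns in keys.

    rows is any iterator of lists (for example a csv.reader), so only one
    row is held in memory at a time.
    """
    header = next(rows)
    # Keep the order of keys, skipping any key that is absent from the header.
    indices = [header.index(k) for k in keys if k in header]
    yield [header[i] for i in indices]
    for row in rows:
        if not row:  # csv.reader yields [] for blank lines
            continue
        yield [row[i] for i in indices]


if __name__ == "__main__":
    # In-memory stand-ins for file1 and file2.
    keys = ["Ronny", "Suzie", "Paul"]
    data = io.StringIO(
        "Cathy\tPaul\tRonny\tSuzie\n14\t12\t11\t12\n56\t23\t12\t5.1\n"
    )
    reader = csv.reader(data, delimiter="\t")
    for out in select_columns(keys, reader):
        print("\t".join(out))
```

With real files, `csv.reader(open('file2', newline=''), delimiter='\t')` would take the place of the `io.StringIO` stand-in. A data row with fewer fields than the header raises an `IndexError` in this sketch.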

Answer 5

Perl solution:

#!/usr/bin/perl
use warnings;
use strict;

open my $KEYS, '<', 'file1' or die $!;
my @keys = <$KEYS>;
close $KEYS;
chomp @keys;
my %is_key;
undef @is_key{@keys};

open my $TAB, '<', 'file2' or die $!;
$_ = <$TAB>;
my $i = 0;    # start at 0 so the first column gets a defined index
my @columns;
for (split) {
    push @columns, $i if exists $is_key{$_};
    $i++;
}
do {{
    my @values = split;
    print join("\t", @values[@columns]), "\n";
}} while <$TAB>;

Answer 6

Something like this might work, depending on what you want.

use strict;
use warnings;

# Assumption: the two file paths are passed on the command line.
my ( $name_file_path, $data_file_path ) = @ARGV;

my %names;
open ( my $nh, '<', $name_file_path ) or die "Could not open '$name_file_path'!";
while ( <$nh> ) {
    # store each name with surrounding whitespace trimmed
    m/^\s*(.*?\S)\s*$/ and $names{ $1 } = 1;
}
close $nh;

open ( my $dh, '<', $data_file_path ) or die "Could not open '$data_file_path'!";

my ( @name_list, @col_list );
chomp( my $header = <$dh> );
my @names = split /\t/, $header;
foreach my $coln ( 0..$#names ) {
    next unless exists $names{ $names[ $coln ] };
    push @name_list, $names[ $coln ];    # the column name
    push @col_list, $coln;               # its zero-based position
}
local $" = "\t";
print "@name_list\n";
while ( <$dh> ) {
    chomp;
    print join( "\t", ( split /\t/ )[ @col_list ] ), "\n";
}
close $dh;

Answer 7

This might work for you:

 sed '1{s/\t/\n/gp};d' file2 |
 nl |
 grep -f file1 |
 cut -f1 |
 paste -sd, |
 sed 's/ //g;s,.*,cut -f& file2,' |
 sh

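Explanation:

Pivot the column names onto separate lines.
Number the column names.
Match the column names against the input file (file1).
Remove the column names, keeping the column numbers.
Join the column numbers, separated by commas.
Build a cut command from the comma-separated list of column numbers.
Run the cut command against the data file.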
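
The steps above can also be sketched in Python, to make the intermediate result (the comma-separated list of column numbers) visible. This is an illustration under stated assumptions rather than a translation of the pipeline: the function name `cut_field_list` is hypothetical, and it matches column names exactly, whereas `grep -f` matches each line of file1 as a regular expression anywhere in the numbered line.

```python
def cut_field_list(header_line, identifiers):
    """Return a comma-separated list of 1-based column numbers for `cut -f`."""
    # Pivot: one column name per list entry.
    names = header_line.rstrip("\n").split("\t")
    wanted = set(identifiers)
    # Number the names from 1 and keep only the numbers of exact matches.
    numbers = [str(n) for n, name in enumerate(names, start=1) if name in wanted]
    # Join the column numbers with commas.
    return ",".join(numbers)


if __name__ == "__main__":
    header = "Alistar\tBarm\tCathy\tPaul\tRonny\tRubby\tSuzie\tTom\n"
    fields = cut_field_list(header, ["Ronny", "Rubby", "Suzie", "Paul"])
    # Only build the command string here; running it is left to the caller.
    print("cut -f" + fields + " file2")  # prints: cut -f4,5,6,7 file2
```

If no identifier matches, the function returns an empty string, and the resulting `cut -f` command would be invalid, so a caller would want to check for that case first.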

Extracting data from a dictionary
