Question

I have two files. One of them is just a column vector, for example:

1x23
1y21
1z21
1z25

and the other is a matrix of the form

1x23 1x24 1y21 1y22 1y25 1z22 class
2000 3000 4000 5000 6000 7000 Yes
1500 1200 1100 1510 1410 1117 No

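First, I want to find which rows of the first file match entries in the first row (the header) of the second file. Second, I want to copy the columns of the second file whose headers match and append them to the second file. So, since 1x23 and 1y21 match, I want to copy those two columns of the second file and append them just before the class variable.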

I want my result to be

1x23 1x24 1y21 1y22 1y25 1z22 1x23 1y21 class
2000 3000 4000 5000 6000 7000 2000 4000 Yes
1500 1200 1100 1510 1410 1117 1500 1100 No

I coded this in Perl using three for loops, but because the data is very large, it crashed. I think there should be a more efficient way to do this.

Answer 1

Here's another option:

use strict;
use warnings;

my ( $matrix, @cols ) = pop;
my %headings = map { chomp; $_ => 1 } <>;

push @ARGV, $matrix;
while (<>) {
    my @array = split;
    @cols = grep $headings{ $array[$_] }, 0 .. $#array if $. == 1;
    splice @array, -1, 0, @array[@cols];
    print "@array\n";
}

Usage: perl script.pl vectorFile matrixFile [>outFile]

Output on your dataset:

1x23 1x24 1y21 1y22 1y25 1z22 1x23 1y21 class
2000 3000 4000 5000 6000 7000 2000 4000 Yes
1500 1200 1100 1510 1410 1117 1500 1100 No

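A hash is built from the entries in the vector file. The column positions of all entries that can be found in the first line of the matrix file are saved in @cols. For each matrix line, after it is split, the matching column entries are inserted just before the last element of the split line. Finally, the new line is printed.

Hope this helps!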

Answer 2

Try this one-liner:

awk 'NR==FNR{a[$0]=1;next}FNR==1{for(i=1;i<=NF;i++)if(a[$i])k[i]}{for(x in k)$NF= sprintf("%s ",$x) $NF}7' f1 f2

A more readable version:

awk 'NR==FNR{a[$0]=1;next}
     FNR==1{for(i=1;i<=NF;i++) if(a[$i])k[i]}
     {for(x in k)
          $NF= sprintf("%s ",$x) $NF}7' f1 f2

Output (the appended columns come out in a different order from the requested result):

1x23 1x24 1y21 1y22 1y25 1z22 1y21 1x23 class
2000 3000 4000 5000 6000 7000 4000 2000 Yes
1500 1200 1100 1510 1410 1117 1100 1500 No

Answer 3

This is a long-winded but, IMHO, straightforward approach.

use strict;
use warnings;

open(my $data, '<', 'data.txt') or die "Cannot open data.txt: $!";

# read first row from the data file
my $line = <$data>;
chomp $line;

# create a list of columns
my @cols = split / /, $line;

# create hash with column indexes
my %colindex;
my $i = 0;
foreach my $colname (@cols) {
        $colindex{$colname} = $i++;
}

# Save last column ('class')
my $lastcol = pop @cols;

# get input (column names)
open(my $input, '<', 'input.txt') or die "Cannot open input.txt: $!";
my @colnames = <$input>;
close $input;

# append column names to array if there is a match
foreach (@colnames) {
        chomp;
        if (exists $colindex{$_}) {
                push @cols, $_;
        }
}

# Restore the last column
push @cols, $lastcol;

# Now process your data
open(my $out, '>', 'output.txt') or die "Cannot open output.txt: $!";

# write the header column
print $out join(" ", @cols), "\n";

while ($line = <$data>) {
        chomp $line;
        my @l = split / /, $line;
        # join avoids the trailing space that printing each column with " " leaves
        print $out join(" ", map { $l[$colindex{$_}] } @cols), "\n";
}

close $out;
close $data;

Answer 4

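Not sure why your Perl code crashes. I suggest the following algorithm, which runs in constant memory (and is probably more readable when implemented in Perl than in AWK):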

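1. Read the first file and build a list of column names.
2. Read the first line of the data file (the one with the actual headers).
3. Intersect the two lists to produce a list of column indexes.
4. Read one line of the data file and split it into columns.
5. Create a new array of column values by indexing that line with the list of "required" column indexes you built in step 3. Output it.
6. Repeat steps 4 and 5 until the end of the data file.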
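
The steps above can be sketched in Perl roughly as follows. This is only a minimal sketch, not code from the asker or from any other answer. The sub name append_matching_columns is made up for illustration, and it assumes whitespace-separated fields with the class column last in every row. Only the current row is held in memory at any time.

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch).
# Assumes whitespace-separated fields and that the last column is "class".
sub append_matching_columns {
    my ($names_fh, $matrix_fh, $out_fh) = @_;

    # Step 1: collect the column names from the first file.
    my %wanted;
    while (my $name = <$names_fh>) {
        $name =~ s/\s+\z//;               # strip newline and trailing blanks
        $wanted{$name} = 1 if length $name;
    }

    # Step 2: read the header line of the matrix file.
    my $header = <$matrix_fh>;
    return unless defined $header;
    my @head = split ' ', $header;
    return unless @head;

    # Step 3: intersect, keeping the indexes of the matching columns
    # (the last column, "class", is never a candidate).
    my @idx = grep { $wanted{ $head[$_] } } 0 .. $#head - 1;

    # Steps 4-6: handle the header, then one data row at a time.
    my $emit = sub {
        my @f    = @_;
        my $last = pop @f;
        print {$out_fh} join(' ', @f, @f[@idx], $last), "\n";
    };
    $emit->(@head);
    while (my $line = <$matrix_fh>) {
        my @f = split ' ', $line;
        $emit->(@f) if @f;                # skip blank lines
    }
    return;
}

# Only runs when two file names are given on the command line.
if (@ARGV == 2) {
    my ($names_file, $matrix_file) = @ARGV;
    open my $names,  '<', $names_file  or die "Cannot open $names_file: $!";
    open my $matrix, '<', $matrix_file or die "Cannot open $matrix_file: $!";
    append_matching_columns($names, $matrix, \*STDOUT);
}
```

Usage would be something like perl script.pl vectorFile matrixFile > outFile, where script.pl is a placeholder name. Unlike a version that prepends to the last field, this keeps the appended columns in header order.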

Answer 5

You can try

awk -f app.awk file1.txt file2.txt

where file1.txt is your first file, file2.txt is your second file, and app.awk is

NR==FNR {
    key[$0]++
    next
}
{
    for (i=1; i<=NF; i++)
        C[FNR,i]=$i
}

END {
    for (i=1; i<=NF; i++) 
        if (C[1,i] in key) 
            k[++j]=i                
    nc=j
    for (j=1; j<=FNR; j++) {
        for (i=1; i<NF; i++) 
           printf "%s%s",C[j,i],OFS     
        for (i=1; i<=nc; i++) 
           printf "%s%s",C[j,k[i]],OFS      
        printf "%s%s",C[j,NF],RS
    }
}

Append columns to the right of a file after matching a pattern (awk)
