How can I subtract the value of a specific row from the values of all other rows?

Asked: 2019-07-10 14:38:31

Tags: bash awk

My current working file looks like this:

ID   Time   A_in   Time  B_in  Time  C_in
Ax   0.1    10     0.1   15    0.1   45  
By   0.2    12     0.2   35    0.2   30  
Cz   0.3    20     0.3   20    0.3   15  
Fr   0.4    35     0.4   15    0.4   05  
Exp  0.5    10     0.5   25    0.5   10

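The columns I am interested in are the ones whose headers end in "_in". In each of these columns, I want to subtract the value in the row with ID "Exp" from the value in every row. Take the A_in column, where the "Exp" row holds 10: I want to subtract 10 from every other element of that A_in column.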

My amateur code looks like this (I know it is silly):

# This part grabs all the values in the Exp row
Exp=$( awk 'BEGIN{OFS="\t";
            PROCINFO["sorted_in"] = "@val_num_asc"}
    FNR==1 { for (n=2;n<=NF;n++) { if ($n ~ /_GasOut$/) cols[$n]=n; }}
    /Exp/ {
           for (c in cols){
           shift = $cols[c]
           printf shift" "
           }
       }

        ' File.txt |paste -sd " ") 
Exp_array=($Exp)

z=1
for i in "${Exp_array[@]}"
do
z=$(echo 2+$z | bc -l)
Exp_point=$i
awk  -vd="$Exp_point" -vloop="$z" -v  '
            BEGIN{OFS="\t";
            PROCINFO["sorted_in"] = "@val_num_asc"}
            function abs(x) {return x<0?-x:x}
            FNR==1 { for (n=2;n<=NF;n++) { if ($n ~ /_GasOut$/) cols[$n]=n; }}
        NR>2{
            $loop=abs($loop-d); print
            }
         ' File.txt
done

My first desired outcome is:

ID   Time   A_in   Time  B_in  Time  C_in
Ax   0.1    0.0    0.1   10    0.1   35  
By   0.2    02     0.2   10    0.2   20  
Cz   0.3    10     0.3   05    0.3   05  
Fr   0.4    25     0.4   10    0.4   05  
Exp  0.5    0.0    0.5   0.0   0.5  0.0

Now, for each "_in" column, I want to find the IDs corresponding to the 2 smallest values. So my second desired outcome is:

A_in   B_in  C_in
Ax     Cz    Cz 
By     Exp   Fr 
Exp          Exp

3 Answers:

Answer 0: (score: 2)

Perl to the rescue!

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

@ARGV = (@ARGV[0, 0]);  # Read the input file twice.

my @header = split ' ', <>;
my @in = grep $header[$_] =~ /_in$/, 0 .. $#header;
$_ = <> until eof;
my @exp = split;

my @min;
<>;
while (<>) {
    my @F = split;
    for my $i (@in) {
        $F[$i] = abs($F[$i] - $exp[$i]);
        @{ $min[$i] }[0, 1]
            = sort { $a->[0] <=> $b->[0] }
                   [$F[$i], $F[0]], grep defined, @{ $min[$i] // [] }
            unless eof;
    }
    say join "\t", @F;
}

print "\n";
say join "\t", @header[@in];
for my $index (0, 1) {
    for my $i (@in) {
        next unless $header[$i] =~ /_in$/;
        print $min[$i][$index][1], "\t";
    }
    print "\n";
}

It reads the file twice. On the first read, it only remembers the first line as the @header array and the last line as the @exp array.

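On the second read, it subtracts the corresponding @exp value from each _in column and keeps the absolute value. It also stores the two smallest numbers, together with their IDs, in the @min array at the index that matches the column position.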
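
The bookkeeping behind @min can be pictured on a single column. This is a minimal sketch in shell/awk rather than Perl, and it is not the code from the answer: the (ID, value) pairs are hypothetical input taken from the A_in differences, the names v1, v2, id1 and id2 are made up for illustration, and on ties it keeps the earlier row, which may differ from what the Perl sort does.

```shell
# Hypothetical (ID, value) pairs: the A_in differences, without the Exp row
pairs='Ax 0
By 2
Cz 10
Fr 25'

two=$(printf '%s\n' "$pairs" | awk '
    # Keep the two smallest values seen so far (v1 <= v2) along with their IDs.
    # A new minimum pushes the old one down to second place.
    n == 0 || $2 < v1 { v2 = v1; id2 = id1; v1 = $2; id1 = $1; n++; next }
    # Otherwise the value may still qualify for second place.
    n == 1 || $2 < v2 { v2 = $2; id2 = $1; n++ }
    END { print id1, id2 }
')

echo "$two"
```

Because only two values per column are kept, the input never has to be sorted as a whole.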

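Formatting the numbers (e.g. 0.0 instead of 0, and 02 instead of 2) is left as an exercise for the reader. The same goes for redirecting the output to several different files.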
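
For the formatting exercise, one possible starting point is sketched below in shell/awk. The rule it applies (print zero as 0.0, pad everything else to two digits) is only inferred from the desired output in the question, and the helper name fmt is hypothetical.

```shell
formatted=$(printf '%s\n' 0 2 5 10 25 | awk '
    # Assumed convention: zero becomes "0.0", other values are zero-padded to two digits
    function fmt(v) { return (v == 0 ? "0.0" : sprintf("%02d", v)) }
    { printf "%s%s", (NR > 1 ? " " : ""), fmt($1) }
    END { print "" }
')

echo "$formatted"
```

If the real data contains fractional differences, the %02d conversion would truncate them, so the format string would need to change.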

Answer 1: (score: 1)

After an hour or two of fun, I wrote this abomination:

cat <<EOF >file
ID   Time   A_in   Time  B_in  Time  C_in
Ax   0.1    10     0.1   15    0.1   45  
By   0.2    12     0.2   35    0.2   30  
Cz   0.3    20     0.3   20    0.3   15  
Fr   0.4    35     0.4   15    0.4   05  
Exp  0.5    10     0.5   25    0.5   10
EOF
# fix stackoverflow formatting
# input file should be separated with tabs
<file tr -s ' ' | tr ' ' '\t' > file2
mv file2 inputfile

# read headers to an array
IFS=$'\t' read -r -a hdrs < <(head -n1 inputfile)

# exp line read into an array
IFS=$'\t' read -r -a exps < <(grep -m1 $'^Exp\t' inputfile)

# column count
colcnt="${#hdrs[@]}"
if [ "$colcnt" -eq 0 ]; then 
    echo >&2 "ERROR - must be at least one column"
    exit 1
fi

# numbers of those columns which headers have _in suffix
incolnums=$(
    paste <(
        printf "%s\n" "${hdrs[@]}"
    ) <(
        # puff, the numbers will start from zero cause bash indexes arrays from zero
        # but `cut` indexes fields from 1, so.. just keep in mind it's from 0
        seq 0 $((colcnt - 1))
    ) |
    grep $'_in\t' |
    cut -f2
)

# read the input file
{
    # preserve header line
    IFS= read -r hdrline
    ( IFS=$'\t'; printf "%s\n" "$hdrline" )

    # ok. read the file field by field
    # I think we could awk here
    while IFS=$'\t' read -r -a vals; do

        # for each column number with _in suffix
        while IFS= read -r incolnum; do

            # update the column value
            # I use bc for float calculations
            vals[$incolnum]=$(bc <<-EOF
                define abs(i) {
                    if (i < 0) return (-i)
                    return (i)
                }
                scale=2
                abs(${vals[$incolnum]} - ${exps[$incolnum]})
EOF
            )

        done <<<"$incolnums"

        # output the line
        ( IFS=$'\t'; printf "%s\n" "${vals[*]}" )

    done

} < inputfile > MyFirstDesiredOutcomeIsThis.txt

# ok so, first part done

{
    # output headers names with _in suffix
    printf "%s\n" "${hdrs[@]}" | 
    grep '_in$' |
    tr '\n' '\t' |
    # omg, fix tr, so stupid
    sed 's/\t$/\n/'

    # puff
    # output the corresponding ID of 2 smallest values of the specified column number
    # @arg: $1 column number
    tmpf() {
        # remove header line
        <MyFirstDesiredOutcomeIsThis.txt tail -n+2 |
        # extract only this column
        cut -f$(($1 + 1)) |
        # unique numeric sort and extract two smallest values
        sort -n -u | head -n2 |
        # now, well, extract the id's that match the numbers
        # append numbers with tab (to match the separator)
        # suffix numbers with dollar (to match end of line)
        sed 's/^/\t/; s/$/$/;' |
        # how good is grep at buffering(!)
        grep -f /dev/stdin <(
            <MyFirstDesiredOutcomeIsThis.txt tail -n+2 |
            cut -f1,$(($1 + 1))
        ) |
        # extract numbers only
        cut -f1
    }

    # the following is something like foldr $'\t' $(tmpf ...) for each $incolnums
    # we need to buffer here, we are joining the output column-wise
    output=""
    while IFS= read -r incolnum; do
        output=$(<<<$output paste - <(tmpf "$incolnum"))
    done <<<"$incolnums"

    # because we start with an empty $output, paste inserts leading tabs
    # remove them ... and finally output $output
    <<<"$output" cut -f2-

}  > MySecondDesiredOutcomeIs.txt

# fix formatting to post it on stackoverflow
# files have tabs, and column will output them with space
# which is just enough
echo '==> MyFirstDesiredOutcomeIsThis.txt <=='
column -t -s$'\t' MyFirstDesiredOutcomeIsThis.txt
echo
echo '==> MySecondDesiredOutcomeIs.txt <=='
column -t -s$'\t' MySecondDesiredOutcomeIs.txt

The script will output:

==> MyFirstDesiredOutcomeIsThis.txt <==
ID   Time  A_in  Time  B_in  Time  C_in
Ax   0.1   0     0.1   10    0.1   35
By   0.2   2     0.2   10    0.2   20
Cz   0.3   10    0.3   5     0.3   5
Fr   0.4   25    0.4   10    0.4   5
Exp  0.5   0     0.5   0     0.5   0

==> MySecondDesiredOutcomeIs.txt <==
A_in  B_in  C_in
Ax    Cz    Cz
By    Exp   Fr
Exp         Exp

Written and tested on tutorialspoint.

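I use bash and core-/more-utils to manipulate the file. First I identify the numbers of the columns whose headers end with the _in suffix. Then I read the values stored in the Exp row.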
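
The column lookup can also be done in one awk call. A minimal sketch, assuming a whitespace-separated header like the one in the question; note that it prints 1-based field numbers (as awk and cut count them), whereas the script above works with 0-based bash array indexes. The variable names header and in_cols are illustrative only.

```shell
# Header line as it appears in the question
header='ID   Time   A_in   Time  B_in  Time  C_in'

# Print the 1-based field number of every header that ends in "_in"
in_cols=$(printf '%s\n' "$header" | awk '{
    for (i = 1; i <= NF; i++)
        if ($i ~ /_in$/) printf "%s%s", (n++ ? " " : ""), i
    print ""
}')

echo "$in_cols"
```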

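Then I just read the file line by line, field by field, and for each field whose column number belongs to a header ending in _in, I subtract the corresponding Exp-row value from the field value. I think this part will be the slowest (I use a plain while IFS=$'\t' read -r -a vals loop), but a clever awk script could speed it up. This generates your "first desired outcome", as you called it.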
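
As a rough idea of what such an awk script might look like, here is a sketch that selects the columns by header name. It assumes the whitespace-separated sample from the question, recreated in a temporary file, and the array names exps and incol are made up for the example; it is not a drop-in replacement for the loop above.

```shell
# Recreate the sample data from the question in a temporary file
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
ID   Time   A_in   Time  B_in  Time  C_in
Ax   0.1    10     0.1   15    0.1   45
By   0.2    12     0.2   35    0.2   30
Cz   0.3    20     0.3   20    0.3   15
Fr   0.4    35     0.4   15    0.4   05
Exp  0.5    10     0.5   25    0.5   10
EOF

# The file is named twice: pass 1 remembers the Exp row,
# pass 2 rewrites every column whose header ends in "_in".
first=$(awk '
    BEGIN { OFS = "\t" }
    NR == FNR { if ($1 == "Exp") split($0, exps); next }
    FNR == 1  {
        for (i = 1; i <= NF; i++) if ($i ~ /_in$/) incol[i] = 1
        $1 = $1; print; next              # re-join the header with tabs
    }
    {
        for (i in incol) { d = $i - exps[i]; $i = (d < 0 ? -d : d) }
        print
    }
' "$tmp" "$tmp")
rm -f "$tmp"

printf '%s\n' "$first"
```

A single awk process replaces one bc invocation per field, which is where most of the time in the shell loop is likely to go.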

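Then I need to output only the header names ending in _in. After that, for each column number with the _in suffix, I need to identify the 2 smallest values in that column; I use a plain sort -n -u | head -n2. Then it gets a bit tricky: I need to extract the IDs that have one of those 2 smallest values in that column. This is a job for grep -f. I use sed to prepare proper regular expressions from the input and let grep -f /dev/stdin do the filtering.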
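
The same two steps can be sketched with a numeric comparison in awk in place of the sed and grep -f pair. This is a swapped-in variant, not the script above: it assumes the first outcome is available as a whitespace-separated file, uses 1-based column numbers, and the function name two_smallest_ids is hypothetical.

```shell
# The "first desired outcome" table, recreated in a temporary file
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
ID   Time  A_in  Time  B_in  Time  C_in
Ax   0.1   0     0.1   10    0.1   35
By   0.2   2     0.2   10    0.2   20
Cz   0.3   10    0.3   5     0.3   5
Fr   0.4   25    0.4   10    0.4   5
Exp  0.5   0     0.5   0     0.5   0
EOF

# $1: 1-based column number; prints the IDs holding one of the two smallest distinct values
two_smallest_ids() {
    awk -v c="$1" 'NR > 1 { print $c }' "$tmp" |    # extract the column, skip the header
    sort -n -u | head -n 2 |                        # two smallest distinct values
    awk -v c="$1" '
        NR == FNR { want[$1 + 0] = 1; next }        # values arriving on stdin
        FNR > 1 && (($c + 0) in want) { print $1 }  # matching rows of the table
    ' - "$tmp"
}

a_ids=$(two_smallest_ids 3 | paste -s -d ' ' -)
b_ids=$(two_smallest_ids 5 | paste -s -d ' ' -)
c_ids=$(two_smallest_ids 7 | paste -s -d ' ' -)
rm -f "$tmp"

printf '%s\n' "A_in: $a_ids" "B_in: $b_ids" "C_in: $c_ids"
```

Comparing numbers instead of building regular expressions sidesteps characters such as "." that have a special meaning in a pattern.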

Answer 2: (score: 0)

Please ask only one question at a time. Here is how to do the first thing you asked about:

$ cat tst.awk
BEGIN   { OFS="\t" }
NR==FNR { if ($1=="Exp") split($0,exps); next }
FNR==1  { $1=$1; print; next }
{
    for (i=1; i<=NF; i++) {
        # only the odd-numbered fields from 3 on are _in columns; ID and Time pass through
        val = ( i > 1 && i % 2 ? exps[i] - $i : $i )
        printf "%s%s", (i > 1 && val < 0 ? -val : val), (i<NF ? OFS : ORS)
    }
}

$ awk -f tst.awk file file
ID      Time    A_in    Time    B_in    Time    C_in
Ax      0.1     0       0.1     10      0.1     35
By      0.2     2       0.2     10      0.2     20
Cz      0.3     10      0.3     5       0.3     5
Fr      0.4     25      0.4     10      0.4     5
Exp     0.5     0       0.5     0       0.5     0

The above will run efficiently and robustly using any awk, in any shell, on every UNIX box.

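If you still need help with the second question after reading this, re-reading the awk answers you have received before, and looking through the awk man page, then ask a new, standalone question about just that.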