
Time: 2019-07-10 14:38:31

Tags: bash awk


ID   Time   A_in   Time  B_in  Time  C_in
Ax   0.1    10     0.1   15    0.1   45  
By   0.2    12     0.2   35    0.2   30  
Cz   0.3    20     0.3   20    0.3   15  
Fr   0.4    35     0.4   15    0.4   05  
Exp  0.5    10     0.5   25    0.5   10

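The columns I am interested in are the ones whose header ends in "_in". In each of those columns, I want to subtract the value in the row with ID "Exp" from every element of the column. Take the A_in column, where the "Exp" row has the value 10: I want to subtract 10 from all the other elements of A_in.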


# This part grabs all the values in the Exp row
Exp=$( awk 'BEGIN{OFS="\t";
            PROCINFO["sorted_in"] = "@val_num_asc"}
    FNR==1 { for (n=2;n<=NF;n++) { if ($n ~ /_in$/) cols[$n]=n; }}
    /Exp/ {
           for (c in cols){
           shift = $cols[c]
           printf "%s ", shift
           }
    }
        ' File.txt |paste -sd " ")

# One pass over the file per Exp value; z is the field number of the current _in column
Exp_array=($Exp)
z=1
for i in "${Exp_array[@]}"
do
z=$(echo 2+$z | bc -l)
awk  -vd="$i" -vloop="$z" '
            BEGIN{OFS="\t";
            PROCINFO["sorted_in"] = "@val_num_asc"}
            function abs(x) {return x<0?-x:x}
            FNR==1 { for (n=2;n<=NF;n++) { if ($n ~ /_in$/) cols[$n]=n; }}
            FNR>1 { $loop=abs($loop-d) }
            { print }
         ' File.txt
done


My first desired outcome is this:

ID   Time   A_in   Time  B_in  Time  C_in
Ax   0.1    0.0    0.1   10    0.1   35  
By   0.2    02     0.2   10    0.2   20  
Cz   0.3    10     0.3   05    0.3   05  
Fr   0.4    25     0.4   10    0.4   05  
Exp  0.5    0.0    0.5   0.0   0.5  0.0

Now, from each "_in" column, I want to find the IDs that correspond to the 2 smallest values. So my second desired outcome is

A_in   B_in  C_in
Ax     Cz    Cz 
By     Exp   Fr 
Exp          Exp

use warnings;
use strict;
use feature qw{ say };

@ARGV = (@ARGV[0, 0]);  # Read the input file twice.

my @header = split ' ', <>;
my @in = grep $header[$_] =~ /_in$/, 0 .. $#header;
$_ = <> until eof;
my @exp = split;

my @min;
say join "\t", @header;
<>;  # Skip the header line on the second pass.
while (<>) {
    my @F = split;
    for my $i (@in) {
        $F[$i] = abs($F[$i] - $exp[$i]);
        # Keep the two smallest [value, ID] pairs seen so far (the last row is skipped).
        @{ $min[$i] }[0, 1]
            = sort { $a->[0] <=> $b->[0] }
                   [$F[$i], $F[0]], grep defined, @{ $min[$i] // [] }
            unless eof;
    }
    say join "\t", @F;
}

print "\n";
say join "\t", @header[@in];
for my $index (0, 1) {
    for my $i (@in) {
        next unless $header[$i] =~ /_in$/;
        print $min[$i][$index][1], "\t";
    }
    print "\n";
}




cat <<EOF >file
ID   Time   A_in   Time  B_in  Time  C_in
Ax   0.1    10     0.1   15    0.1   45  
By   0.2    12     0.2   35    0.2   30  
Cz   0.3    20     0.3   20    0.3   15  
Fr   0.4    35     0.4   15    0.4   05  
Exp  0.5    10     0.5   25    0.5   10
EOF
# fix stackoverflow formatting
# input file should be separated with tabs
<file tr -s ' ' | tr ' ' '\t' > file2
mv file2 inputfile

# read headers to an array
IFS=$'\t' read -r -a hdrs < <(head -n1 inputfile)

# exp line read into an array
IFS=$'\t' read -r -a exps < <(grep -m1 $'^Exp\t' inputfile)

# column count
colcnt=${#hdrs[@]}
if [ "$colcnt" -eq 0 ]; then
    echo >&2 "ERROR - must be at least one column"
    exit 1
fi

# numbers of those columns whose headers have the _in suffix
incolnums=$(
    paste <(
        printf "%s\n" "${hdrs[@]}"
    ) <(
        # puff, the numbers will start from zero because bash indexes arrays from zero
        # but `cut` indexes fields from 1, so.. just keep in mind it's from 0
        seq 0 $((colcnt - 1))
    ) |
    grep $'_in\t' |
    cut -f2
)

# read the input file
{
    # preserve header line
    IFS= read -r hdrline
    ( IFS=$'\t'; printf "%s\n" "$hdrline" )

    # ok. read the file field by field
    # I think we could awk here
    while IFS=$'\t' read -r -a vals; do

        # for each column number with _in suffix
        while IFS= read -r incolnum; do

            # update the column value
            # I use bc for float calculations
            # (fed through a here-string, so the indentation does not have to be tabs)
            vals[$incolnum]=$(bc <<<"
                define abs(i) {
                    if (i < 0) return (-i)
                    return (i)
                }
                abs(${vals[$incolnum]} - ${exps[$incolnum]})
            ")

        done <<<"$incolnums"

        # output the line
        ( IFS=$'\t'; printf "%s\n" "${vals[*]}" )

    done
} < inputfile > MyFirstDesiredOutcomeIsThis.txt

# ok so, first part done

{
    # output headers names with _in suffix
    printf "%s\n" "${hdrs[@]}" | 
    grep '_in$' |
    tr '\n' '\t' |
    # omg, fix tr, so stupid
    sed 's/\t$/\n/'

    # puff
    # output the corresponding ID of 2 smallest values of the specified column number
    # @arg: $1 column number
    tmpf() {
        # remove header line
        <MyFirstDesiredOutcomeIsThis.txt tail -n+2 |
        # extract only this column
        cut -f$(($1 + 1)) |
        # unique numeric sort and extract two smallest values
        sort -n -u | head -n2 |
        # now, well, extract the id's that match the numbers
        # append numbers with tab (to match the separator)
        # suffix numbers with dollar (to match end of line)
        sed 's/^/\t/; s/$/$/;' |
        # how good is grep at buffering(!)
        grep -f /dev/stdin <(
            <MyFirstDesiredOutcomeIsThis.txt tail -n+2 |
            cut -f1,$(($1 + 1))
        ) |
        # extract IDs only
        cut -f1
    }

    # the following is something like foldr $'\t' $(tmpf ...) for each $incolnums
    # we need to buffer here, we are joining the output column-wise
    output=
    while IFS= read -r incolnum; do
        output=$(<<<"$output" paste - <(tmpf "$incolnum"))
    done <<<"$incolnums"

    # because we start with an empty $output, paste inserts leading tabs
    # remove them ... and finally output $output
    <<<"$output" cut -f2-

}  > MySecondDesiredOutcomeIs.txt

# fix formatting to post it on stackoverflow
# files have tabs, and column will output them with space
# which is just enough
echo '==> MyFirstDesiredOutcomeIsThis.txt <=='
column -t -s$'\t' MyFirstDesiredOutcomeIsThis.txt
echo '==> MySecondDesiredOutcomeIs.txt <=='
column -t -s$'\t' MySecondDesiredOutcomeIs.txt


==> MyFirstDesiredOutcomeIsThis.txt <==
ID   Time  A_in  Time  B_in  Time  C_in
Ax   0.1   0     0.1   10    0.1   35
By   0.2   2     0.2   10    0.2   20
Cz   0.3   10    0.3   5     0.3   5
Fr   0.4   25    0.4   10    0.4   5
Exp  0.5   0     0.5   0     0.5   0

==> MySecondDesiredOutcomeIs.txt <==
A_in  B_in  C_in
Ax    Cz    Cz
By    Exp   Fr
Exp         Exp


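I use bash and core-/more-utils to manipulate the file. First I identify the numbers of the columns that have the _in suffix. Then I take the values stored in the Exp row.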
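As a side note, these two lookups could also be done in one awk pass. The following is only a sketch under assumptions, not part of the script above: the file name sample.txt and the variable in_cols are made up for the illustration, and the column numbers it prints are awk's 1-based field numbers rather than the 0-based bash array indexes used in the script.

```shell
# Hypothetical sample file (name is an assumption), whitespace-separated as in the question.
cat > sample.txt <<'EOF'
ID   Time   A_in   Time  B_in  Time  C_in
Ax   0.1    10     0.1   15    0.1   45
By   0.2    12     0.2   35    0.2   30
Cz   0.3    20     0.3   20    0.3   15
Fr   0.4    35     0.4   15    0.4   05
Exp  0.5    10     0.5   25    0.5   10
EOF

# For every column whose header ends in _in, print:
#   header name, 1-based field number, value found in the Exp row
in_cols=$(awk '
    FNR == 1 {
        for (n = 2; n <= NF; n++)
            if ($n ~ /_in$/) { cols[++k] = n; name[k] = $n }
        next
    }
    $1 == "Exp" {
        for (j = 1; j <= k; j++)
            print name[j], cols[j], $(cols[j])
    }
' sample.txt)

printf '%s\n' "$in_cols"
```

Each output line then carries everything the later subtraction step needs for one column.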

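Then I just read the file line by line, field by field, and for each field that sits in a column whose header ends in _in, I take the absolute difference between the field value and the value from the Exp row. I think this part should be the slowest (I use a plain while IFS=$'\t' read -r -a vals), but a smart awk script could speed it up. This generates your "first desired outcome", as you called it.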
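To illustrate the remark about awk, here is a rough sketch of how that step might look when the file is read twice by a single awk process. It is an assumption-laden alternative, not the script above: sample.txt and the variable first are hypothetical names, and the `_in` columns are found by matching the header instead of by position.

```shell
# Hypothetical sample file (name is an assumption), whitespace-separated as in the question.
cat > sample.txt <<'EOF'
ID   Time   A_in   Time  B_in  Time  C_in
Ax   0.1    10     0.1   15    0.1   45
By   0.2    12     0.2   35    0.2   30
Cz   0.3    20     0.3   20    0.3   15
Fr   0.4    35     0.4   15    0.4   05
Exp  0.5    10     0.5   25    0.5   10
EOF

first=$(awk '
    BEGIN { OFS = "\t" }
    # Pass 1: remember the fields of the Exp row.
    NR == FNR { if ($1 == "Exp") split($0, exps); next }
    # Pass 2, header line: note which fields end in _in, then print the header.
    FNR == 1 {
        for (n = 2; n <= NF; n++) if ($n ~ /_in$/) is_in[n] = 1
        $1 = $1
        print
        next
    }
    # Pass 2, data lines: replace each _in field by |field - Exp value|.
    {
        for (n in is_in) {
            d = $n - exps[n]
            $n = (d < 0 ? -d : d)
        }
        print
    }
' sample.txt sample.txt)

printf '%s\n' "$first"
```

Because only the `_in` fields are reassigned, the ID and Time columns pass through untouched.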

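Then I only need to output the header names that end in _in. Next, for each column number with the _in suffix, I need to identify the 2 smallest values in that column. I use a plain sort -n -u | head -n2. Then it gets a bit tricky: I need to extract the IDs whose value in that column is one of those 2 smallest values. This is a job for grep -f. I use sed to turn the values into proper regexes and let grep -f /dev/stdin do the filtering.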
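The same selection (two smallest distinct values per column, then every ID that holds one of them, in row order) can be sketched in awk alone. This is a hedged illustration rather than the method used above: diffs.txt and the variable second are hypothetical names, and the input is assumed to be the already-subtracted table.

```shell
# Hypothetical input (name is an assumption): the table after the subtraction step.
cat > diffs.txt <<'EOF'
ID   Time  A_in  Time  B_in  Time  C_in
Ax   0.1   0     0.1   10    0.1   35
By   0.2   2     0.2   10    0.2   20
Cz   0.3   10    0.3   5     0.3   5
Fr   0.4   25    0.4   10    0.4   5
Exp  0.5   0     0.5   0     0.5   0
EOF

second=$(awk '
    BEGIN { OFS = "\t" }
    FNR == 1 {
        for (n = 2; n <= NF; n++)
            if ($n ~ /_in$/) { cols[++k] = n; name[k] = $n }
        next
    }
    {
        id[++rows] = $1
        for (j = 1; j <= k; j++) {
            v = $(cols[j]) + 0
            val[rows, j] = v
            # Track the two smallest distinct values of column j.
            if (!(j in min1) || v < min1[j]) {
                if (j in min1) min2[j] = min1[j]
                min1[j] = v
            } else if (v > min1[j] && (!(j in min2) || v < min2[j])) {
                min2[j] = v
            }
        }
    }
    END {
        # Collect, per column and in row order, the IDs holding one of those values.
        for (j = 1; j <= k; j++) {
            for (r = 1; r <= rows; r++)
                if (val[r, j] == min1[j] || ((j in min2) && val[r, j] == min2[j]))
                    picked[++cnt[j], j] = id[r]
            if (cnt[j] > maxn) maxn = cnt[j]
        }
        for (j = 1; j <= k; j++) printf "%s%s", name[j], (j < k ? OFS : ORS)
        # Columns with fewer matches get empty cells.
        for (i = 1; i <= maxn; i++)
            for (j = 1; j <= k; j++)
                printf "%s%s", picked[i, j], (j < k ? OFS : ORS)
    }
' diffs.txt)

printf '%s\n' "$second"
```

Holding the table in memory trades the repeated sort and grep passes for a single read, which should be fine for inputs of this size.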

$ cat tst.awk
BEGIN   { OFS="\t" }
NR==FNR { if ($1=="Exp") split($0,exps); next }
FNR==1  { $1=$1; print; next }
{
    printf "%s", $1    # keep the ID column as-is
    for (i=2; i<=NF; i++) {
        # even fields are the Time columns, odd fields are the _in columns
        val = ( i % 2 ? exps[i] - $i : $i )
        printf "%s%s", OFS, (val < 0 ? -val : val)
    }
    print ""
}

$ awk -f tst.awk file file
ID      Time    A_in    Time    B_in    Time    C_in
Ax      0.1     0       0.1     10      0.1     35
By      0.2     2       0.2     10      0.2     20
Cz      0.3     10      0.3     5       0.3     5
Fr      0.4     25      0.4     10      0.4     5
Exp     0.5     0       0.5     0       0.5     0

