Transpose data based on a unique ID - awk

Date: 2016-04-29 20:00:38

Tags: linux awk gawk

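I really hope you can help. I am completely new to (g)awk and have been struggling with it for the past two weeks.

My original file looks like the one below. One column holds a unique ID and another holds a unique name. The remaining columns are courses, and each field contains (when it is not empty) the mark for that course and that student. So each student has exactly one mark per course: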

Id  Name        Course1 Course2 Course3 Course4 Course5
1   John           55
2   George                                         63
4   Alex                          64
1   John                                   74
3   Emma           63
2   George                64
4   Alex                                   60
2   George         29                   
3   Emma                                           69
1   John                  67
3   Emma                  80
4   Alex           57
2   George                                 91
1   John                          81
1   John                                           34
3   Emma                          75
2   George                        89
4   Alex                                           49
3   Emma                                   78
4   Alex                  69
5   TERRY                 67
6   HELEN                         39 

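This is what I want to achieve: transpose the data (i.e. the marks) based on the unique ID, placing each mark under its corresponding course, like this: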

Id  Name        Course1 Course2 Course3 Course4 Course5
1   John          55      67       81     74      34
2   George        29      64       89     91      63
3   Emma          63      80       75     78      69
4   Alex          57      69       64     60      49
5   TERRY                 67
6   HELEN                          39

This is what I have managed so far:

Id  Name        Course1 Course2 Course3 Course4 Course5
1   John          55            
2   George        29            
3   Emma          63            
4   Alex          57    
5   TERRY
6   HELEN
1   John                  67
2   George                64            
3   Emma                  80            
4   Alex                  69            
5   TERRY                 67
6   HELEN
1   John                           81
2   George                         89
3   Emma                           75
4   Alex                           64
5   TERRY
6   HELEN                          39
                                        ...and so on

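Getting there with what I already know about awk is proving rather tricky (please note that I am not interested in solutions based on sed, perl, etc.). If you can offer some help (preferably not a one-liner), I would ask you to be a bit descriptive, because I am as interested in the method as I am in the solution itself.

Any help is much appreciated.

EDIT: This is the code I wrote to reach the last stage (and where I am stuck):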

#!/bin/bash

files3="*.csv"
for j in $files3
do
    #echo "processing $j..."
    fi13=$(awk -F" " '(NR==1){field13=$13;}{print field13}' ./work1/test1YA.csv)
    fi14=$(awk -F" " '(NR==1){field14=$14;}{print field14}' ./work1/test1YA.csv)
    fi15=$(awk -F" " '(NR==1){field15=$15;}{print field15}' ./work1/test1YA.csv)
    fi16=$(awk -F" " '(NR==1){field16=$16;}{print field16}' ./work1/test1YA.csv)

#   awk -F" " 'BEGIN{OFS=" ";RS="\n"}{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12}' "$j" >> ./work1/test2YA.csv
    awk -F" " -v f13="$fi13" -v f14="$fi14" -v f15="$fi15" -v f16="$fi16" '{if($13==f13){$13=$6;$14=$15=$16=""}if($13==f14){$14=$6;$13=$15=$16=""}if($13==f15){$15=$6;$13=$14=$16=""}if($13==f16){$16=$6;$13=$14=$15=""}{print $1,$2,$3,$4,$5,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16}}' "$j" >> ./work1/test2YA.csv

done;

awk -F" " 'BEGIN{print "ID","Title","FirstName","MiddleName","LastName","FinalMarks","Status","Username","Campus","Code","Programme","Year","course1","course2","course3","course4"}{print}' ./work1/test2YA.csv >> ./work1/test3YA.csv

5 answers:

Answer 0 (score: 1)

Here is a solution for GNU awk:

course.awk

BEGIN { # setup field width for constant field splitting
        FIELDWIDTHS = "2 2 12 7 1 7 1 7 1 7 1 7"
        # setup sort order (by id)
        PROCINFO["sorted_in"] = "@ind_num_asc"
      }

NR == 1 { # print header
          print
          next
        }

      {
        # add ids to names
        names[ $1 ] = $3

        # store under id and course number the mark if it is present
        for( c = 1; c <= 5; c++ ) {
          field = 2+ (c*2)
          if( $(field) !~ /^ *$/ ) {
            marks[ $1, c ] = $(field)
          }
        }
      }

END   {
        # output
        for( id in names ) {
          printf("%-4s%-12s%7s %7s %7s %7s %7s\n",id, names[ id ], marks[ id, 1], marks[ id, 2], marks[ id, 3], marks[ id, 4], marks[ id, 5])
        }
      }

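Use it like this: awk -f course.awk your_file

The input is not tab-separated but has fixed column widths, which makes a few things less than obvious:

  • The printf format uses %Ns conversions, where each N is derived from FIELDWIDTHS.
  • FIELDWIDTHS accounts for the empty columns between Id and Name, between Course1 and Course2, ...
  • Checking whether a mark is present: if( $(field) !~ /^ *$/ ) is true only when the field does not consist entirely of spaces.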
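
To see why fixed widths matter here, it helps to look at what default whitespace splitting does to a line that carries a single mark: the line has only three fields, so the field number no longer says which course the mark belongs to. The following is a minimal sketch in portable awk rather than the answer's own method. The helper name course_of is hypothetical, and the layout (a 16-character Id/Name prefix followed by 8-character course columns) is an assumption read off the sample data.

```shell
# Hypothetical helper (not from the answer): report how many fields default
# splitting sees on a line, and which course column its mark sits in.
# Assumes a 16-character Id/Name prefix followed by 8-character columns.
course_of() {
  printf '%s\n' "$1" | awk '{
    rest = substr($0, 17)      # drop the Id/Name prefix
    match(rest, /[^ ]/)        # RSTART = position of the first non-blank
    printf "fields=%d course=%d mark=%s\n", NF, int((RSTART - 1) / 8) + 1, $NF
  }'
}

# Build a line whose only mark ends in the last character of Course4.
course_of "$(printf '%-16s%32s' '1   John' '74')"
```

The call prints fields=3 course=4 mark=74: the mark is always the third field, and only its character position recovers the course. FIELDWIDTHS in GNU awk gets the same information by splitting on position in the first place.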

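Answer 1 (score: 0)

This might be an approximation in awk (it collects each student's marks in the order they appear, so they are not aligned to their course columns):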

NR==1{
    for(x=1;x<=NF;x++)
    {
        head=head $x"\t";
    }
    print head
}
NR>1{
    for(i=3;i<=NF;i++)
    {
        students[$1"\t"$2]=students[$1"\t"$2] "\t"$i;
    }
}
END{
    for (stu in students)
    {
        print stu,students[stu];
    }
}

Id      Name    Course1 Course2 Course3 Course4 Course5
5       TERRY   67
4       Alex    64      60      57      49      69
1       John    55      74      67      81      34
6       HELEN   39
3       Emma    63      69      80      75      78
2       George  63      64      29      91      89

Answer 2 (score: 0)

Same idea, perhaps simpler:

$ awk 'BEGIN{ FIELDWIDTHS="16 8 8 8 8 8"} 
       NR==1{print;next} 
        NR>1{keys[$1]; 
             for(i=2;i<=6;i++) 
                {gsub(" ","",$i); 
                 if($i) a[$1,i]=$i}} 
         END{for(k in keys) 
                {printf "%16s",k; 
                 for(i=2;i<=6;i++) printf "%-8s",a[k,i]; 
                 print ""}}' file


Id  Name        Course1 Course2 Course3 Course4 Course5
3   Emma        63      80      75      78      69
4   Alex        57      69      64      60      49
6   HELEN                       39
5   TERRY               67
1   John        55      67      81      74      34
2   George      29      64      89      91      63

You can also sort the output by piping it to sort -n:

... | sort -n

Id  Name        Course1 Course2 Course3 Course4 Course5
1   John        55      67      81      74      34
2   George      29      64      89      91      63
3   Emma        63      80      75      78      69
4   Alex        57      69      64      60      49
5   TERRY               67
6   HELEN                       39
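
Piping everything through sort -n works here because the non-numeric header line compares as 0 and all the IDs are positive, so the header stays on top. If that cannot be relied on (for example, an Id of 0), one option is to pass the header through untouched and sort only the rest. A minimal sketch, where the function name sort_keep_header is hypothetical:

```shell
# Hypothetical helper: sort data rows numerically by Id, leaving the
# first line of stdin (the header) in place.
sort_keep_header() {
  IFS= read -r header        # consume exactly one line from stdin
  printf '%s\n' "$header"
  sort -n                    # sorts whatever remains on stdin
}
```

It would be used as the last stage of the pipeline: ... | sort_keep_header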

Answer 3 (score: 0)

Using GNU awk for FIELDWIDTHS, true 2D arrays, and sorted_in:

$ cat tst.awk
NR==1 {
    print
    split($0,f,/\S+\s*/,s)
    for (i=1;i in s;i++) {
        w[i] = length(s[i])
        FIELDWIDTHS = FIELDWIDTHS (i>1?" ":"") w[i]
    }
    next
}
{
    sub(/\s*$/,"  ")
    for (i=1;i<=NF;i++) {
        if ($i ~ /\S/) {
            val[$1][i] = $i
        }
    }
}
END {
    PROCINFO["sorted_in"] = "@ind_num_asc"
    for (id in val) {
        for (i=1;i<=NF;i++) {
            printf "%*s", w[i], val[id][i]
        }
        print ""
    }
}

$ awk -f tst.awk file
Id  Name        Course1 Course2 Course3 Course4 Course5
1   John            55      67      81      74     34
2   George          29      64      89      91     63
3   Emma            63      80      75      78     69
4   Alex            57      69      64      60     49
5   TERRY                   67
6   HELEN                           39
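
The script above derives the column widths from the header line: the four-argument split() in GNU awk stores each "word plus trailing blanks" chunk in the separator array s, and the length of each chunk becomes one width. The same measurement can be sketched in portable awk with match() and substr(). This is an illustration only; the function name header_widths is hypothetical, and it treats plain spaces as the only blanks.

```shell
# Hypothetical helper: print the width of each header column, where a
# column is a run of non-blanks plus the spaces that follow it.
header_widths() {
  printf '%s\n' "$1" | awk '{
    line = $0
    out = ""
    # Repeatedly peel one "word plus trailing spaces" chunk off the front.
    while (match(line, /^[^ ]+ */)) {
      out = out (out == "" ? "" : " ") RLENGTH
      line = substr(line, RLENGTH + 1)
    }
    print out
  }'
}

header_widths "$(printf '%-4s%-12s%-8s%-8s%s' Id Name Course1 Course2 Course3)"
```

The last column has no trailing spaces in the header, so its measured width is just the length of its name; for this three-course header the call prints 4 12 8 8 7.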

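Answer 4 (score: 0)

Here's my take on this. It works with plain awk (no FIELDWIDTHS), and it adjusts automatically to a different number of fields (i.e. add a Course7 column and you should be fine). In addition, you can point it at multiple files, and each file should be processed separately.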

#!/usr/bin/awk -f

# Print the rows collected for the current file (in arbitrary order),
# then reset the state for the next file.
#
function dump(    id, i) {
  for (id in name) {          # Step through the records...
    printf("%-16s", id)
    for (i=0; i<max; i++) {   # Step through the fields...
      printf("%-8s", score[id, i])
    }
    printf("\n")
  }
  delete name
  delete score
  max = 0
}

# On the first record of each input file, output the previous file's
# rows (if any) and print the header.
#
FNR == 1 {
  if (NR > 1) dump()
  print
  next
}

# Process each line.
#
{
  id = substr($0, 1, 16)     # Fixed-width Id/Name prefix (awk strings are 1-based)
  name[id]                   # Store the unique identifier in an array
  pos = 0

  # Step through the 8-character score fields until we hit the end of
  # the line, storing non-blank scores in another array.
  do {
    mark = substr($0, 17+pos*8, 8)
    if (mark ~ /[^ ]/) score[id, pos] = mark + 0
  } while (17+(++pos)*8 < length())
}

# Keep track of our maximum number of fields
pos>max { max=pos }

# Finally, output the rows of the last file.
END { dump() }

It's a bit longer, but I think that makes it easier to understand what it does.