Can we use AWK and gsub() to process data with multiple colons ":"? How?

Time: 2016-10-05 15:36:55

Tags: regex awk sed gsub

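In the sample data, each record is a series of column: value pairs, and some of the columns are not in the correct order...

Now, I think the correct way to import this file into a dataframe is to preprocess the data so that the output is a dataframe with NaN values, for example:

Col_01  ....  Col_20  Col_21  Col22   Col23   Col24  Col25
8       ....  25      NaN     25134   243344  NaN    NaN
17      ....  NaN     75      2       79876   73453  634534
19      ....  25      32425   NaN     989423  NaN    NaN
12      ....  25      23424   342421  7       13424  67
3       ....  95      32121   NaN     NaN     NaN    111231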

@JamesBrown showed a solution for this, using the following awk script:

BEGIN {
    PROCINFO["sorted_in"]="@ind_str_asc" # traversal order for for(i in a)
}
NR==1 {       # the header cols is in the beginning of data file
              # FORGET THIS: header cols from another file replace NR==1 with NR==FNR and see * below
    split($0,a," ")                  # mkheader a[1]=first_col ...
    for(i in a) {                    # replace with a[first_col]="" ...
        a[a[i]]
        printf "%6s%s", a[i], OFS    # output the header
        delete a[i]                  # remove a[1], a[2], ...
    }
    # next                           # FORGET THIS * next here if cols from another file UNTESTED
}
{
    gsub(/: /,"=")                   # replace key-value separator ": " with "="
    split($0,b,FS)                   # split record from ","
    for(i in b) {
        split(b[i],c,"=")            # split key=value to c[1]=key, c[2]=value
        b[c[1]]=c[2]                 # b[key]=value
    }
    for(i in a)                      # go thru headers in a[] and printf from b[]
        printf "%6s%s", (i in b?b[i]:"NaN"), OFS; print ""
}

The headers are placed in a text file, cols.txt:

Col_01 Col_20 Col_21 Col_22 Col_23 Col_25

My question now: how do we use awk if our data is not column: value but column: value1: value2: value3?

We would want the database entry to be value1: value2: value3.

Here is the new data:

Col_01:14:a:47 .... Col_20:25:i:z Col_21:23432:6:b Col_22:639142:4:x
Col_01:8:z .... Col_20:25:i:4 Col_22:25134:u:0 Col_23:243344:5:6
Col_01:17:7:z .... Col_21:75:u:q Col_23:79876:u:0 Col_25:634534:8:1

We still provide the columns beforehand in cols.txt.

How can we create a similar database structure? Is it possible to use gsub() to limit the match to the first value before the : that is the same as the header?

EDIT: This does not have to be awk-based. Any language will do, naturally.

4 answers:

Answer 0: (score: 3)

Here is another alternative...

$ awk -v OFS='\t' '{for(i=1;i<NF;i+=2)                  # iterate over name: value pairs
                     {c=$i;                             # copy name in c to modify
                      sub(/:/,"",c);                    # remove colon
                      a[NR,c]=$(i+1);                   # collect data by row number, name
                      cols[c]}}                         # save name
                END{n=asorti(cols,icols);               # sort names
                    for(j=1;j<=n;j++) printf "%s", icols[j] OFS;   # print header 
                    print ""; 
                    for(i=1;i<=NR;i++)                  # print data
                      {for(j=1;j<=n;j++) 
                         {k=icols[j];
                          printf "%s", ((i,k) in a ? a[i,k] : "NaN") OFS} # NaN only if missing; a real 0 value is kept
                       print ""}}' file | column -t     # pipe to column for pretty print

Col_01   Col_20  Col_21     Col_22      Col_23      Col_25
14:a:47  25:i:z  23432:6:b  639142:4:x  NaN         NaN
8:z      25:i:4  NaN        25134:u:0   243344:5:6  NaN
17:7:z   NaN     75:u:q     NaN         79876:u:0   634534:8:1

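Answer 1: (score: 2)

I would go with karakfa's answer as well. If the column names are not separated from the values by whitespace (for example, if you have Col_01:14:a:47), then you could do this instead (using GNU awk for the extended match function, whose third argument receives the captured groups):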

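  {
      for (i=1; i<=NF; i++) {
          # m[1] = name before the first colon, m[2] = everything after it
          if (match($i, /^([^:]+):(.*)/, m)) {   # skip fields with no colon, e.g. "...."
              a[NR,m[1]] = m[2]
              cols[m[1]]
          }
      }
  }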

The END block is the same.
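For reference, the fragment above can be fleshed out into a complete, portable variant. This is only a sketch under stated assumptions, not the answerer's code: it avoids GNU-only features by using index() and substr() to split each field at its first colon, and it takes the column order from cols.txt (as described in the question) instead of sorting the names with asorti(). The function name reshape and its two file arguments are hypothetical, and the column file is assumed to be non-empty.

```shell
# reshape COLS_FILE DATA_FILE
# Prints a header line, then one tab-separated line per input record.
reshape() {
    awk -v OFS='\t' '
    NR == FNR {                          # first file: column names, in output order
        for (i = 1; i <= NF; i++) hdr[++n] = $i
        next
    }
    {
        nrows++
        for (i = 1; i <= NF; i++) {
            p = index($i, ":")           # position of the FIRST colon only
            if (p > 1)                   # skip tokens without a name, e.g. "...."
                val[nrows, substr($i, 1, p - 1)] = substr($i, p + 1)
        }
    }
    END {
        for (j = 1; j <= n; j++)
            printf "%s%s", hdr[j], (j < n ? OFS : ORS)
        for (r = 1; r <= nrows; r++)
            for (j = 1; j <= n; j++)
                printf "%s%s", ((r, hdr[j]) in val ? val[r, hdr[j]] : "NaN"), (j < n ? OFS : ORS)
    }' "$1" "$2"
}
```

It could be called as reshape cols.txt file | column -t. Because only the first colon is treated as the separator, a value such as 14:a:47 is kept intact.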

Answer 2: (score: 1)

Using TXR Lisp's macro implementation of the Awk paradigm:

(awk (:set ft #/-?\d+/)  ;; ft is "field tokenize" (no counterpart in Awk)
     (:let (tab (hash :equal-based)) (max-col 1) (width 8))
     ((ff (mapcar toint) (tuples 2))  ;; filter fields to int and shore up into pairs
      (set max-col (max max-col [find-max [mapcar first f]]))
      (mapdo (ado set [tab ^(,nr ,@1)] @2) f)) ;; stuff data into table
     (:end (let ((headings (mapcar (opip (format nil "Col~,02a")
                                         `@{@1 width}`)
                                   (range 1 max-col))))
             (put-line `@{headings " "}`))
           (each ((row (range 1 nr)))
             (let ((cols (mapcar (opip (or [tab ^(,row ,@1)] "NaN")
                                       `@{@1 width}`)
                                 (range 1 max-col))))
               (put-line `@{cols " "}`)))))

Smaller sample data:

Col_01: 14  Col_04: 25    Col_06: 23432    Col_07: 639142
Col_02: 8   Col_03: 25    Col_05: 25134    Col_06: 243344
Col_01: 17
Col_06: 19  Col_07: 32425

Run:

$ txr reformat.tl data-small
Col01    Col02    Col03    Col04    Col05    Col06    Col07
14       NaN      NaN      25       NaN      23432    639142
NaN      8        25       NaN      25134    243344   NaN
17       NaN      NaN      NaN      NaN      NaN      NaN
NaN      NaN      NaN      NaN      NaN      19       32425

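P.S. opip is a macro which bootstraps from the op macro for partial function application; opip implicitly distributes op into its argument expressions, and then chains the resulting functions together into a functional pipeline: hence "op-pipe". Each pipeline element can refer to its own numbered implicit arguments: @1, @2, ... If they are absent, the partially applied function implicitly receives the piped object as its rightmost argument.

The ^(,row ,@1) syntax is TXR Lisp's backquote. The backtick that mainstream Lisp dialects use for backquote is already employed for string quasiquotes. This is equivalent to (list row @1): make a list consisting of the value of row and of @1, the argument of the implicit function generated by op/do. Lists of two elements are used as the hash keys, which simulates a 2D array. For that, the hash must be :equal-based. The lists (1 2) (1 2) are not eql if they are separate instances rather than one and the same object; they do compare equal under the equal function.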
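The awk answers above rely on the same idea: a[NR,c] joins the two subscripts into a single string key using the built-in SUBSEP variable, which is how awk simulates a 2D array. A minimal sketch follows; the function name demo_keys and the sample value are made up for illustration.

```shell
# demo_keys: shows that awk stores a[i,j] under one string key (i SUBSEP j)
demo_keys() {
    awk 'BEGIN {
        a[1, "Col_01"] = "14:a:47"               # key is 1 SUBSEP "Col_01"
        # the "in" test checks membership without creating the element
        print (((1, "Col_01") in a) ? "found" : "missing")
        print (((2, "Col_01") in a) ? "found" : "missing")
        for (k in a) {                           # split the key back into its parts
            split(k, part, SUBSEP)
            print part[1], part[2], a[k]
        }
    }'
}
```

Calling demo_keys prints found, then missing, then the line 1 Col_01 14:a:47. Since awk compares keys as strings, no counterpart of the :equal-based setting is needed.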

Answer 3: (score: 1)

Just for fun, some incomprehensible perl:

perl -aE'%l=%{{@F}};while(($k,$v)=each%l){$c{$k}=1;$a[$.]{$k}=$v}END{$,="\t";say@c=sort keys%c;for$i(1..$.){say map{$a[$i]{$_}//"NaN"}@c}}' input

(community wiki to hide my shame...)

Golfing away a few characters:

perl -aE'while(@F){$c{$k=shift@F}=1;$data[$.]{$k}=shift@F}END{$,="\t";say@c=sort keys%c;for$i(1..$.){say map{$data[$i]{$_}//"NaN"}@c}}' input