Creating a contingency table

Asked: 2016-08-10 12:17:26

Tags: r unix awk

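I have hundreds of text files, each containing a list of items (thousands of lines per file). Three simplified, representative files are shown below; here the items on each line are colours.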

group1.txt

red
blue
red
green
pink
red

group2.txt

yellow
brown
cyan
yellow
brown
red
violet
orange

group3.txt

orange
violet
pink
cyan
grey

I can create a sorted table of counts with the following command:

awk -F '\t' '{print $1}' * | sort | uniq -c | sort -nr

  4 red
  2 yellow
  2 violet
  2 pink
  2 orange
  2 cyan
  2 brown
  1 grey
  1 green
  1 blue

I would like to create a contingency table like this:

Colour  group1  group2  group3
red     3       1       0
green   1       0       0
blue    1       0       0
yellow  0       2       0
orange  0       1       1
grey    0       0       1
violet  0       1       1
pink    1       0       1
brown   0       2       0
cyan    0       1       1

How can I create this contingency table using awk, Python, Perl, or R?

3 answers:

Answer 0 (score: 5)

Here is a solution in R.

Set up the files (this only gives us an example to work with; it is not part of the actual mechanism for building the contingency table):

writeLines(c("red","blue","red","green","pink","red"),
           con="group1.txt")
writeLines(c("yellow","brown","cyan","yellow","brown","red",
             "violet","orange"),
           con="group2.txt")
writeLines(c("orange","violet","pink","cyan","grey"),
           con="group3.txt")

Most of the work is in reading and arranging the data. Assume we know the files are named groupNN.txt, where NN is a number:

flist <- list.files(pattern="^group[0-9]+\\.txt$")
grpnames <- gsub("\\.txt$","",flist)

Read in the colour files:

col_list <- lapply(flist,scan,what="character")

Build a matching vector of group IDs, one entry per colour read:

grpvec <- rep(grpnames,sapply(col_list,length))

Now just use table():

table(col=unlist(col_list),grp=grpvec)
##     grp
## col      group1 group2 group3
##   blue        1      0      0
##   brown       0      2      0
##   cyan        0      1      1
##   green       1      0      0
##   grey        0      0      1
##   orange      0      1      1
##   pink        1      0      1
##   red         3      1      0
##   violet      0      1      1
##   yellow      0      2      0

(The rows come out in alphabetical order; I'm not sure how much that matters to you.)
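
Since the question also asks about Python, the same three steps (read each file, build a parallel vector of group IDs, cross-tabulate) can be sketched with the standard library alone. This is only an illustrative sketch, not part of the R answer: the function names `read_groups` and `cross_tabulate` are made up for this example, and it assumes one item per line with blank lines ignored.

```python
from collections import Counter
from pathlib import Path


def read_groups(paths):
    """Map each group name (file name without extension) to its list of items."""
    groups = {}
    for path in map(Path, paths):
        # Assumption: one item per line; blank lines are skipped.
        lines = path.read_text().splitlines()
        groups[path.stem] = [line.strip() for line in lines if line.strip()]
    return groups


def cross_tabulate(groups):
    """Return (row_labels, col_labels, rows) for an item-by-group table."""
    # Flatten to two parallel vectors, like unlist(col_list) and grpvec in R.
    items = [item for members in groups.values() for item in members]
    grpvec = [name for name, members in groups.items() for _ in members]

    counts = Counter(zip(items, grpvec))  # (item, group) -> frequency
    row_labels = sorted(set(items))       # alphabetical, as table() prints them
    col_labels = list(groups)
    rows = [[counts[(item, grp)] for grp in col_labels] for item in row_labels]
    return row_labels, col_labels, rows


if __name__ == "__main__":
    # Sample data from the question, inlined so the sketch runs without files.
    sample = {
        "group1": ["red", "blue", "red", "green", "pink", "red"],
        "group2": ["yellow", "brown", "cyan", "yellow", "brown", "red",
                   "violet", "orange"],
        "group3": ["orange", "violet", "pink", "cyan", "grey"],
    }
    row_labels, col_labels, rows = cross_tabulate(sample)
    print("\t".join(["Colour", *col_labels]))
    for label, row in zip(row_labels, rows):
        print("\t".join([label, *map(str, row)]))
```

With real files, something like `cross_tabulate(read_groups(sorted(Path(".").glob("group*.txt"))))` should give the same table; note that a plain `sorted` orders names as strings, so group10 would come before group2.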

Answer 1 (score: 4)

awk to the rescue!

$ awk 'FNR==1{c++} 
             {counts[$1,c]++; keys[$1]} 
          END{print "Colour Group1 Group2 Group3"; 
              for(k in keys) {printf "%s ",k; 
                              for(i=1;i<=c;i++) printf "%s ", counts[k,i]+0;
                              print ""}}' group{1,2,3}.txt | 
  column -t

Colour  Group1  Group2  Group3
red     3       1       0
pink    1       0       1
orange  0       1       1
blue    1       0       0
violet  0       1       1
yellow  0       2       0
grey    0       0       1
cyan    0       1       1
brown   0       2       0
green   1       0       0

Answer 2 (score: 1)

Using GNU awk for true multi-dimensional arrays, gensub(), and ARGIND:

$ cat tst.awk
{ cnt[$0][ARGIND]++ }
END {
    printf "%s%s", "Colour", OFS
    for (groupNr=1; groupNr<=ARGIND; groupNr++) {
        printf "%s%s", gensub(/\.[^.]+$/,"",1,ARGV[groupNr]), (groupNr<ARGIND ? OFS : ORS)
    }

    for (colour in cnt) {
        printf "%s%s", colour, OFS
        for (groupNr=1; groupNr<=ARGIND; groupNr++) {
            printf "%d%s", cnt[colour][groupNr], (groupNr<ARGIND ? OFS : ORS)
        }
    }
}

$ awk -f tst.awk group1.txt group2.txt group3.txt | column -t
Colour  group1  group2  group3
orange  0       1       1
cyan    0       1       1
brown   0       2       0
grey    0       0       1
red     3       1       0
yellow  0       2       0
violet  0       1       1
pink    1       0       1
green   1       0       0
blue    1       0       0
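
The row order above is arbitrary because awk does not define the traversal order of `for (colour in cnt)`. For comparison, here is a rough Python sketch of the same nested-count idea, `cnt[colour][group]`, that keeps rows in first-seen order and pads the columns much as `column -t` does. The helper names `count_by_group` and `format_table` are hypothetical, and the input is assumed to arrive as (file name, lines) pairs.

```python
import re
from collections import Counter, defaultdict


def count_by_group(named_streams):
    """Build cnt[colour][group] from (file_name, lines) pairs."""
    cnt = defaultdict(Counter)  # dicts keep insertion order: first-seen rows
    groups = []
    for file_name, lines in named_streams:
        # Strip the last extension, like gensub(/\.[^.]+$/, "", 1, ARGV[i]).
        group = re.sub(r"\.[^.]+$", "", file_name)
        groups.append(group)
        for line in lines:
            colour = line.strip()
            if colour:
                cnt[colour][group] += 1
    return cnt, groups


def format_table(cnt, groups, header="Colour"):
    """Render the counts with space-padded columns, similar to `column -t`."""
    table = [[header, *groups]]
    for colour, per_group in cnt.items():
        # A Counter returns 0 for a group in which the colour never appeared.
        table.append([colour, *(str(per_group[g]) for g in groups)])
    widths = [max(len(row[i]) for row in table) for i in range(len(table[0]))]
    return "\n".join(
        "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()
        for row in table
    )


if __name__ == "__main__":
    # Sample data from the question, inlined so the sketch runs without files.
    streams = [
        ("group1.txt", ["red", "blue", "red", "green", "pink", "red"]),
        ("group2.txt", ["yellow", "brown", "cyan", "yellow", "brown", "red",
                        "violet", "orange"]),
        ("group3.txt", ["orange", "violet", "pink", "cyan", "grey"]),
    ]
    cnt, groups = count_by_group(streams)
    print(format_table(cnt, groups))
```

To process real files, each pair could be `(path, open(path))`, since an open file iterates over its lines.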