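I have hundreds of text files, each containing thousands of list elements. Three simplified representative files are shown below (here the line elements are colours).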
group1.txt
red
blue
red
green
pink
red
group2.txt
yellow
brown
cyan
yellow
brown
red
violet
orange
group3.txt
orange
violet
pink
cyan
grey
I can create a sorted count table with the following script -
awk -F '\t' '{print $1}' * | sort | uniq -c | sort -nr
4 red
2 yellow
2 violet
2 pink
2 orange
2 cyan
2 brown
1 grey
1 green
1 blue
I would like to create a contingency table like the following -
Colour group1 group2 group3
red 3 1 0
green 1 0 0
blue 1 0 0
yellow 0 2 0
orange 0 1 1
grey 0 0 1
violet 0 1 1
pink 1 0 1
brown 0 2 0
cyan 0 1 1
How can I create this contingency table using awk, python, perl or R?
Answer 0: (score: 5)
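Here is a solution in R.
Set up the files (this is only so that we have an example to work with; it is not part of the actual mechanism for building the contingency table):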
writeLines(c("red","blue","red","green","pink","red"),
con="group1.txt")
writeLines(c("yellow","brown","cyan","yellow","brown","red",
"violet","orange"),
con="group2.txt")
writeLines(c("orange","violet","pink","cyan","grey"),
con="group3.txt")
Most of the work is in reading in and arranging the data. Assume we know the file names have the form groupNN.txt, where NN is a number ...
flist <- list.files(pattern="^group[0-9]+\\.txt$")
grpnames <- gsub("\\.txt$","",flist)
Read the colour files:
col_list <- lapply(flist,scan,what="character")
Build a matching vector of group IDs (each group name repeated once per element read from its file):
grpvec <- rep(grpnames,sapply(col_list,length))
Now just use table:
table(col=unlist(col_list),grp=grpvec)
##         grp
## col      group1 group2 group3
##   blue        1      0      0
##   brown       0      2      0
##   cyan        0      1      1
##   green       1      0      0
##   grey        0      0      1
##   orange      0      1      1
##   pink        1      0      1
##   red         3      1      0
##   violet      0      1      1
##   yellow      0      2      0
(The rows are in alphabetical order; I'm not sure how much that matters to you ...)
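For readers who prefer Python (one of the languages the question mentions), the same idea can be sketched with the standard library: build two parallel sequences, one of colours and one of group labels, and count the pairs. This is only a minimal sketch, not a translation of the R code. The names `crosstab` and `groups` are made up for illustration, and the sample data is inlined rather than read from the group*.txt files. Rows are sorted by total count, descending, in case alphabetical order is not what is wanted.

```python
from collections import Counter

# Sample data inlined for illustration; in practice each list would be
# read from the corresponding groupNN.txt file.
groups = {
    "group1": ["red", "blue", "red", "green", "pink", "red"],
    "group2": ["yellow", "brown", "cyan", "yellow", "brown", "red",
               "violet", "orange"],
    "group3": ["orange", "violet", "pink", "cyan", "grey"],
}


def crosstab(groups):
    """Return (column_names, rows); each row is (colour, [count per group])."""
    # Two parallel sequences, like unlist(col_list) and grpvec in the R code.
    colours = [c for items in groups.values() for c in items]
    labels = [g for g, items in groups.items() for _ in items]
    pair_counts = Counter(zip(colours, labels))
    totals = Counter(colours)
    names = list(groups)
    # Sort by total count (descending), then alphabetically to break ties.
    ordered = sorted(totals, key=lambda c: (-totals[c], c))
    # A Counter returns 0 for pairs that never occurred.
    rows = [(c, [pair_counts[(c, g)] for g in names]) for c in ordered]
    return names, rows


names, rows = crosstab(groups)
print("Colour", *names)
for colour, counts in rows:
    print(colour, *counts)
```

Changing the sort key in `ordered` is enough to get a different row order, for example plain `sorted(totals)` for the alphabetical order that table produces.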
Answer 1: (score: 4)
awk to the rescue!
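$ awk 'FNR==1 {c++}                      # new input file: advance the column index
       {counts[$1,c]++; keys[$1]}        # count (colour, file) pairs; remember each colour
       END {print "Colour Group1 Group2 Group3"
            for (k in keys) {
                printf "%s ", k
                for (i=1; i<=c; i++) printf "%s ", counts[k,i]+0   # +0 prints 0 for unseen pairs
                print ""
            }
       }' group{1,2,3}.txt |
  column -t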
Colour Group1 Group2 Group3
red 3 1 0
pink 1 0 1
orange 0 1 1
blue 1 0 0
violet 0 1 1
yellow 0 2 0
grey 0 0 1
cyan 0 1 1
brown 0 2 0
green 1 0 0
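The script above keys its counts on the pair (colour, file index) and relies on `+0` to print a zero for pairs that never occurred. The row order is whatever `for (k in keys)` yields, which awk does not guarantee, and the header is hard-coded. A rough Python sketch of the same bookkeeping follows, with rows emitted in first-seen order and the header taken from the supplied names. The function name `contingency_lines` and its `sources` parameter are hypothetical; it takes (name, lines) pairs so it can be tried without files on disk. Unlike the awk version, it skips blank lines.

```python
from collections import defaultdict


def contingency_lines(sources):
    """Build table lines from an iterable of (name, iterable_of_lines) pairs."""
    counts = defaultdict(int)   # (colour, column index) -> count
    first_seen = {}             # dicts keep insertion order (Python 3.7+)
    names = []
    for idx, (name, lines) in enumerate(sources):
        names.append(name)
        for line in lines:
            fields = line.split()
            if not fields:      # skip blank lines (awk would count them as "")
                continue
            colour = fields[0]  # first whitespace-separated field, like $1
            counts[colour, idx] += 1
            first_seen.setdefault(colour, None)
    out = [" ".join(["Colour", *names])]
    for colour in first_seen:
        # .get(..., 0) plays the role of counts[k,i]+0 in the awk script
        cells = [str(counts.get((colour, i), 0)) for i in range(len(names))]
        out.append(" ".join([colour, *cells]))
    return out


# Small inline demo; real input would be the lines of each group file.
demo = [("group1", ["red", "blue", "red"]), ("group2", ["blue", "green"])]
print("\n".join(contingency_lines(demo)))
```

As with the awk version, the output can be piped through column -t to align the columns.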
Answer 2: (score: 1)
Using GNU awk for true multi-dimensional arrays, gensub() and ARGIND:
$ cat tst.awk
{ cnt[$0][ARGIND]++ }
END {
    printf "%s%s", "Colour", OFS
    for (groupNr=1; groupNr<=ARGIND; groupNr++) {
        printf "%s%s", gensub(/\.[^.]+$/,"",1,ARGV[groupNr]), (groupNr<ARGIND ? OFS : ORS)
    }
    for (colour in cnt) {
        printf "%s%s", colour, OFS
        for (groupNr=1; groupNr<=ARGIND; groupNr++) {
            printf "%d%s", cnt[colour][groupNr], (groupNr<ARGIND ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk group1.txt group2.txt group3.txt | column -t
Colour group1 group2 group3
orange 0 1 1
cyan 0 1 1
brown 0 2 0
grey 0 0 1
red 3 1 0
yellow 0 2 0
violet 0 1 1
pink 1 0 1
green 1 0 0
blue 1 0 0
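Two details of tst.awk may be easier to see in isolation: the gensub() call turns a file name such as group1.txt into the column header group1 by removing the final extension, and cnt is a nested array indexed first by colour and then by group. The Python sketch below illustrates just those two pieces. The helper names `strip_extension` and `nested_counts` are invented for this example, and the input is an in-memory mapping rather than real files.

```python
import re
from collections import defaultdict


def strip_extension(filename):
    """Drop the final '.ext' suffix, as the gensub() call in tst.awk does."""
    return re.sub(r"\.[^.]+$", "", filename, count=1)


def nested_counts(files):
    """files maps a file name to its lines; returns colour -> {group: count}."""
    cnt = defaultdict(lambda: defaultdict(int))
    for filename, lines in files.items():
        group = strip_extension(filename)
        for line in lines:
            cnt[line][group] += 1   # the whole line is the key, like $0
    return cnt


cnt = nested_counts({"group1.txt": ["red", "red", "pink"],
                     "group3.txt": ["pink", "grey"]})
print(strip_extension("group1.txt"), dict(cnt["red"]))
```

Only the last extension is removed, so a name like archive.tar.gz would become archive.tar, which matches the behaviour of the regular expression used in the awk script.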