Split a column into rows

Date: 2013-08-29 09:24:07

Tags: linux sed awk

I have a large tab-delimited file with around 80 columns, like this:

184     
2       
P   2853263 4998463
SS      
AG0001-C        
T/T      C/C      A/A
AG0002-C        
T/T      C/C      A/T   
AG0003-C        
T/T      C/C      A/A   
AG0004-C         
T/T      C/C      T/A

I want to replace the slash character ("/") with a newline, so that the content of each column is split across two rows, like this:

184     
2       
P   2853263 4998463
SS      
AG0001-C        
T        C         A
T        C         A
AG0002-C        
T        C         A
T        C         T
AG0003-C         
T        C         A
T        C         A
AG0004-C        
T        C         T
T        C         A

4 answers:

Answer 0 (score: 3)

For input like this (no leading tab before the first column):

184
2
P   2853263 4998463
SS
AG0001-C
T/T C/C A/A
AG0002-C
T/T C/C A/T
AG0003-C
T/T C/C A/A
AG0004-C
T/T C/C T/A

This script should work with mawk:

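#!/usr/bin/awk -f

# Print the four header lines and every odd-numbered line (the sample IDs) unchanged.
NR <= 4 || NR % 2 { print; next; }
{
    # Split each field on "/" and store the parts in a[column "|" row].
    rows = 0
    for (i = 1; i <= NF; ++i) {
        count = split($i, b, /\//)
        if (count > rows) {
            rows = count    # the field with the most parts sets the row count
        }
        for (j = 1; j <= count; ++j) {
            key = i "|" j
            a[key] = b[j]
        }
    }
    # Emit one tab-separated line per row; missing parts print as empty cells.
    for (i = 1; i <= rows; ++i) {
        key = 1 "|" i
        printf("%s", a[key])
        for (j = 2; j <= NF; ++j) {
            key = j "|" i
            printf("\t%s", a[key])
        }
        print ""
    }
    # Clear the array before the next record.
    for (i in a) {
        delete a[i]
    }
}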

Output:

184
2
P   2853263 4998463
SS
AG0001-C
T   C   A
T   C   A
AG0002-C
T   C   A
T   C   T
AG0003-C
T   C   A
T   C   A
AG0004-C
T   C   T
T   C   A

It should also work when the columns hold different numbers of slash-separated values:

184
2
P   2853263 4998463
SS
AG0001-C
A/A/C/X/Y/Z T/T C/C A/A A/A/C/X A/A/B   A/A/C/X/Y
AG0002-C
T/T C/C A/T
AG0003-C
T/T C/C A/A
AG0004-C
T/T C/C T/A

Output:

184
2
P   2853263 4998463
SS
AG0001-C
A   T   C   A   A   A   A
A   T   C   A   A   A   A
C               C   B   C
X               X       X
Y                       Y
Z                       
AG0002-C
T   C   A
T   C   T
AG0003-C
T   C   A
T   C   A
AG0004-C
T   C   T
T   C   A

For input with a leading tab on each line:

    184
    2
    P   2853263 4998463
    SS
    AG0001-C
    T/T C/C A/A
    AG0002-C
    T/T C/C A/T
    AG0003-C
    T/T C/C A/A
    AG0004-C
    T/T C/C T/A

this script:

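#!/usr/bin/awk -f

# Print the four header lines and every odd-numbered line (the sample IDs) unchanged.
NR <= 4 || NR % 2 { print; next; }
{
    # Split each field on "/" and store the parts in a[column "|" row].
    rows = 0
    for (i = 1; i <= NF; ++i) {
        count = split($i, b, /\//)
        if (count > rows) {
            rows = count
        }
        for (j = 1; j <= count; ++j) {
            key = i "|" j
            a[key] = b[j]
        }
    }
    # Prefix every cell, including the first, with a tab to keep the leading tab.
    for (i = 1; i <= rows; ++i) {
        for (j = 1; j <= NF; ++j) {
            key = j "|" i
            printf("\t%s", a[key])
        }
        print ""
    }
    # Clear the array before the next record.
    for (i in a) {
        delete a[i]
    }
}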

prints:

    184
    2
    P   2853263 4998463
    SS
    AG0001-C
    T   C   A
    T   C   A
    AG0002-C
    T   C   A
    T   C   T
    AG0003-C
    T   C   A
    T   C   A
    AG0004-C
    T   C   T
    T   C   A
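The two scripts above differ only in whether each output row starts with a tab. As a minimal sketch (not the answer author's code), the leading whitespace of each data line can be captured and reused, so one script covers both layouts. The function name `split_rows` and the array names are hypothetical, and the sketch keeps the same assumption as the scripts above: four header lines followed by alternating ID and data lines.

```shell
# Hypothetical helper: reads stdin, or the files passed as arguments.
split_rows() {
    awk '
        # Header lines and odd-numbered lines (sample IDs) pass through unchanged.
        NR <= 4 || NR % 2 { print; next }
        {
            # Remember the leading blanks or tabs so they can be put back.
            match($0, /^[ \t]*/)
            lead = substr($0, 1, RLENGTH)

            # Split every field on "/" and track the largest part count.
            rows = 0
            for (i = 1; i <= NF; i++) {
                n = split($i, part, "/")
                if (n > rows) rows = n
                for (j = 1; j <= n; j++) cell[i, j] = part[j]
            }

            # One output line per part; missing parts become empty cells.
            for (j = 1; j <= rows; j++) {
                line = lead
                for (i = 1; i <= NF; i++) {
                    line = line (i > 1 ? "\t" : "") cell[i, j]
                }
                print line
            }

            split("", cell)   # portable way to empty the array
        }
    ' "$@"
}
```

It can be called as `split_rows file` or at the end of a pipeline. The output is always tab-separated, whatever whitespace separated the input columns.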

Answer 1 (score: 2)

A GNU awk solution:

$ awk '/[/]/{print $1,$3,$5;print $2,$4,$6;next}1' FS='/|[ \t]+' OFS='\t' file
184
2
P   2853263 4998463
SS
AG0001-C
T       C       A
T       C       A
AG0002-C
T       C       A
T       C       T
AG0003-C
T       C       A
T       C       A
AG0004-C
T       C       T
T       C       A
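The one-liner above names its fields explicitly, so it only covers three columns, while the question mentions about 80. A minimal sketch of the same idea for any number of columns follows, assuming clean tab-separated input with at most two slash-separated values per column. The function name `split_alleles` is hypothetical; the awk program uses only POSIX features, so it should not need GNU awk.

```shell
# Hypothetical helper: reads stdin, or the files passed as arguments.
split_alleles() {
    awk -F'\t' -v OFS='\t' '
        # Lines without a slash pass through unchanged.
        !/\// { print; next }
        {
            top = bottom = ""
            for (i = 1; i <= NF; i++) {
                n = split($i, allele, "/")
                sep = (i > 1 ? OFS : "")
                top = top sep allele[1]
                # A column without a slash leaves an empty cell in the second row.
                bottom = bottom sep (n > 1 ? allele[2] : "")
            }
            print top
            print bottom
        }
    ' "$@"
}

# Usage: prints AG0002-C, then T C A, then T C T (tab-separated).
printf 'AG0002-C\nT/T\tC/C\tA/T\n' | split_alleles
```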

Answer 2 (score: 1)

Using sed:

$ sed -e "s|/|\t|g" -e "s/\([^\t]*\t[^\t]*\t[^\t]*\)\t\(.*\)/\1\n\2/" inputfile
184
2
P   2853263 4998463
SS
AG0001-C
T   T   C   
C   A   A   
AG0002-C
T   T   C   
C   A   T   
AG0003-C
T   T   C   
C   A   A   
AG0004-C
T   T   C   
C   T   A   

Answer 3 (score: 0)

This might work for you (GNU sed):

sed '/\//!b;h;s|/.||g;G;s|./||g' file

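For each line that contains a /, copy the line to the hold space. Delete every / together with the character that follows it, which leaves the first values. Then append the copied line and delete every / together with the character that precedes it, which leaves the second values on the appended line.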
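To make the two substitutions easier to follow, here is a small sketch that applies each one separately to a single sample line. In the real command both run on the same pattern space; they are split here for illustration only. The variable names are hypothetical. Because `.` matches exactly one character, the approach assumes single-character values on both sides of each slash.

```shell
# One tab-separated sample line from the question.
line=$(printf 'T/T\tC/C\tA/T')

# s|/.||g removes each slash and the character after it: the first values remain.
first=$(printf '%s\n' "$line" | sed 's|/.||g')

# s|./||g removes each slash and the character before it: the second values remain.
second=$(printf '%s\n' "$line" | sed 's|./||g')

# Prints T C A on the first row and T C T on the second (tab-separated).
printf '%s\n%s\n' "$first" "$second"
```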