I have a large tab-delimited file with around 80 columns that looks like this:
184
2
P 2853263 4998463
SS
AG0001-C
T/T C/C A/A
AG0002-C
T/T C/C A/T
AG0003-C
T/T C/C A/A
AG0004-C
T/T C/C T/A
I want to replace the slash character ("/") with a newline, so that the contents of each column are split across two rows, like this:
184
2
P 2853263 4998463
SS
AG0001-C
T C A
T C A
AG0002-C
T C A
T C T
AG0003-C
T C A
T C A
AG0004-C
T C T
T C A
Answer 0 (score: 3)
For input like this (with no leading tab before the first column):
184
2
P 2853263 4998463
SS
AG0001-C
T/T C/C A/A
AG0002-C
T/T C/C A/T
AG0003-C
T/T C/C A/A
AG0004-C
T/T C/C T/A
this script should work with mawk:
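#!/usr/bin/awk -f
# Print the four header lines and the odd-numbered ID lines unchanged.
NR <= 4 || NR % 2 { print; next; }
{
    # Split each field on "/" and store the pieces in a[] under "column|row".
    rows = 0
    for (i = 1; i <= NF; ++i) {
        count = split($i, b, /\//)
        if (count > rows) {
            rows = count
        }
        for (j = 1; j <= count; ++j) {
            key = i "|" j
            a[key] = b[j]
        }
    }
    # Print one tab-separated output row per piece.
    for (i = 1; i <= rows; ++i) {
        key = 1 "|" i
        printf("%s", a[key])
        for (j = 2; j <= NF; ++j) {
            key = j "|" i
            printf("\t%s", a[key])
        }
        print ""
    }
    # Clear the array before the next record.
    for (i in a) {
        delete a[i]
    }
}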
Output:
184
2
P 2853263 4998463
SS
AG0001-C
T C A
T C A
AG0002-C
T C A
T C T
AG0003-C
T C A
T C A
AG0004-C
T C T
T C A
It should also work with differently formatted input, such as fields that hold varying numbers of values:
184
2
P 2853263 4998463
SS
AG0001-C
A/A/C/X/Y/Z T/T C/C A/A A/A/C/X A/A/B A/A/C/X/Y
AG0002-C
T/T C/C A/T
AG0003-C
T/T C/C A/A
AG0004-C
T/T C/C T/A
Output:
184
2
P 2853263 4998463
SS
AG0001-C
A T C A A A A
A T C A A A A
C       C B C
X       X   X
Y           Y
Z
AG0002-C
T C A
T C T
AG0003-C
T C A
T C A
AG0004-C
T C T
T C A
For input with a leading tab on the left:
184
2
P 2853263 4998463
SS
AG0001-C
T/T C/C A/A
AG0002-C
T/T C/C A/T
AG0003-C
T/T C/C A/A
AG0004-C
T/T C/C T/A
this code
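#!/usr/bin/awk -f
# Print the four header lines and the odd-numbered ID lines unchanged.
NR <= 4 || NR % 2 { print; next; }
{
    # Split each field on "/" and store the pieces in a[] under "column|row".
    rows = 0
    for (i = 1; i <= NF; ++i) {
        count = split($i, b, /\//)
        if (count > rows) {
            rows = count
        }
        for (j = 1; j <= count; ++j) {
            key = i "|" j
            a[key] = b[j]
        }
    }
    # Prefix every field, including the first, with a tab.
    for (i = 1; i <= rows; ++i) {
        for (j = 1; j <= NF; ++j) {
            key = j "|" i
            printf("\t%s", a[key])
        }
        print ""
    }
    # Clear the array before the next record.
    for (i in a) {
        delete a[i]
    }
}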
will output
184
2
P 2853263 4998463
SS
AG0001-C
T C A
T C A
AG0002-C
T C A
T C T
AG0003-C
T C A
T C A
AG0004-C
T C T
T C A
Answer 1 (score: 2)
GNU awk solution:
$ awk '/[/]/{print $1,$3,$5;print $2,$4,$6;next}1' FS='/|[ \t]+' OFS='\t' file
184
2
P 2853263 4998463
SS
AG0001-C
T C A
T C A
AG0002-C
T C A
T C T
AG0003-C
T C A
T C A
AG0004-C
T C T
T C A
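Since the file in the question has around 80 columns, listing field numbers by hand does not scale. A minimal sketch of a loop-based variant follows. It assumes tab-separated input in which every field containing a slash holds exactly two values, and split_alleles is a hypothetical helper name that is not part of the original answer.

```shell
# Hypothetical helper: split every "x/y" field into two output rows.
split_alleles() {
    awk -F'\t' -v OFS='\t' '
        # Header and ID lines contain no slash and pass through unchanged.
        !/\// { print; next }
        {
            top = bottom = ""
            for (i = 1; i <= NF; i++) {
                # pair[1] is the value before the slash, pair[2] the value after it.
                split($i, pair, "/")
                sep = (i > 1 ? OFS : "")
                top = top sep pair[1]
                bottom = bottom sep pair[2]
            }
            print top
            print bottom
        }
    ' "$@"
}
```

For example, `split_alleles file` would print the transformed file to standard output. Because an empty leading field is kept as an empty field, a leading tab on the data rows should be preserved as well.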
Answer 2 (score: 1)
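Using sed (this breaks each row after its third value instead of separating the two sides of each slash, so the result differs from the requested output):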
$ sed -e "s|/|\t|g" -e "s/\([^\t]*\t[^\t]*\t[^\t]*\)\t\(.*\)/\1\n\2/" inputfile
184
2
P 2853263 4998463
SS
AG0001-C
T T C
C A A
AG0002-C
T T C
C A T
AG0003-C
T T C
C A A
AG0004-C
T T C
C T A
Answer 3 (score: 0)
This might work for you (GNU sed):
sed '/\//!b;h;s|/.||g;G;s|./||g' file
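For each line containing a /, copy the line to the hold space. Delete every / together with the character that follows it. Then append the copied line and, in that copy, delete every / together with the character that precedes it.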
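To see the two substitutions in isolation, here is a small sketch that applies each one to a single sample row taken from the question. The variable names are illustrative only.

```shell
line='T/T C/C A/T'
# Step 1: delete each slash and the character after it (keeps the first values).
first=$(printf '%s\n' "$line" | sed 's|/.||g')
# Step 2: delete each slash and the character before it (keeps the second values).
second=$(printf '%s\n' "$line" | sed 's|./||g')
printf '%s\n%s\n' "$first" "$second"
```

This prints T C A followed by T C T. Since . matches exactly one character, the approach assumes a single-character value on each side of every slash.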