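I'm trying to find consecutive rows in a tab-delimited file that share the same values in two fields (columns 1 and 4), and to extract specific columns from the first and last row of each such block. Rows should only be grouped when those fields are identical and the column 4 value is above 0.
If the same pair of values ($1 and $4) shows up again elsewhere in the file, separated by other rows, each occurrence needs to be treated as a separate block.
Example input: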
tmp1 153446387 153446446 -0.2 1.0888042
tmp1 153446925 153446973 0 0.87891006
tmp1 153451902 153451951 1.43854 1.2709045
tmp1 153454056 153454105 1.43854 1.4132746
tmp1 153456192 153456250 1.43854 0.87553155
tmp1 153458717 153458776 1.335858 1.1829022
tmp1 153460782 153460841 1.335858 0.006651476
tmp1 153462035 153462094 0 0.13484457
tmp1 153463690 153463749 1.43854 0.45511296
tmp1 153467589 153467673 1.43854 1.4431274
tmp1 153467873 153468632 0.31841 1.70443
tmp1 154451904 154451951 1.43854 1.3709045
tmp1 154454054 154454109 1.43854 1.132746
tmp1 154456194 154456259 1.43854 0.8553
tmp2 153472147 153472194 1.43854 0.99288875
tmp2 153476511 153476559 0 0.99288875
Expected output:
tmp1 153451902 153456250 1.43854
tmp1 153458717 153460841 1.335858
tmp1 153463690 153467673 1.43854
tmp1 154451904 154456259 1.43854
tmp2 153472147 153472194 1.43854
Any ideas on how to solve this?
Answer 0 (score: 2)
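awk '
BEGIN {OFS = FS = "\t"}              # tab-separated input and output
# Print one finished block: name, first start, last end, shared value.
function output(key,    ary) {       # ary is a local scratch array
    split(key, ary, FS)
    print ary[1], start, end, ary[2]
}
$4 <= 0 {next}                       # skip rows whose value is not above 0
key != $1 FS $4 {                    # columns 1 and 4 changed: a new block starts
    if (key != "") {output(key)}     # flush the previous block, if there is one
    key = $1 FS $4
    start = $2
}
{end = $3}                           # the last row seen sets the block end
END {if (key != "") {output(key)}}   # flush the final block
' filename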
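
One caveat: the script above skips rows with $4 <= 0 via `next` without closing the current block, so two runs with the same $1 and $4 that are separated only by such rows would be reported as one block. The following is a minimal sketch of a variant that closes the block on those rows instead, run against the sample from the question. The file name `sample.tsv`, the shell variable `prog` and the function name `emit` are placeholders chosen for this example, and the input is assumed to be tab-separated with the five columns shown in the question.

```shell
# Rebuild the sample from the question as a tab-separated file.
# sample.tsv is a hypothetical name; printf reuses the format for every 5 arguments.
printf '%s\t%s\t%s\t%s\t%s\n' \
    tmp1 153446387 153446446 -0.2 1.0888042 \
    tmp1 153446925 153446973 0 0.87891006 \
    tmp1 153451902 153451951 1.43854 1.2709045 \
    tmp1 153454056 153454105 1.43854 1.4132746 \
    tmp1 153456192 153456250 1.43854 0.87553155 \
    tmp1 153458717 153458776 1.335858 1.1829022 \
    tmp1 153460782 153460841 1.335858 0.006651476 \
    tmp1 153462035 153462094 0 0.13484457 \
    tmp1 153463690 153463749 1.43854 0.45511296 \
    tmp1 153467589 153467673 1.43854 1.4431274 \
    tmp1 153467873 153468632 0.31841 1.70443 \
    tmp1 154451904 154451951 1.43854 1.3709045 \
    tmp1 154454054 154454109 1.43854 1.132746 \
    tmp1 154456194 154456259 1.43854 0.8553 \
    tmp2 153472147 153472194 1.43854 0.99288875 \
    tmp2 153476511 153476559 0 0.99288875 \
    > sample.tsv

# The awk program is kept in a shell variable so it can be reused.
prog='
BEGIN { FS = OFS = "\t" }
# Print the open block (if any) and mark that no block is open.
function emit() {
    if (key != "") print name, start, end, value
    key = ""
}
$4 <= 0 { emit(); next }          # a non-positive value ends the block
key != $1 FS $4 {                 # columns 1 and 4 changed: start a new block
    emit()
    key = $1 FS $4; name = $1; value = $4; start = $2
}
{ end = $3 }                      # extend the open block to this row
END { emit() }
'

result=$(awk "$prog" sample.tsv)
printf '%s\n' "$result"
```

On the sample this prints six lines rather than the five in the question: it also reports `tmp1 153467873 153468632 0.31841`, a single-row block with a value above 0. The expected output leaves that row out while keeping the single-row tmp2 block, and the rule as stated does not explain the difference, so the sketch keeps both.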