Question

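I have a file reshaping problem that I think can be solved with a one-liner, but I'm new to sed and awk (and to Stack Overflow!). I would normally lose patience and do it in R, but I think it would be useful to have this kind of command for later.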

I have a txt file with "clusters", like this:

>Cluster 15425
0   1096aa, >d7719f16-11db-48c4-... *
>Cluster 15426
0   1096aa, >fd7eacf9-37cd-4b40-... *
1   436aa, >cfd4b1b0-30df-471e-... at 80.28%
2   413aa, >5992f56b-0269-4add-... at 86.68%
3   395aa, >d3be5814-b2e8-41fe-... at 89.37%
4   239aa, >9e25fbb9-9f6c-4f52-... at 80.33%
>Cluster 15427
0   1096aa, >6c8790d1-5a8b-42d4-... *
>Cluster 15428
0   1096aa, >0c00bc15-51aa-4676-... *
>Cluster 15429
0   1096aa, >1d8ab161-3aab-45a0-... *
>Cluster 15430
0   1096aa, >ef6694d2-a0e6-4bd1-... *
1   410aa, >313eee0a-e8c0-4e8c-... at 84.63%

which should be read as:

Cluster n°15425 has one item, called >d7719f16-11db-48c4-...

Cluster n°15426 has 5 items, called >fd7eacf9-37cd-4b40-..., >cfd4b1b0-30df-471e-..., and so on up to >9e25fbb9-9f6c-4f52-...

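What I want is to take this file as input and write every cluster that has more than X items to another file. Intuitively, it should look for lines starting with ">" and print a cluster when the number of lines between two such lines is greater than X.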

For X = 1, the output file should contain:

>Cluster 15426
0   1096aa, >fd7eacf9-37cd-4b40-... *
1   436aa, >cfd4b1b0-30df-471e-... at 80.28%
2   413aa, >5992f56b-0269-4add-... at 86.68%
3   395aa, >d3be5814-b2e8-41fe-... at 89.37%
4   239aa, >9e25fbb9-9f6c-4f52-... at 80.33%
>Cluster 15430
0   1096aa, >ef6694d2-a0e6-4bd1-... *
1   410aa, >313eee0a-e8c0-4e8c-... at 84.63%

(only clusters n°15426 and 15430 have more than one item)

Thanks for your help!

Answer 1

Based on the samples shown, you could try the following, written and tested in GNU awk. It reads Input_file twice.

awk '
FNR==NR{
  if($0~/^>/){
    ++count
    header[count]=$0
  }
  else{
    a[count]++
    b[count]=(b[count]?b[count] ORS:"")$0
  }
  next
}
/^>/ && a[++count1]>1{
  print header[count1] ORS b[count1]
}
'  Input_file  Input_file

Explanation: a detailed, commented version of the above.

awk '                                           ##Starting awk program from here.
FNR==NR{                                        ##Checking condition FNR==NR which will be TRUE when first time Input_file is being read.
  if($0~/^>/){                                  ##Checking condition if line starts with > then do following.
    ++count                                     ##Increment 1 with count here.
    header[count]=$0                            ##Creating header array with index of count and its value is current line.
  }
  else{                                         ##mentioning else of above here.
    a[count]++                                  ##Creating array a with index of count and keep increasing its value with 1.
    b[count]=(b[count]?b[count] ORS:"")$0       ##Creating array b with index of count and keep concatenating its values with new line here.
  }
  next                                          ##next will skip all further statements from here.
}
/^>/ && a[++count1]>1{                          ##Checking condition if line starts from > AND value of array a with index of count1 is greater than 1 then do following.
  print header[count1] ORS b[count1]            ##Printing header with index count1 and array b with index of count1 here.
}
'  Input_file Input_file                        ##Mentioning Input_file names here.
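
Because the first pass already stores every header and member line in arrays, a single read may be enough. Below is a minimal sketch of that variant, not the answerer's code: the function name clusters_over and the variable x (the threshold, hard-coded as 1 above) are made up for illustration, and the whole file is still held in memory.

```shell
# Hypothetical helper (not from the answer): print clusters with more than
# $1 member lines, reading standard input once.
clusters_over() {
  awk -v x="$1" '
    /^>/  { ++count; header[count] = $0; next }   # header line starts a new cluster
    count {                                       # member line of the current cluster
      a[count]++
      b[count] = (a[count] > 1 ? b[count] ORS : "") $0
    }
    END {
      for (i = 1; i <= count; i++)                # keep the original cluster order
        if (a[i] > x) print header[i] ORS b[i]
    }
  '
}

# Demo on two clusters taken from the question; prints only the >Cluster 15430 block.
clusters_over 1 <<'EOF'
>Cluster 15425
0   1096aa, >d7719f16-11db-48c4-... *
>Cluster 15430
0   1096aa, >ef6694d2-a0e6-4bd1-... *
1   410aa, >313eee0a-e8c0-4e8c-... at 84.63%
EOF
```

Reading from standard input means the function can be used as clusters_over 1 < Input_file > Output_file.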

Answer 2

Another awk solution, which needs multi-character RS support (e.g. gawk).

$ awk -F'\n' -v RS='\n>' 'NF>2{printf "%s", rt $0} {rt=RT}' file

>Cluster 15426
0   1096aa, >fd7eacf9-37cd-4b40-... *
1   436aa, >cfd4b1b0-30df-471e-... at 80.28%
2   413aa, >5992f56b-0269-4add-... at 86.68%
3   395aa, >d3be5814-b2e8-41fe-... at 89.37%
4   239aa, >9e25fbb9-9f6c-4f52-... at 80.33%
>Cluster 15430
0   1096aa, >ef6694d2-a0e6-4bd1-... *
1   410aa, >313eee0a-e8c0-4e8c-... at 84.63%

which can be simplified to

$ awk -F'\n' -v RS='\n>' 'NF>2{print ">" $0}' file

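Note that the first option prints an extra newline at the start of the output, and the second one prints an extra newline at the end.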
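
One way to avoid the extra newline is to normalise each record before counting its fields. The sketch below is an assumption-laden variant of the simplified command, not the answerer's code: the function name clusters_over_rs and the variable x are invented here. It relies on the first record still carrying its leading ">" and the last record still carrying the file's final newline, and it still needs an awk that accepts a multi-character RS.

```shell
# Hypothetical helper (not from the answer): one record per cluster, split on "\n>".
# Prints clusters with more than $1 member lines, reading standard input.
clusters_over_rs() {
  awk -F'\n' -v RS='\n>' -v x="$1" '
    {
      sub(/^>/, "")     # only the first record still starts with ">"
      sub(/\n$/, "")    # only the last record still ends with a newline
    }
    NF > x + 1 { print ">" $0 }   # NF counts the header line plus the members
  '
}

# Demo on two clusters taken from the question; prints only the >Cluster 15430 block.
clusters_over_rs 1 <<'EOF'
>Cluster 15425
0   1096aa, >d7719f16-11db-48c4-... *
>Cluster 15430
0   1096aa, >ef6694d2-a0e6-4bd1-... *
1   410aa, >313eee0a-e8c0-4e8c-... at 84.63%
EOF
```

Stripping the trailing newline also keeps a one-item cluster at the very end of the file from being counted as having an extra, empty field.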

Answer 3

The following perl solution works.

perl -ne '
    BEGIN { $N = 1 }
    if (/^>/) {
        print @b if @b > $N+1;
        @b = ();
    }
    push @b, $_;
    END {
        print @b if @b > $N+1
    }' input_file

An awk solution using the same approach:

awk '
    BEGIN { N = 1 }
    /^>/ {
        if (nb>N+1) for (i=0; i<nb; i++) print b[i];
        nb = 0; delete b;
    }
    { b[nb++]= $0; }
    END {
        if (nb>N+1) for (i=0; i<nb; i++) print b[i];
    }' input_file

Answer 4

$ cat tst.awk
/^>/ { prt() }
{ rec = (cnt++ ? rec ORS : "") $0 }
END { prt() }

function prt() {
    if ( cnt > (x+1) ) {
        print rec
    }
    rec = cnt = ""
}


$ awk -v x=1 -f tst.awk file
>Cluster 15426
0   1096aa, >fd7eacf9-37cd-4b40-... *
1   436aa, >cfd4b1b0-30df-471e-... at 80.28%
2   413aa, >5992f56b-0269-4add-... at 86.68%
3   395aa, >d3be5814-b2e8-41fe-... at 89.37%
4   239aa, >9e25fbb9-9f6c-4f52-... at 80.33%
>Cluster 15430
0   1096aa, >ef6694d2-a0e6-4bd1-... *
1   410aa, >313eee0a-e8c0-4e8c-... at 84.63%
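
Since the question asks for the result to go into another file, the same buffer-and-flush logic could be wrapped in a small shell function. This is only a sketch: the name filter_to_file and its argument order are assumptions, and the awk program is inlined so the block does not depend on tst.awk being present.

```shell
# Hypothetical wrapper (not from the answer): filter_to_file X INPUT OUTPUT
# Writes to OUTPUT every cluster of INPUT that has more than X member lines.
filter_to_file() {
  awk -v x="$1" '
    /^>/ { prt() }                          # a header closes the previous cluster
    { rec = (cnt++ ? rec ORS : "") $0 }     # buffer the header and its members
    END { prt() }                           # flush the last cluster
    function prt() {
      if (cnt > x + 1) print rec            # cnt includes the header, hence x + 1
      rec = ""; cnt = 0
    }
  ' "$2" > "$3"
}
```

A call such as filter_to_file 1 file big_clusters.txt would then leave only the two multi-item clusters in big_clusters.txt (the output file name is just an example).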

Answer 5

You didn't provide any script of your own, so I won't give you a complete answer, but I can give you a start: to count the lines of a file, you can use wc -l:

wc -l file.txt
12 file.txt

You can use awk to keep only the first part of the result (the number of lines).
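
A minimal sketch of that step, assuming a hypothetical helper named line_count (the name is not from the answer):

```shell
# Hypothetical helper: print only the number of lines in the file given as $1.
line_count() {
  wc -l "$1" | awk '{ print $1 }'   # keep the first field, drop the file name
}
```

For the 12-line file.txt above, line_count file.txt would print 12. Feeding the file through standard input (wc -l < file.txt) also omits the file name, although some wc implementations pad the number with spaces.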

If you are interested in the number of lines that match a particular condition, you can combine grep and wc -l:

grep <something> file.txt | wc -l
3

(obviously, this assumes that <something> matches three lines in file.txt)
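
For the cluster file in the question, the condition of interest would presumably be a leading ">". As a sketch, with count_clusters as a made-up helper name, grep -c gives the same count as piping grep into wc -l:

```shell
# Hypothetical helper: count the cluster header lines (lines starting with ">")
# in the file given as $1. grep -c prints the number of matching lines.
count_clusters() {
  grep -c '^>' "$1"
}
```

On the sample input from the question this would report 6 clusters.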

This should give you a good start for your script.

Print sections with more than n lines
