Copy strings from several files and paste them into a new file in bash

Time: 2019-12-04 22:25:18

Tags: bash unix awk bioinformatics fasta

I have several files containing FASTA data. They are all in the same directory and have different names.

file1

>gene1
AAAAAAAAAAAAAAAAAAAA
>gene2
GGGGGGGGGGGGGGGGGGGG

file2

>gene1
CCCCCCCCCCCCCCCCCCCC
>gene2
TTTTTTTTTTTTTTTTTTTT

I want to create a new file for each gene. The file name should be the gene name, and the content should look like this:

gene1

>file1
AAAAAAAAAAAAAAAAAAAA
>file2
CCCCCCCCCCCCCCCCCCCC

2 answers:

Answer 0 (score: 2)

Could you please try the following? It was written and tested only with the samples provided.

awk '
/^>/{
  sub(/^>/,"")
  file=$0
  print ">"FILENAME >> (file)
  next
}
{
  print >> (file)
  close(file)
}
' file*

For the samples provided, it creates two output files named gene1 and gene2, as shown below.

cat gene1
>file1
AAAAAAAAAAAAAAAAAAAA
>file2
CCCCCCCCCCCCCCCCCCCC

cat gene2
>file1
GGGGGGGGGGGGGGGGGGGG
>file2
TTTTTTTTTTTTTTTTTTTT

Explanation: a commented version of the code above.

awk '                              ## Start of the awk program.
/^>/{                              ## Run this block for lines that start with ">" (the gene headers).
  sub(/^>/,"")                     ## Remove the leading ">" from the line.
  file=$0                          ## Store the remaining text (the gene name) in the variable file.
  print ">"FILENAME >> (file)      ## Append ">" followed by the current input file name (FILENAME) to the output file named by file.
  next                             ## Skip the remaining rules for this line and move on to the next input line.
}                                  ## End of the /^>/ block.
{                                  ## Run this block for every other line, i.e. lines that do not start with ">".
  print >> (file)                  ## Append the current line to the output file named by file.
  close(file)                      ## Close the output file to avoid a "too many open files" error.
}                                  ## End of the block.
' file*                            ## Pass all input files matching file*.
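
As a variation on the approach above, here is a minimal sketch that closes the previous output file only when a new header appears, and that ignores any lines before the first header. The variable name `out` and the scratch directory are assumptions made for this illustration. It recreates the two sample inputs from the question, so it can be run as-is.

```shell
# Work in a scratch directory so the demo does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the two sample inputs from the question.
printf '>gene1\nAAAAAAAAAAAAAAAAAAAA\n>gene2\nGGGGGGGGGGGGGGGGGGGG\n' > file1
printf '>gene1\nCCCCCCCCCCCCCCCCCCCC\n>gene2\nTTTTTTTTTTTTTTTTTTTT\n' > file2

awk '
/^>/ {
  if (out != "") close(out)    # close the previous gene file before switching
  out = substr($0, 2)          # gene name: the header without its leading ">"
  print ">" FILENAME >> (out)  # record which input file the sequence came from
  next
}
out != "" { print >> (out) }   # sequence line; ignored if no header was seen yet
' file1 file2

# Show the per-gene files that were produced.
cat gene1 gene2
```

Because the output is opened with `>>`, running the command a second time in the same directory appends to the existing gene files, so old outputs should probably be removed before a rerun.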

Answer 1 (score: 1)

Your question rests on a few assumptions:

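  • Each "gene" has a header line that starts with >
  • The header is followed by one line of content (or more)
  • There may be more than 2 files and more than 2 genes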

These are the conditions any program needs in order to detect the pattern and filter/split the data.

Pseudocode

for each file in folder
  for each line in file
    if the line is a gene header (starts with >), remember the gene name as target_file_name
    otherwise, append ">" + current_file_name and current_line to target_file_name

Let me know whether this meets your needs or whether you need a fuller implementation or detailed code; either bash or awk should work.
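
The pseudocode above could be sketched in plain bash roughly as follows. This is only one possible reading of it: the names `f`, `line`, and `target` and the scratch directory are assumptions made for the illustration, and the `>filename` header is written once per gene header rather than with every content line, so a multi-line sequence stays under a single header.

```shell
# Work in a scratch directory so the demo does not touch real data.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the two sample inputs from the question.
printf '>gene1\nAAAAAAAAAAAAAAAAAAAA\n>gene2\nGGGGGGGGGGGGGGGGGGGG\n' > file1
printf '>gene1\nCCCCCCCCCCCCCCCCCCCC\n>gene2\nTTTTTTTTTTTTTTTTTTTT\n' > file2

for f in file*; do                       # for each file in the folder
  target=""                              # no gene header seen yet in this file
  while IFS= read -r line; do            # for each line in the file
    case $line in
      ">"*)                              # gene header: remember the gene name
        target=${line#">"}
        printf '>%s\n' "$f" >> "$target" # write the source file name as the new header
        ;;
      *)                                 # content line: append it to the current gene file
        if [ -n "$target" ]; then
          printf '%s\n' "$line" >> "$target"
        fi
        ;;
    esac
  done < "$f"
done

# Show the per-gene files that were produced.
cat gene1 gene2
```

A shell `while read` loop is usually much slower than awk on large FASTA files, so this form is mainly useful for seeing the logic step by step.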