I have several files containing FASTA data. All files in the same directory have different names.
file1
>gene1
AAAAAAAAAAAAAAAAAAAA
>gene2
GGGGGGGGGGGGGGGGGGGG
file2
>gene1
CCCCCCCCCCCCCCCCCCCC
>gene2
TTTTTTTTTTTTTTTTTTTT
I want to create a new file for each gene. The file name will be the gene name, and the contents should look like this:
gene1
>file1
AAAAAAAAAAAAAAAAAAAA
>file2
CCCCCCCCCCCCCCCCCCCC
Answer 0 (score: 2)
Could you please try the following? It was written and tested only with the provided samples.
awk '
/^>/{
sub(/^>/,"")
file=$0
print ">"FILENAME >> (file)
next
}
{
print >> (file)
close(file)
}
' file*
For the provided samples, it creates two output files named gene1 and gene2, as shown below.
cat gene1
>file1
AAAAAAAAAAAAAAAAAAAA
>file2
CCCCCCCCCCCCCCCCCCCC
cat gene2
>file1
GGGGGGGGGGGGGGGGGGGG
>file2
TTTTTTTTTTTTTTTTTTTT
Explanation: an annotated version of the code above.
awk ' ##Start the awk program.
/^>/{ ##Match lines that start with >, i.e. the gene header lines in the samples.
sub(/^>/,"") ##Remove the leading > from the current line.
file=$0 ##Store the remaining gene name in the variable file.
print ">"FILENAME >> (file) ##Append > followed by the current input file name (FILENAME) to the output file named by file.
next ##Skip the remaining rules for this line and move on to the next input line.
} ##End of the /^>/ block.
{ ##This block runs for every input line that does not start with >.
print >> (file) ##Append the current line to the output file named by file.
close(file) ##Close the output file to avoid a "too many open files" error.
} ##End of the block.
' file* ##Pass all input files matching file*.
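Because the program writes with >>, it appends to any gene1 or gene2 file that already exists, so running it twice in the same directory repeats every record. As a minimal sketch of one way to try it without that side effect, the script below recreates the two sample files from the question in a fresh scratch directory and runs the same awk program there. The variable name workdir and the use of mktemp are choices made for this illustration, not part of the original answer.

```shell
#!/bin/sh
# Sketch: run the awk program in a fresh scratch directory so that
# the >> appends always start from empty output files.
set -eu
workdir=$(mktemp -d)   # hypothetical scratch location for this demo
cd "$workdir"

# Recreate the two sample inputs from the question.
printf '>gene1\nAAAAAAAAAAAAAAAAAAAA\n>gene2\nGGGGGGGGGGGGGGGGGGGG\n' > file1
printf '>gene1\nCCCCCCCCCCCCCCCCCCCC\n>gene2\nTTTTTTTTTTTTTTTTTTTT\n' > file2

# Same program as in the answer, unchanged.
awk '
/^>/{
sub(/^>/,"")
file=$0
print ">"FILENAME >> (file)
next
}
{
print >> (file)
close(file)
}
' file*

# Show one of the per-gene output files.
cat gene1
```

The scratch directory is left in place so the results can be inspected; remove it by hand when finished.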
Answer 1 (score: 1)
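Your problem rests on one main assumption: every gene header line starts with >. That is the condition any program can use to detect the pattern and filter or split the records.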
Pseudocode
for file in folder
    for line in file
        if the line is a gene header, save the gene name as target_file_name
        if not, append current_file_name and current_line to target_file_name
Let me know whether this meets your requirements or whether you need a further implementation or detailed code; bash or awk should both work.
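The pseudocode above could be sketched in plain POSIX shell roughly as follows. This is an illustration under stated assumptions rather than a definitive implementation: the inputs match file*, every record begins with a > header line, and gene names are valid file names. The variable names workdir, f, target, and line are introduced here. One deliberate difference from the pseudocode: the >file header is written once, when the gene header is read, rather than on every sequence line, so a sequence wrapped over several lines would get a single header.

```shell
#!/bin/sh
# Sketch of the pseudocode: one output file per gene, each record
# labelled with the name of the input file it came from.
set -eu
workdir=$(mktemp -d)   # hypothetical scratch location for this demo
cd "$workdir"

# Recreate the two sample inputs from the question.
printf '>gene1\nAAAAAAAAAAAAAAAAAAAA\n>gene2\nGGGGGGGGGGGGGGGGGGGG\n' > file1
printf '>gene1\nCCCCCCCCCCCCCCCCCCCC\n>gene2\nTTTTTTTTTTTTTTTTTTTT\n' > file2

for f in file*; do                      # for file in folder
    target=""
    # The || clause also handles a final line with no trailing newline.
    while IFS= read -r line || [ -n "$line" ]; do   # for line in file
        case $line in
            '>'*)
                # Gene header: the gene name becomes the output file name.
                target=${line#>}
                printf '>%s\n' "$f" >> "$target"
                ;;
            *)
                # Sequence line: append it to the current gene file.
                # Lines seen before any header are ignored.
                if [ -n "$target" ]; then
                    printf '%s\n' "$line" >> "$target"
                fi
                ;;
        esac
    done < "$f"
done

cat gene2
```

For large inputs the awk version is likely to be much faster, since a shell read loop processes one line at a time.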