Question

I have a large fasta file containing several thousand protein sequences. I want to split this file into multiple files.

I am using ActivePerl for my project.

Answer 1

How many sequences do you need per file?

You could do something like this:

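#!/usr/bin/perl
use strict;
use warnings;

my $fasta_file    = "something.fasta";
my $seqs_per_file = 100;    # whatever your batch size

my $file_number = 1;        # our files will be named like "something.fasta.1"
my $seq_ctr     = 0;
my $out;                    # current output filehandle, undefined until the first header

open(my $fasta, '<', $fasta_file) or die "can't open $fasta_file: $!";

while (my $line = <$fasta>) {

    # a line starting with ">" begins a new record
    if ($line =~ /^>/) {

        # open a new file if we've printed enough to one file
        if ($seq_ctr++ % $seqs_per_file == 0) {
            close($out) if $out;
            my $out_name = "$fasta_file." . $file_number++;
            open($out, '>', $out_name) or die "can't open $out_name: $!";
        }

    }

    # anything before the first header is skipped, since no file is open yet
    print {$out} $line if $out;

}

close($out) if $out;
close($fasta);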

Answer 2

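You can easily use awk instead of perl. This writes each sequence to its own file, named after the sequence ID:

awk '/^>/ { if (file) close(file); file = substr($1, 2) ".txt" } file { print > file }' your_fasta_file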

Answer 3

This code is in Java. I don't mind if the admins delete it from here. But in case it helps. :)

/**
 * This tool chops a FASTA file into a given number of parts,
 * each holding roughly the same number of sequences.
 */
package devtools.utilities;

import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.lang3.StringUtils;

/**
 * @author Arpit
 */
public class FileChopper {

    public void chopFile(String fileName, int numOfFiles) throws IOException {
        String outFileName = StringUtils.substringBefore(fileName, ".fasta");

        // Reads the whole file into memory; an IOException is passed on to the caller
        byte[] allBytes = Files.readAllBytes(Paths.get(fileName));
        String allLines = new String(allBytes, StandardCharsets.UTF_8);

        // Using a clever cheat with help from stackoverflow:
        // mark every ">" with "~" and split on it, so each element is one record.
        // This assumes "~" never occurs in the file and ">" only starts header lines.
        String cheatString = allLines.replace(">", "~>");
        String[] splitLines = StringUtils.split(cheatString, "~");

        int startIndex = 0;
        int stopIndex = 0;
        for (int j = 0; j < numOfFiles; j++) {
            if (j == (numOfFiles - 1)) {
                // the last file takes whatever is left over
                stopIndex = splitLines.length;
            } else {
                stopIndex = stopIndex + (splitLines.length / numOfFiles);
            }
            // try-with-resources closes the writer even if a write fails
            try (FileWriter fw = new FileWriter(outFileName + "_" + j + ".fasta")) {
                for (int i = startIndex; i < stopIndex; i++) {
                    fw.write(splitLines[i]);
                }
            }
            startIndex = stopIndex;
        }
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        FileChopper fc = new FileChopper();
        try {
            fc.chopFile("H:\\Projects\\Lactobacillus rhamnosus\\Hypothetical proteins sequence 405 LR24.fasta", 5);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
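To see how the records end up distributed by the Java code above, here is a small Python sketch of the same index arithmetic. The function name `chunk_bounds` is hypothetical and only serves this illustration; it is not part of the answer's code.

```python
def chunk_bounds(num_records, num_files):
    """Return one (start, stop) index pair per output file."""
    base = num_records // num_files  # integer division, as in the Java code
    bounds = []
    start = 0
    for j in range(num_files):
        # every file gets `base` records, except the last, which takes the rest
        stop = num_records if j == num_files - 1 else start + base
        bounds.append((start, stop))
        start = stop
    return bounds


print(chunk_bounds(12, 5))  # [(0, 2), (2, 4), (4, 6), (6, 8), (8, 12)]
```

So with 12 sequences and 5 files, the first four files get 2 sequences each and the last one gets 4. If there are fewer sequences than files, the earlier files come out empty.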

Answer 4

I know you said you want it in Perl. But I have done this many times with Python and BioPython, which I believe is comparable to BioPerl (but better :).

import sys
from Bio import SeqIO


def write_file(input_file, split_number):
    # get file_counter and base name of fasta_file
    parent_file_base_name = input_file.rsplit(".", 1)[0]
    counter = 1

    # our first file name
    out_file = parent_file_base_name + "_" + str(counter) + ".fasta"

    # carries all of our records to be written
    joiner = []
    # enumerate huge fasta
    for num, record in enumerate(SeqIO.parse(input_file, "fasta"), start=1):
        # append records to our list holder (only the record ID is kept in the header)
        joiner.append(">" + record.id + "\n" + str(record.seq))

        # if we have reached the maximum number of records for that file,
        # write them out, and then clear the record holder
        if num % split_number == 0:
            joiner.append("")
            with open(out_file, "w") as f:
                f.write("\n".join(joiner))

            # change file name, clear record holder, and change the file count
            counter += 1
            out_file = parent_file_base_name + "_" + str(counter) + ".fasta"
            joiner = []

    # write whatever is left over after the last full batch
    if joiner:
        joiner.append("")
        with open(out_file, "w") as f:
            f.write("\n".join(joiner))


if __name__ == "__main__":
    input_file = sys.argv[1]
    split_number = int(sys.argv[2])  # command-line arguments arrive as strings
    write_file(input_file, split_number)
    print("fasta_splitter.py is finished.")

Just run it with:

python script.py parent_fasta.fasta <how many records per file>
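If installing BioPython is not an option, the same batching idea can be sketched with the standard library alone. This is a minimal sketch, not a replacement for a real FASTA parser: it assumes every record starts with a line beginning with ">", and the function name `split_fasta` and the `<base>_<n>.fasta` output naming are choices made for this example. It streams the input line by line, so the whole file never has to fit in memory, and it copies header lines unchanged.

```python
import os


def split_fasta(input_path, records_per_file):
    """Copy input_path into pieces of at most records_per_file records.

    Returns the list of output file names, in the order they were written.
    """
    base, _ = os.path.splitext(input_path)
    out_names = []
    out = None
    count = 0
    try:
        with open(input_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    # start a new output file every records_per_file headers
                    if count % records_per_file == 0:
                        if out is not None:
                            out.close()
                        name = "{}_{}.fasta".format(base, len(out_names) + 1)
                        out = open(name, "w")
                        out_names.append(name)
                    count += 1
                # lines before the first header are ignored
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return out_names
```

For example, `split_fasta("parent_fasta.fasta", 100)` would write `parent_fasta_1.fasta`, `parent_fasta_2.fasta`, and so on, with the last file holding whatever remains.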

How to split a fasta file containing a large number of protein sequences into multiple files
