I have a large FASTA file containing thousands of protein sequences. I want to split this file into multiple files.
I am using ActivePerl for my project.
Answer 0 (score: 1)
How many sequences do you need per file?
You could do something like this:
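#!/usr/bin/perl
use strict;
use warnings;

my $fasta_file    = "something.fasta";
my $seqs_per_file = 100;   # whatever your batch size is
my $file_number   = 1;     # output files are named like "something.fasta.1"
my $seq_ctr       = 0;
my $out;                   # current output handle, undefined until the first header

open(my $fasta, '<', $fasta_file) or die "can't open $fasta_file: $!";
while (my $line = <$fasta>) {
    # start a new output file at the first header and after every full batch
    if ($line =~ /^>/ && $seq_ctr++ % $seqs_per_file == 0) {
        close($out) if $out;
        my $out_name = "$fasta_file." . $file_number++;
        open($out, '>', $out_name) or die "can't open $out_name: $!";
    }
    print {$out} $line if $out;   # ignore anything before the first header
}
close($out) if $out;
close($fasta);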
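Answer 1 (score: 0)
You can easily use awk instead of Perl. The one-liner below writes each sequence to its own file, named after the sequence ID, with any characters that are unsafe in file names replaced by underscores.
awk '/^>/ { if (file) close(file); file = substr($1, 2); gsub(/[^A-Za-z0-9._-]/, "_", file); file = file ".txt" } file { print > file }' your_fasta_file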
Answer 2 (score: 0)
This code is in Java. I don't mind if the moderators remove it from here, but maybe it helps. :)
/**
 * This tool splits a FASTA file into a given number of files, each holding
 * roughly the same number of sequences.
 */
package devtools.utilities;

import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.lang3.StringUtils;

/**
 * @author Arpit
 */
public class FileChopper {

    public void chopFile(String fileName, int numOfFiles) throws IOException {
        String outFileName = StringUtils.substringBefore(fileName, ".fasta");
        // Read the whole file. An IOException propagates to the caller instead
        // of being swallowed and causing a NullPointerException further down.
        byte[] allBytes = Files.readAllBytes(Paths.get(fileName));
        String allLines = new String(allBytes, StandardCharsets.UTF_8);
        // Split in front of every '>' that starts a line, so each element is
        // one complete record (header line plus its sequence lines).
        String[] records = allLines.split("(?m)^(?=>)");
        int chunkSize = records.length / numOfFiles;
        int startIndex = 0;
        for (int j = 0; j < numOfFiles; j++) {
            // The last file also takes the remainder of the integer division.
            int stopIndex = (j == numOfFiles - 1) ? records.length : startIndex + chunkSize;
            try (FileWriter fw = new FileWriter(outFileName + "_" + j + ".fasta")) {
                for (int i = startIndex; i < stopIndex; i++) {
                    fw.write(records[i]);
                }
            }
            startIndex = stopIndex;
        }
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        FileChopper fc = new FileChopper();
        try {
            fc.chopFile("H:\\Projects\\Lactobacillus rhamnosus\\Hypothetical proteins sequence 405 LR24.fasta", 5);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
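One thing to be aware of with the Java loop above: every file gets the integer quotient of records and the last file takes the whole remainder, so 10 records split into 4 files come out as 2, 2, 2 and 4. If a more even split is wanted, the index ranges could be computed along these lines. This is only a sketch in Python (already used elsewhere in this thread); `chunk_bounds` is a hypothetical helper name, not part of the answer's code.

```python
def chunk_bounds(n_records, n_files):
    """Return (start, stop) index pairs that split n_records into n_files
    contiguous chunks whose sizes differ by at most one."""
    if n_files < 1:
        raise ValueError("n_files must be at least 1")
    base, extra = divmod(n_records, n_files)
    bounds = []
    start = 0
    for i in range(n_files):
        # The first `extra` chunks take one additional record each.
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds
```

Each `(start, stop)` pair plays the role of `startIndex` and `stopIndex` in the Java loop. When there are fewer records than files, the trailing pairs are empty ranges.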
Answer 3 (score: -2)
I know you said you want it in Perl, but I have done this many times with Python and BioPython, which I believe is comparable to BioPerl (but better :).
import os
import sys

from Bio import SeqIO


def write_file(input_file, split_number):
    # base name of the input file, used to build the output file names
    parent_file_base_name = os.path.splitext(input_file)[0]
    counter = 1
    # our first file name
    file = parent_file_base_name + "_" + str(counter) + ".fasta"
    # carries all of our records to be written
    joiner = []
    # enumerate huge fasta
    for num, record in enumerate(SeqIO.parse(input_file, "fasta"), start=1):
        # append records to our list holder; record.description is the full
        # header line, whereas record.id would drop everything after the first space
        joiner.append(">" + record.description + "\n" + str(record.seq))
        # if we have reached the maximum number of records for that file,
        # write them out and then clear the record holder
        if num % split_number == 0:
            joiner.append("")
            with open(file, 'w') as f:
                f.write("\n".join(joiner))
            # change file name, clear record holder, and change the file count
            counter += 1
            file = parent_file_base_name + "_" + str(counter) + ".fasta"
            joiner = []
    # write whatever is left over after the last full batch
    if joiner:
        joiner.append("")
        with open(file, 'w') as f:
            f.write("\n".join(joiner))


if __name__ == "__main__":
    input_file = sys.argv[1]
    split_number = int(sys.argv[2])  # command-line arguments arrive as strings
    write_file(input_file, split_number)
    print("fasta_splitter.py is finished.")
Just run it with:
python script.py parent_fasta.fasta <how many records per file>
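If installing BioPython is not an option, the same batching can be done with the standard library alone by streaming the file line by line and switching output files on header lines. The following is a minimal sketch, not a replacement for a real FASTA parser: `split_fasta` is a hypothetical name, and it assumes that every record starts with a line beginning with `>` and that the output names follow the `<base>_<n>.fasta` pattern used above.

```python
import os


def split_fasta(input_path, records_per_file):
    """Stream input_path and write records_per_file records to each output.

    Returns the list of output paths that were written.
    """
    if records_per_file < 1:
        raise ValueError("records_per_file must be at least 1")
    base = os.path.splitext(input_path)[0]
    outputs = []
    out = None
    count = 0
    try:
        with open(input_path) as src:
            for line in src:
                if line.startswith(">"):
                    # Open the next output file at the start of each batch.
                    if count % records_per_file == 0:
                        if out is not None:
                            out.close()
                        path = "%s_%d.fasta" % (base, len(outputs) + 1)
                        out = open(path, "w")
                        outputs.append(path)
                    count += 1
                if out is not None:  # skip anything before the first header
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return outputs
```

Because lines are copied through unchanged, headers and sequence line wrapping stay exactly as they are in the input, and only one line is held in memory at a time.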