I have a large genome file that I need to split into small .txt files.
The sequence looks like this:
>supercont1.1 of Geomyces destructans 20631-21
AGATTTTCTTAATAACTTGTTCAATGTGTGTTCAAATGATATGCCGTGATGTATGTAGCA
TAAACAGATGTAGTAGAAGAGTTTGCAGCAATCGTTGAGTAGTATTGCTTCTGTTGTTGG
>supercont1.2 of Geomyces destructans 20631-21
AGATTTTCTTAATAACTTGTTCAATGTGTGTTCAAATGATATGCCGTGATGTATGTAGCA
TAAACAGATGTAGTAGAAGAGTTTGCAGCAATCGTTGAGTAGTATTGCTTCTGTTGTTGG
TAAACAGATGTAGTAGAAGAGTTTGCAGCAATCGTTGAGTAGTATTGCTTCTGTTGTTGG
>supercont1.3 of Geomyces destructans 20631-21
AGATTTT (...)
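It should be split into small files named "1.1-Geomyces-destructans-20631-21", "1.2-Geomyces ......" and so on, each filled with the genome data of its record.
Here is the code after help from @JimMischel: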
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.IO;

namespace genom1
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        string filter = "Textové soubory|*.txt|Soubory FASTA|*.fasta|Všechny soubory|*.*";

        private void doit_Click(object sender, EventArgs e)
        {
            bar.Value = 0;
            OpenFileDialog opf = new OpenFileDialog();
            // filter for choosing file types
            opf.Filter = filter;
            string lineo = "error"; // test
            if (opf.ShowDialog() == DialogResult.OK)
            {
                var lineCount = 0;
                using (var reader = File.OpenText(opf.FileName))
                {
                    while (reader.ReadLine() != null)
                    {
                        lineCount++;
                    }
                }
                bar.Maximum = lineCount;
                bar.Step = 1;
                FolderBrowserDialog fbd = new FolderBrowserDialog();
                fbd.Description = "Vyber složku, do které chceš rozdělit načtený soubor: \n\n" + opf.FileName; // dialog desc
                if (fbd.ShowDialog() == DialogResult.OK)
                {
                    List<string> lines = new List<string>();
                    foreach (var line in File.ReadLines(opf.FileName))
                    {
                        bar.PerformStep();
                        if (line[0] == '>')
                        {
                            if (lines.Count >= 0)
                            {
                                // write contents of lines list to file
                                //quicker replace for better file name
                                StringBuilder prep = new StringBuilder(line);
                                prep.Replace(">supercont", "");
                                prep.Replace("of", "");
                                prep.Replace(" ", "-");
                                lineo = prep.ToString();
                                // append or writeall? how to writeall lines without append?
                                //System.IO.File.WriteAllText(fbd.SelectedPath + "\\" + lineo + ".txt", lineo);
                                StreamWriter SW;
                                SW = File.AppendText(fbd.SelectedPath + "\\" + lineo + ".txt");
                                foreach (string s in lines)
                                {
                                    SW.WriteLine(s);
                                }
                                SW.Close();
                                // and clear the list.
                                lines.Clear();
                            }
                        }
                        lines.Add(line);
                    }
                    // here, do the last part
                    if (lines.Count >= 0)
                    {
                        // write contents of lines list to file.
                        /* starts being little buggy here...
                        StreamWriter SW;
                        SW = File.AppendText(fbd.SelectedPath + "\\" + lineo + ".txt");
                        foreach (string s in lines)
                        {
                            SW.WriteLine(s);
                        }
                        SW.Close();
                        */
                    }
                }
            }
        }
    }
}
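Answer 0 (score: 2)
If the file is small enough to fit in memory, you can call File.ReadAllText to read it into a string. You then walk through the string and extract the text between the > characters. Something like: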
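// requires: using System.IO;
string s = File.ReadAllText("filename");
int pos = s.IndexOf('>');
while (pos != -1)
{
    int newpos = s.IndexOf('>', pos + 1);
    // the last record has no following '>', so it runs to the end of the string
    int end = (newpos == -1) ? s.Length : newpos;
    // text is the record without its leading '>' and without the next record's '>'
    string text = s.Substring(pos + 1, end - pos - 1);
    // now write text to a file
    // update current position
    pos = newpos;
}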
I assume you can figure out how to name the files properly.
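For the naming step, here is a minimal sketch of one way to turn a header line into the file name the question asks for. The class and method names (FastaNaming, ToFileName) are made up for illustration, and the code assumes every header has the form ">supercont<number> of <organism>", as in the sample data.

```csharp
using System;
using System.IO;

public static class FastaNaming
{
    // Hypothetical helper: turns ">supercont1.1 of Geomyces destructans 20631-21"
    // into "1.1-Geomyces-destructans-20631-21".
    public static string ToFileName(string header)
    {
        string name = header.TrimStart('>').Trim();

        // Assumption: headers start with the literal prefix "supercont".
        const string prefix = "supercont";
        if (name.StartsWith(prefix, StringComparison.Ordinal))
        {
            name = name.Substring(prefix.Length);
        }

        // Split into words, drop the standalone word "of", and join with hyphens.
        // Working on whole words avoids removing "of" from inside other words.
        string[] words = name.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        words = Array.FindAll(words, w => w != "of");
        name = string.Join("-", words);

        // Replace any character the file system does not allow in a file name.
        foreach (char c in Path.GetInvalidFileNameChars())
        {
            name = name.Replace(c, '_');
        }
        return name;
    }
}
```

Working on whole words rather than calling Replace("of", "") on the raw header avoids two side effects: a double hyphen where " of " used to be, and damage to any organism name that happens to contain the letters "of".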
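If you can't fit the whole file into memory, you can read it character by character or do some kind of buffering. The problem is easier if you know that > is always at the start of a line. Then you can write: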
List<string> lines = new List<string>();
foreach (var line in File.ReadLines("filename"))
{
    // StartsWith is safe on empty lines, where line[0] would throw
    if (line.StartsWith(">"))
    {
        if (lines.Count > 0)
        {
            // write contents of lines list to file.
            // and clear the list.
            lines.Clear();
        }
    }
    lines.Add(line);
}
// here, do the last part
if (lines.Count > 0)
{
    // write contents of lines list to file.
}
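As a rough sketch of how the "write contents to file" steps could be filled in, the variant below skips the intermediate list and writes each line straight to the output file of the current record. This is one possible approach, not the only one. FastaSplitter, Split and the nameFor parameter are hypothetical names; nameFor stands for whatever function you use to derive a file name from a header line.

```csharp
using System;
using System.IO;

public static class FastaSplitter
{
    // Splits inputPath into one .txt file per record inside outputDir.
    // nameFor is a caller-supplied (hypothetical) function that maps a header
    // line to a file name without extension. Returns the number of files written.
    public static int Split(string inputPath, string outputDir, Func<string, string> nameFor)
    {
        Directory.CreateDirectory(outputDir);
        int count = 0;
        StreamWriter writer = null;
        try
        {
            // File.ReadLines streams the file, so it never has to fit in memory.
            foreach (string line in File.ReadLines(inputPath))
            {
                if (line.StartsWith(">", StringComparison.Ordinal))
                {
                    // A new record begins: close the previous file and open the next one.
                    if (writer != null) writer.Dispose();
                    string path = Path.Combine(outputDir, nameFor(line) + ".txt");
                    writer = new StreamWriter(path, false); // false = overwrite, do not append
                    count++;
                }
                // Lines that appear before the first header are skipped.
                if (writer != null) writer.WriteLine(line);
            }
        }
        finally
        {
            // Closes the file of the last record, so no special "last part" step is needed.
            if (writer != null) writer.Dispose();
        }
        return count;
    }
}
```

A call could look like FastaSplitter.Split(inputFile, targetFolder, header => header.Substring(1).Replace(' ', '-')). Opening the writer with append set to false means that running the split twice overwrites the old output instead of doubling it, which is the difference from File.AppendText.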
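Answer 1 (score: 1)
I'd say the simplest way is to read the whole file first with File.ReadAllText(). Then just use String.Split('>'), which returns an array of what I think would be the contents of the new files. Note that Split consumes the > separator, and that the first element is empty when the file starts with >.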
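A minimal sketch of the Split approach, assuming the content starts with > and that > appears only at the start of header lines. FastaText and SplitRecords are hypothetical names; the string you pass in would come from File.ReadAllText.

```csharp
using System;
using System.Collections.Generic;

public static class FastaText
{
    // Returns one string per record, each starting with its '>' header line.
    // Assumption: content begins with '>' and '>' occurs only in header lines.
    public static List<string> SplitRecords(string content)
    {
        var records = new List<string>();
        // RemoveEmptyEntries drops the empty piece in front of the first '>'.
        foreach (string piece in content.Split(new[] { '>' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Split consumes the separator, so put the '>' back.
            records.Add(">" + piece);
        }
        return records;
    }
}
```

Each returned string can then be written with File.WriteAllText. This only works when the whole file fits in memory, since both the original string and all the pieces are held at once.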