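Split a large text file into smaller text files based on specific content

Date: 2012-04-16 23:20:45

Tags: c# save streamreader delimiter split

I have a large genome file that I need to split into smaller .txt files.

The sequences look like this: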

>supercont1.1 of Geomyces destructans 20631-21
AGATTTTCTTAATAACTTGTTCAATGTGTGTTCAAATGATATGCCGTGATGTATGTAGCA
TAAACAGATGTAGTAGAAGAGTTTGCAGCAATCGTTGAGTAGTATTGCTTCTGTTGTTGG
>supercont1.2 of Geomyces destructans 20631-21
AGATTTTCTTAATAACTTGTTCAATGTGTGTTCAAATGATATGCCGTGATGTATGTAGCA
TAAACAGATGTAGTAGAAGAGTTTGCAGCAATCGTTGAGTAGTATTGCTTCTGTTGTTGG
TAAACAGATGTAGTAGAAGAGTTTGCAGCAATCGTTGAGTAGTATTGCTTCTGTTGTTGG
>supercont1.3 of Geomyces destructans 20631-21
AGATTTT (...)

It should be split into small files named "1.1-Geomyces-destructans - 20631-21", "1.2-Geomyces ......", each filled with its genome data.

The code after help from @JimMischel looks like this:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.IO;

namespace genom1
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        string filter = "Text files|*.txt|FASTA files|*.fasta|All files|*.*";

        private void doit_Click(object sender, EventArgs e)
        {
            bar.Value = 0;

            OpenFileDialog opf = new OpenFileDialog();

            // filter for choosing file types
            opf.Filter = filter;

            string lineo = "error"; // test

            if (opf.ShowDialog() == DialogResult.OK)
            {
                var lineCount = 0;
                using (var reader = File.OpenText(opf.FileName))
                {
                    while (reader.ReadLine() != null)
                    {
                        lineCount++;
                    }
                }

                bar.Maximum = lineCount;
                bar.Step = 1;

                FolderBrowserDialog fbd = new FolderBrowserDialog();

                fbd.Description = "Choose the folder into which you want to split the loaded file: \n\n" + opf.FileName; // dialog desc
                if (fbd.ShowDialog() == DialogResult.OK)
                {
                    List<string> lines = new List<string>();
                    foreach (var line in File.ReadLines(opf.FileName))
                    {
                        bar.PerformStep();

                        if (line[0] == '>')
                        {
                           if (lines.Count >= 0)
                            {
                                // write contents of lines list to file

                                //quicker replace for better file name
                                StringBuilder prep = new StringBuilder(line);
                                prep.Replace(">supercont", "");
                                prep.Replace("of", "");
                                prep.Replace(" ", "-");
                                lineo = prep.ToString();

                                // append or writeall? how to writeall lines without append?
                                //System.IO.File.WriteAllText(fbd.SelectedPath + "\\" + lineo + ".txt", lineo);
                                StreamWriter SW;
                                SW = File.AppendText(fbd.SelectedPath + "\\" + lineo + ".txt");

                                foreach (string s in lines)
                                    {
                                        SW.WriteLine(s);
                                    }

                                SW.Close();

                                // and clear the list.
                                lines.Clear();
                            }
                        }
                        lines.Add(line);
                    }
                    // here, do the last part
                    if (lines.Count >= 0)
                    {
                        // write contents of lines list to file.

                        /* starts being little buggy here...

                        StreamWriter SW;
                        SW = File.AppendText(fbd.SelectedPath + "\\" + lineo + ".txt");
                        foreach (string s in lines)
                        {
                            SW.WriteLine(s);
                        }
                        SW.Close();

                        */
                    }
                }

            }
        }

    }
}

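2 answers:

Answer 0 (score: 2)

If the file is small enough to fit in memory, you can call File.ReadAllText to read it into a string. Then you walk through the string and extract the text between the > characters. Something like: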

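string s = File.ReadAllText("filename");
int pos = s.IndexOf('>');
while (pos != -1)
{
    int newpos = s.IndexOf('>', pos + 1);
    // the last record has no '>' after it, so it runs to the end of the string
    int end = (newpos == -1) ? s.Length : newpos;
    // text between the two '>' characters, excluding both of them
    string text = s.Substring(pos + 1, end - pos - 1);
    // now write text to a file

    // update current position; -1 ends the loop after the last record
    pos = newpos;
}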

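I assume you can figure out how to name the files properly.

If you can't fit the whole file in memory, you can read the file character by character or do some kind of buffering. If you know that > is always at the beginning of a line, the problem is easier. Then you can write: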
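As one possible way to derive the names, here is a minimal sketch that turns a header line from the question's sample into a file name. The class and method names (FastaNaming, FileNameFromHeader) are hypothetical, and the rule (drop ">supercont", drop the word "of", hyphenate spaces) is an assumption based on the question's own Replace calls.

```csharp
using System;
using System.IO;

static class FastaNaming
{
    // Hypothetical helper: maps a header line to a file name without extension.
    public static string FileNameFromHeader(string header)
    {
        string name = header.TrimStart('>').Trim();

        // Drop the "supercont" prefix used in the sample data, if present.
        const string prefix = "supercont";
        if (name.StartsWith(prefix, StringComparison.Ordinal))
            name = name.Substring(prefix.Length);

        // Remove the standalone word "of", then turn spaces into hyphens.
        name = name.Replace(" of ", " ").Replace(' ', '-');

        // Replace any character the file system does not allow in file names.
        foreach (char c in Path.GetInvalidFileNameChars())
            name = name.Replace(c, '_');

        return name;
    }
}
```

For the header ">supercont1.1 of Geomyces destructans 20631-21" this returns "1.1-Geomyces-destructans-20631-21". Replacing " of " with the surrounding spaces, rather than the bare "of", avoids cutting those two letters out of the middle of a longer word.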

List<string> lines = new List<string>();
foreach (var line in File.ReadLines("filename"))
{
    if (line.Length > 0 && line[0] == '>') // the length check avoids an exception on empty lines
    {
        if (lines.Count > 0)
        {
            // write contents of lines list to file.
            // and clear the list.
            lines.Clear();
        }
    }
    lines.Add(line);
}
// here, do the last part
if (lines.Count > 0)
{
    // write contents of lines list to file.
}
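The two "write contents of lines list to file" placeholders could be filled in roughly as follows. This is a sketch under assumptions, not the answer's own code: FastaSplitter, Split and the nameFor parameter are hypothetical names, and the caller supplies the function that maps a header line to a file name. It remembers the name taken from the header that opened the current record, so each record is written under its own header rather than the next one.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

static class FastaSplitter
{
    // Splits inputPath into one .txt file per record inside outputDir.
    // nameFor maps a header line to a file name without extension.
    // Returns the number of files written.
    public static int Split(string inputPath, string outputDir, Func<string, string> nameFor)
    {
        int written = 0;
        string currentName = null;          // name of the record being collected
        var lines = new List<string>();

        foreach (string line in File.ReadLines(inputPath))
        {
            if (line.Length > 0 && line[0] == '>')
            {
                // A new header ends the previous record: write it under its own name.
                if (currentName != null)
                {
                    File.WriteAllLines(Path.Combine(outputDir, currentName + ".txt"), lines);
                    written++;
                }
                // Anything collected before the first header is discarded here.
                lines.Clear();
                currentName = nameFor(line);
            }
            lines.Add(line);
        }

        // The last record has no header after it, so flush it here.
        if (currentName != null)
        {
            File.WriteAllLines(Path.Combine(outputDir, currentName + ".txt"), lines);
            written++;
        }
        return written;
    }
}
```

File.WriteAllLines creates the file or overwrites an existing one, so running the split twice does not append duplicate data the way File.AppendText would.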

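Answer 1 (score: 1)

I'd say the simplest way is to read the whole file first with File.ReadAllText(). Then just use String.Split('>'), which returns an array of what I believe are the contents of the new files. Note that the pieces no longer include the > character itself.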
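A minimal sketch of that idea, assuming the text fits in memory and that > appears only at the start of header lines. FastaRecords and Parse are hypothetical names; the sketch separates each piece into its header line and its sequence data, and leaves writing the files to the caller.

```csharp
using System;
using System.Collections.Generic;

static class FastaRecords
{
    // Splits the whole text on '>' and returns (header, body) pairs.
    public static List<KeyValuePair<string, string>> Parse(string text)
    {
        var records = new List<KeyValuePair<string, string>>();

        // RemoveEmptyEntries drops the empty piece in front of the first '>'.
        foreach (string chunk in text.Split(new[] { '>' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // The first line of each chunk is the header; the rest is sequence data.
            int eol = chunk.IndexOf('\n');
            string header = (eol == -1 ? chunk : chunk.Substring(0, eol)).Trim();
            string body = eol == -1 ? "" : chunk.Substring(eol + 1);
            records.Add(new KeyValuePair<string, string>(header, body));
        }
        return records;
    }
}
```

A caller could pass File.ReadAllText(path) to Parse and then write each pair's Value to a file named after its Key. For very large genome files the line-by-line approach from the other answer is likely to use far less memory.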