Question

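I have a large text dataset (about 2 GB) of engineering information written in COBOL. I am trying to extract certain substrings from it and build a CSV list from the extracted data.

The substrings of interest appear at known positions within each record. However, the data itself has no unique identifier (primary key). It is just a list of data in which each "record" begins with a line that starts with "01". Every subsequent line belongs to the same record until the next "01". Whether a given line is present may vary, but when it is present, its data occurs at fixed offsets.

The data looks like this: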

Line1: 01253820RELEVANTSUBSTRING39ALSORELEVANT0990
Line2: 02999IRRELEVANT
Line3: 0420180101RELEVANTMONTHLYDATA000MORERELEVANTDATA8980
Line4: 0420190101FURTHERRELEVANTMONTHLYDATA
Line5: 12000003848982IRRELEVANT
Line6: 0100NEWRECORD8932000
Line7: 0420100101MORE

I have been able to successfully extract the relevant substrings from each "01" line using the following code (partially included below):

static void PopulateList()
{
    List<TurbineModel> turbines = new List<TurbineModel>();

    // sourcePath is defined elsewhere (not shown in this excerpt)
    List<string> lines = File.ReadAllLines(sourcePath).ToList();

    foreach (string line in lines)
    {
        // "01" in the first two characters indicates a new record
        if (line.Substring(0, 2) == "01")
        {
            string ctrl = line.Substring(0, 2); // control key
            TurbineModel newTurbine = new TurbineModel();
            newTurbine.Ctrl = ctrl;

            turbines.Add(newTurbine);
        }
    }
}

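This code works fine. However, there are also several lines starting with "04" that contain additional information, and I have not been able to extract it and group it with the current "01" list. I can extract the substrings from each line starting with "04", but I cannot link that data to the "01" record that precedes it.

What I need the code to do is the following:

1) Reach an "01" in the data and set up a new record
2) Extract the relevant information from the "01" line (the code above)
3) Skip the following lines unless an "04" is reached
4) If an "04" is reached, extract the substrings from that line and group them with the "01" substrings
5) Keep scanning lines until a new "01" is reached, at which point a new record is set up and the process starts again
6) Output everything to CSV

I cannot group the information together so that I know which "04" belongs to which "01".

Any help you can provide is greatly appreciated. Let me know if I can clarify anything.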

Answer 1

Try this, it's a "chunk reader" :) I have used a similar approach in the past. It may need some work, but it parses your sample into 2 "chunks".

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

namespace Solution
{
    class Solution
    {
        static void Main(string[] args)
        {
            var reader = new ChunkReader();

            foreach (Chunk c in reader.Read(@"D:\test.txt"))
            {
                Console.WriteLine(c.Header);
            }

            Console.ReadKey();
        }
    }

    internal class ChunkReader
    {
        public IEnumerable<Chunk> Read(string filePath)
        {
            Chunk currentChunk = null;

            using (StreamReader reader = new StreamReader(File.OpenRead(filePath)))
            {
                string currentLine;

                while ((currentLine = reader.ReadLine()) != null)
                {
                    if (currentLine.StartsWith("01"))
                    {
                        if (currentChunk != null)
                        {
                            yield return currentChunk;
                        }

                        currentChunk = new Chunk();
                        currentChunk.Contents.Add(currentLine);
                    }
                    else
                    {
                        currentChunk?.Contents.Add(currentLine);
                    }
                }
            }

            // Emit the last chunk, but only if the file contained an "01" line
            if (currentChunk != null)
            {
                yield return currentChunk;
            }
        }
    }

    internal class Chunk
    {
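        public Chunk()
        {
            // A List keeps the lines in file order and keeps duplicate lines;
            // a SortedSet would reorder them and drop repeats
            Contents = new List<string>();
        }

        public List<string> Contents { get; }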

        public string Header
        {
            get
            {
                return Contents.FirstOrDefault(s => s.StartsWith("01"));
            }
        }
    }
}

Answer 2

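First, as others have suggested, if the file is large you should consider an alternative to File.ReadAllLines(), because reading the whole file at once can be costly. But since the question is not about that, I will skip it.

To start, here are two dummy functions that stand in for extracting the data you need once you know that a line starts with 01 or 04.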
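As a rough illustration of that alternative, here is a minimal sketch that groups lines lazily, so that only one record is held in memory at a time. The class and method names (StreamingGrouper, GroupRecords, GroupRecordsFromFile) are made up for this example; File.ReadLines is the standard .NET call that enumerates a file line by line instead of loading it all at once.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class StreamingGrouper
{
    // Yields one list per record: the "01" line first, then its "04" lines.
    // Lines seen before the first "01" are ignored.
    public static IEnumerable<List<string>> GroupRecords(IEnumerable<string> lines)
    {
        List<string> current = null;

        foreach (var line in lines)
        {
            if (line.StartsWith("01", StringComparison.Ordinal))
            {
                // A new "01" closes the previous record, if there was one
                if (current != null) yield return current;
                current = new List<string> { line };
            }
            else if (current != null && line.StartsWith("04", StringComparison.Ordinal))
            {
                current.Add(line);
            }
        }

        // Emit the final record
        if (current != null) yield return current;
    }

    // Hypothetical convenience wrapper: streams the file instead of reading it whole
    public static IEnumerable<List<string>> GroupRecordsFromFile(string path)
    {
        return GroupRecords(File.ReadLines(path));
    }
}
```

Because the method takes any IEnumerable<string>, it can be tried on a small in-memory array before pointing it at the 2 GB file.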

static string Extract01Data(string line)
{
    return line;
}

static string Extract04Data(string line)
{
    return line;
}
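For a sense of what real versions of these functions might look like, here is a small sketch of fixed-position extraction. The offsets are assumptions read off the sample lines in the question (for example, RELEVANTSUBSTRING starting at index 8 with length 17); the real COBOL layout will have its own positions. FixedWidth, Field, Extract01Name and Extract04Date are hypothetical names.

```csharp
using System;

public static class FixedWidth
{
    // Returns the field at [start, start + length), trimmed.
    // Returns an empty string instead of throwing when the line is too short.
    public static string Field(string line, int start, int length)
    {
        if (start >= line.Length) return string.Empty;
        int available = Math.Min(length, line.Length - start);
        return line.Substring(start, available).Trim();
    }

    // Assumed layout of an "01" line: name field at index 8, 17 characters long
    public static string Extract01Name(string line)
    {
        return Field(line, 8, 17);
    }

    // Assumed layout of an "04" line: 8-digit date right after the "04" prefix
    public static string Extract04Date(string line)
    {
        return Field(line, 2, 8);
    }
}
```

Clamping the length matters here because the question says that the content of a given line can vary, and a plain Substring call throws when the line is shorter than expected.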

Edit

Edited the answer to handle multiple lines starting with 04 after a line starting with 01:

And a simple class to hold your result data:

public class Record
{
    public string OneInfo { get; set; }
    public List<string> FourInfo { get; set; } = new List<string>();
}

Then, here is my code, with explanations in the comments:

static void Main()
{
    var file = @"C:\Users\gurudeniyas\Desktop\CobolData.txt";
    var lines = File.ReadAllLines(file).ToList();

    var records = new List<Record>();

    for (var count = 0; count < lines.Count; count++)
    {
        var line = lines[count];
        var firstTwo = line.Substring(0, 2);
        // Iterate till we find a line that starts with 01
        if (firstTwo == "01")
        {
            // Create a Record and add 01 line related data
            var rec = new Record
            {
                OneInfo = Extract01Data(line)
            };

            // Here we iterate over the following lines to find those that start with 04
            // If we find them, extract the 04 data and add it to the record
            // Break out of the loop if we find the next 01 line or EOF
            do
            {
                count++;
                if (count == lines.Count)
                    break;
                line = lines[count];
                firstTwo = line.Substring(0, 2);
                if (firstTwo == "04")
                {
                    rec.FourInfo.Add(Extract04Data(line));
                }
            } while (firstTwo != "01");

            // If we found next 01, backtrack count by 1 so in the outer loop we can process that record again
            if (firstTwo == "01")
            {
                count--;
            }
            records.Add(rec);
        }
    }

    Console.ReadLine();
}

Answer 3

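It seems to me that all you need to do is create a class that can store the data from the 01 line and can also hold the relevant parts of the lines that follow it.

Here is an example in which we loop over each line in the file. If the line starts with "01", we create a new Item and store the line as its Data (you could instead do some processing of the line contents and populate other properties). If the line does not start with "01" and we have already created an Item, then if the line starts with "04" we add the line to that item's AssociatedData property (again, you could process the line in some way and add only the relevant parts to the Item).

In the end we have a list of Item objects, each created from a line starting with "01" and containing the associated lines that follow it, up to the next line that starts with "01".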

First, the Item class:

public class Item
{
    public string Data { get; set; }
    public List<string> AssociatedData { get; set; } = new List<string>();

    // This returns a comma-separated line representing this item
    public string GetCsvString()
    {
        return $"{Data},{string.Join(",", AssociatedData)}";
    }
}

Then the code that creates a list of these items from the file data:

public static List<Item> GetItems(string filePath)
{
    var items = new List<Item>();
    Item current = null;

    foreach (var line in File.ReadAllLines(filePath))
    {
        if (line.StartsWith("01"))
        {
            // If there's already a current item, add it to our list
            if (current != null) items.Add(current);

            // Here we would parse the '01' line and set properties of the current item
            current = new Item {Data = line};
        }
        else if (line.StartsWith("04"))
        {
            // Here we would parse the '04' line and set properties of the current item
            current?.AssociatedData.Add(line);
        }
    }

    // Add the final item to our list
    if (current != null) items.Add(current);

    return items;
}

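The code that calls the method above would then look like this:

var items = GetItems(@"f:\public\temp\temp.txt");

Writing the items out to a CSV file is probably best done by overriding the ToString() method on the Item class, or by providing a GetCsvString() method that returns the relevant data in the correct format. After that, you can write the items to a CSV file, one GetCsvString() result per line.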
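The writing step could be sketched as follows. This is an assumption-laden illustration rather than the answer's original code: CsvWriterSketch, Escape, ToCsvLine and WriteCsv are hypothetical names, and the quoting rule (wrap a field in double quotes when it contains a comma, a quote or a line break, and double any embedded quotes) is the common CSV convention. File.WriteAllLines is the standard .NET call used for output.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class CsvWriterSketch
{
    // Quotes a field only when it contains a character that would break the row
    public static string Escape(string field)
    {
        if (field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) < 0) return field;
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Joins the escaped fields of one row with commas
    public static string ToCsvLine(IEnumerable<string> fields)
    {
        return string.Join(",", fields.Select(f => Escape(f)));
    }

    // Writes one CSV line per row to the given path
    public static void WriteCsv(string path, IEnumerable<IEnumerable<string>> rows)
    {
        File.WriteAllLines(path, rows.Select(r => ToCsvLine(r)));
    }
}
```

With the Item class above, a row could be built as the Data value followed by the AssociatedData entries. Escaping each field separately avoids broken columns if an extracted substring happens to contain a comma.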

Answer 4

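If an "04" always follows an 01, you can add an else as shown below and then access the last item in the list (this works because adding an item to a list appends it to the end).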

foreach (string line in lines)
{
    if (line.Substring(0, 2) == "01")
    {
        ctrl = line.Substring(0, 2);
        TurbineModel newTurbine = new TurbineModel();
        newTurbine.Ctrl = ctrl;

        turbines.Add(newTurbine);
     }
     else if (line.Substring(0, 2) == "04")
     {
        var lastTurbine = turbines[turbines.Count - 1];
        //do what you need to do with the "04" record monthly data here
     }
}

Answer 5

Have you looked into using a finite state machine algorithm? It seems like an ideal fit.
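A minimal sketch of what that could look like, assuming two states (waiting for the first "01" line, and inside a record) and transitions driven by the two-character line prefix. ParsedRecord, RecordStateMachine and the state names are invented for this illustration.

```csharp
using System;
using System.Collections.Generic;

public sealed class ParsedRecord
{
    public string Header { get; set; }
    public List<string> Details { get; } = new List<string>();
}

public static class RecordStateMachine
{
    private enum State { SeekingHeader, InRecord }

    public static List<ParsedRecord> Parse(IEnumerable<string> lines)
    {
        var records = new List<ParsedRecord>();
        var state = State.SeekingHeader;
        ParsedRecord current = null;

        foreach (var line in lines)
        {
            bool isHeader = line.StartsWith("01", StringComparison.Ordinal);
            bool isDetail = line.StartsWith("04", StringComparison.Ordinal);

            switch (state)
            {
                case State.SeekingHeader:
                    // Ignore everything until the first "01" line
                    if (isHeader)
                    {
                        current = new ParsedRecord { Header = line };
                        records.Add(current);
                        state = State.InRecord;
                    }
                    break;

                case State.InRecord:
                    if (isHeader)
                    {
                        // A new "01" starts the next record; the state stays the same
                        current = new ParsedRecord { Header = line };
                        records.Add(current);
                    }
                    else if (isDetail)
                    {
                        current.Details.Add(line);
                    }
                    // Any other prefix is skipped
                    break;
            }
        }

        return records;
    }
}
```

Adding another line type later would mean one more branch in the InRecord case, which is the main attraction of writing the parser as an explicit state machine.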

How to collect a list of specific substrings on different substring triggers?