Parsing a complex table structure

Asked: 2017-12-12 07:46:52

Tags: c# regex

I am trying to parse the following table in C#:

+-------------+-----------------------------------------------------------------------------------+----------------+
|      1      |                                         2                                         |        3       |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 000         | Собственные средства (капитал), итого,                                            |                |
|             | в том числе:                                                                      |     1024231079 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100         |Источники базового капитала:                                                       |     1291298211 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100.1       |Уставный капитал кредитной организации:                                            |      651033884 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100.1.1     |сформированный обыкновенными акциями                                               |      129605413 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100.1.2     |сформированный привилегированными акциями                                          |      521428471 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100.1.3     |сформированный долями                                                              |              0 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100.2       |Эмиссионный доход:                                                                 |      439401101 |
+-------------+-----------------------------------------------------------------------------------+----------------+
| 100.2.1     |кредитной организации в организационно-правовой форме акционерного общества, всего,|                |
|             | в том числе:                                                                      |      439401101 |
+-------------+-----------------------------------------------------------------------------------+----------------+

My code is:

string[] dels = { "\r\n" };
string[] strArr = someStr.Split(dels, StringSplitOptions.None);

Console.WriteLine(strArr);

foreach (String sourcestring in strArr)
{
    if (sourcestring != null)
    {
        Console.WriteLine("Processing string: ");
        Console.WriteLine(sourcestring);
        //Regex regex = new Regex(@"^(\|)(.*)(\|)(.*[а-я]{3}.*)(\|)(.*\d+.*)(\|)(.*[\d+|Х].*)(\|)(.*[\d+|Х].*)(\|)(.*\d+.*)(\|)$");
        //Regex regex = new Regex(@"^(\|)(\s?|\d+[\.?])(\|)(.*[а-я]{3}.*)(\|)(.*\d+.*)(\|)(.*[\d+|Х].*)(\|)(.*[\d+|Х].*)(\|)(.*\d+.*)(\|)$");
        Regex regex = new Regex(@"^(\|)(\d+\.?\d+)");
        MatchCollection mc = regex.Matches(sourcestring);
        int mIdx = 0;
        foreach (Match m in mc)
        {
            for (int gIdx = 0; gIdx < m.Groups.Count; gIdx++)
            {
                Console.WriteLine("[{0}][{1}] = {2}", mIdx, regex.GetGroupNames()[gIdx], m.Groups[gIdx].Value);
            }
            mIdx++;
        }
        Console.WriteLine("---------------------------------------------------------");
    }
}

I need to extract the cell values from each row, for example from lines

4 - ' 000         ', ' Собственные средства (капитал), итого,                                            ', '                '

5 - '             ', ' в том числе:                                                                      ', '     1024231079 '

and likewise from lines 7, 9, and so on.

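The main problem right now is that I don't know how to write a regular expression that matches the values in the first column, which can be: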

' 000         '

'             '

' 100         '

' 100.1       '

' 100.1.1     '

and so on.

The second problem is the second column. I tried to parse it with (.*[а-я]{3}.*), but that fails on lines containing symbols such as '(', ',' and '.'.

I would appreciate any possible solution.

1 Answer:

Answer 0 (score: 1)

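I think RegEx would be overkill in this case; a simple manual parsing approach would be easier:

  Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

There are two possible approaches in this case:

  1. Parse the first line (+---+--- ...) to determine the width of each column, then cut the data apart with Substring.
  2. Split each line on |.

Below, I outline the basics of the second approach (without sanity checks). If your data can also contain |, you may need to parse it by cell width instead of splitting on that character.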

    // Row is defined below - simple data storage for the three columns
    List<Row> rows = new List<Row>();
    Row currentRow = null;
    
    // Process each line
    foreach (string line in input.Split(new string[] {"\r\n"}, StringSplitOptions.RemoveEmptyEntries))
    {
        // Row separator or content?
        if (line.StartsWith("+"))
        {
            if (currentRow != null)
            {
                rows.Add(currentRow);
                currentRow = null;
            }
        }
        else if (line.StartsWith("|"))
        {
            string[] parts = line.Split(new char[] {'|'});
            if(currentRow == null)
                currentRow = new Row();
    
            // Might need additional processing
            currentRow.Column1 += parts[1].Trim();
            currentRow.Column2 += parts[2].TrimEnd();
            currentRow.Column3 += parts[3].TrimStart();
        }
        else
        {
            //Invalid data?
        }
    }
    
    // Show result
    foreach(Row row in rows)
    {
        Console.WriteLine("[{0}][{1}] = {2}", row.Column1, row.Column2, row.Column3);
    }
    

Instead of a custom class, you could use Tuple<string,string,string> or whatever suits your data types.

    public class Row
    {
        public string Column1 = "";
        public string Column2 = "";
        public string Column3 = "";
    }
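
The first approach listed above (deriving column widths from the separator line) is not shown in the answer's code. A minimal sketch of how it might look follows. The names FixedWidthTable, GetColumnSpans and SliceCells are hypothetical, and the sketch assumes every content line is aligned with the separator line and that each character occupies one UTF-16 code unit, which holds for the Cyrillic text in the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the original answer.
public static class FixedWidthTable
{
    // Returns (start, length) for each cell, derived from the positions of '+'
    // in a separator line such as "+-----+--------+".
    public static List<Tuple<int, int>> GetColumnSpans(string separatorLine)
    {
        var spans = new List<Tuple<int, int>>();
        int previousPlus = -1;
        for (int i = 0; i < separatorLine.Length; i++)
        {
            if (separatorLine[i] != '+')
                continue;
            if (previousPlus >= 0)
                spans.Add(Tuple.Create(previousPlus + 1, i - previousPlus - 1));
            previousPlus = i;
        }
        return spans;
    }

    // Cuts one content line into cells by position, so a '|' inside a cell
    // does not break the parse. Lines shorter than expected yield empty cells.
    public static string[] SliceCells(string line, List<Tuple<int, int>> spans)
    {
        var cells = new string[spans.Count];
        for (int c = 0; c < spans.Count; c++)
        {
            int start = spans[c].Item1;
            int length = spans[c].Item2;
            if (start >= line.Length)
            {
                cells[c] = "";
                continue;
            }
            cells[c] = line.Substring(start, Math.Min(length, line.Length - start));
        }
        return cells;
    }
}
```

The cells returned by SliceCells can be appended to the current row in the same way as parts[1], parts[2] and parts[3] in the loop above, including the handling of cells that span several lines.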
    

Example on DotNetFiddle
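
If a regular expression is still wanted for the first column, the pattern from the question, ^(\|)(\d+\.?\d+), fails for two reasons: it does not allow the space that follows the leading |, and \d+\.?\d+ permits at most one dot, so 100.1.1 is never matched in full. A possible alternative is sketched below. The class name FirstColumn and the method TryGetCode are hypothetical, and the pattern assumes the codes consist only of digit groups separated by dots.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original answer.
public static class FirstColumn
{
    // Leading '|', optional padding, an optional code made of dot-separated
    // digit groups, optional padding, then the '|' that closes the first cell.
    private static readonly Regex CodePattern =
        new Regex(@"^\|\s*(\d+(?:\.\d+)*)?\s*\|");

    // Returns the code ("000", "100.1.1", ...), an empty string when the first
    // cell is blank (a continuation line), or null when the line is not a
    // content row (for example a "+---+" separator).
    public static string TryGetCode(string line)
    {
        Match m = CodePattern.Match(line);
        if (!m.Success)
            return null;
        return m.Groups[1].Value;
    }
}
```

An empty result marks a continuation line whose text belongs to the previous code, which matches how the loop above merges such lines into the current row.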