Splitting data with an irregular pattern

Time: 2014-04-09 15:27:16

Tags: c#

Here is some real sample data:

string s1 = "CLR DRBR|r 0004  BLCK|r 0006  WHIT|r 0006";
string s2 = "WGT WHGN|c 0004 YLGN|c 0006";
string s3 = "296  312|d 0004  137.2|n 0006";
string s4 = "HGT SH|r 0004";
string s5 = "ANLP  ANLP1 PNPL|r 0004";

The data always follows this format: [Group] [Value][Pipe + letter][Key]. The [Value][Pipe + letter][Key] part may repeat several times.

Is there a way to split this kind of data into the following:

string[] out1 = { "CLR", "DRBR", "|r 0004", "BLCK", "|r 0006", "WHIT", "|r 0006" };
string[] out2 = { "WGT", "WHGN", "|c 0004", "YLGN", "|c 0006" };
string[] out3 = { "296", "312", "|d 0004", "137.2", "|n 0006" };
string[] out4 = { "HGT", "SH", "|r 0004" };
string[] out5 = { "ANLP", "ANLP1 PNPL", "|r 0004" };

Note that the pattern in s5 differs slightly from the others: its value ("ANLP1 PNPL") contains a space.

  
    

This is legacy data from the 1960s, so please don't ask me why the data is stored this way. Thanks.

  

2 answers:

Answer 0: (score: 1)

Looking at the data, you seem to have the following rules:

Phase 1: Read up to the first space, split there, and drop the space(s).
Phase 2: Read up to the next `|` and split just before it.
Phase 3: Keep the `|`, its letter code, and the following space (3 characters), then read to the next space or end of text (EOT); split there and drop any spaces that follow.
Go back to Phase 2 if there is more data.

Something like this (you will probably want more error checking than I have put in):

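void Main()
{
  string s1 = "CLR DRBR|r 0004  BLCK|r 0006  WHIT|r 0006";
  string s2 = "WGT WHGN|c 0004 YLGN|c 0006";
  string s3 = "296  312|d 0004  137.2|n 0006";
  string s4 = "HGT SH|r 0004";
  string s5 = "ANLP  ANLP1 PNPL|r 0004";

  // Dump() is a LINQPad extension method; outside LINQPad, print with
  // Console.WriteLine(string.Join(", ", splitit(s1))) instead.
  splitit(s1).Dump();
}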

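// Requires System.Collections.Generic (imported by default in LINQPad).
string[] splitit(string input)
{
    List<string> output = new List<string>();

    int index = 0;

    // phase one: the group is everything up to the first space
    while (index < input.Length && input[index] != ' ') index++;
    output.Add(input.Substring(0, index));

    // skip space(s)
    while (index < input.Length && input[index] == ' ') index++;

    int indexTmp = index;

    while (index < input.Length)
    {
        // phase two: the value runs up to (not including) the next '|'
        while (index < input.Length && input[index] != '|') index++;
        if (index >= input.Length) break; // no further '|': ignore trailing text
        output.Add(input.Substring(indexTmp, index - indexTmp));

        // phase three: keep '|', the letter code and the space, then read the key
        indexTmp = index;
        index += 3;
        if (index > input.Length) index = input.Length;
        // read to the next space or the end of the string; checking the bound
        // first keeps the last character of the final key (e.g. "|r 0004")
        while (index < input.Length && input[index] != ' ') index++;
        output.Add(input.Substring(indexTmp, index - indexTmp));

        // skip spaces before the next value
        while (index < input.Length && input[index] == ' ') index++;
        indexTmp = index;
    }

    return output.ToArray();
}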
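
As an alternative to walking the string by hand, the same three phases can be expressed with a regular expression. This is only a sketch, not the method used above: the class name `RegexSplitter` and its pattern are assumptions made for illustration, and it assumes every key has the shape "pipe, one letter, one space, non-space characters", as in the five samples.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the name and the pattern are illustrative assumptions.
static class RegexSplitter
{
    // Group 1: a value (any run of characters without '|').
    // Group 2: a key made of '|', one letter, one space, then non-space characters.
    static readonly Regex Pair = new Regex(@"\s*([^|]+)(\|[A-Za-z] \S+)");

    public static string[] Split(string input)
    {
        var output = new List<string>();

        // Phase 1: the group is everything before the first space.
        int space = input.IndexOf(' ');
        if (space < 0) return new[] { input };
        output.Add(input.Substring(0, space));

        // Phases 2 and 3: repeated value/key pairs after the group.
        foreach (Match m in Pair.Matches(input, space))
        {
            output.Add(m.Groups[1].Value.Trim());
            output.Add(m.Groups[2].Value);
        }
        return output.ToArray();
    }
}
```

For example, `RegexSplitter.Split("ANLP  ANLP1 PNPL|r 0004")` yields "ANLP", "ANLP1 PNPL" and "|r 0004", because the value group is allowed to contain spaces but not a pipe.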

Answer 1: (score: 0)

You already have an accepted answer, but since you said my approach would not work, here is what I meant:

int index;
List<string[]> output = new List<string[]>();
List<string> current = null;
string[] fields;

//i imagine this will be in an array when you read it in from a file
string[] input = new string[5];
input[0] = "CLR DRBR|r 0004  BLCK|r 0006  WHIT|r 0006";
input[1] = "WGT WHGN|c 0004 YLGN|c 0006";
input[2] = "296  312|d 0004  137.2|n 0006";
input[3] = "HGT SH|r 0004";
input[4] = "ANLP  ANLP1 PNPL|r 0004";

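Now you just loop over the records: split the first field of each record on its first space, then check each subsequent field for a second space and handle it accordingly.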

bool first = true;

//loop through each of the input records
foreach (string record in input)
{
    //split the input records based on the pipe character
    fields = record.Split("|".ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
    //loop through each of the fields
    foreach (string field in fields)
    {
        if (first) //split the first field based on the first space in field
        {
            current = new List<string>();
            index = field.IndexOf(" ");
            current.Add(field.Substring(0, index).Trim());
            current.Add(field.Substring(index + 1).Trim());
            first = false;
        }
        else  //split subsequent records based on second space if it exists
        {
             index = field.IndexOf(" ", 3);
             if (index == -1)
             {
                 current.Add("|" + field);
             }
             else
             {
                 current.Add("|" + field.Substring(0, index).Trim());
                 current.Add(field.Substring(index + 1).Trim());
             }
        }
    }

    //control break processing
    first = true;
    output.Add(current.ToArray());
}

You could easily move the inner loop into a separate function. If you test it, I think you will find this approach is faster.
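
A minimal sketch of that refactoring might look like the following. The names `RecordSplitter` and `SplitRecord` are hypothetical, and the two guards for a missing space or a very short field are additions that the loop above does not have; no timing claim is made for this version.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: the inner loop above, moved into its own function.
static class RecordSplitter
{
    public static string[] SplitRecord(string record)
    {
        var current = new List<string>();

        // Split the record on the pipe character.
        string[] fields = record.Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries);

        for (int i = 0; i < fields.Length; i++)
        {
            string field = fields[i];
            if (i == 0)
            {
                // First field: split on the first space into group and value.
                int index = field.IndexOf(' ');
                if (index < 0) { current.Add(field.Trim()); continue; } // added guard
                current.Add(field.Substring(0, index).Trim());
                current.Add(field.Substring(index + 1).Trim());
            }
            else
            {
                // Later fields: "<letter> <key>[  <next value>]".
                // Look for a second space, starting after "<letter> " plus one character.
                int index = field.Length > 3 ? field.IndexOf(' ', 3) : -1; // added guard
                if (index == -1)
                {
                    current.Add("|" + field);
                }
                else
                {
                    current.Add("|" + field.Substring(0, index).Trim());
                    current.Add(field.Substring(index + 1).Trim());
                }
            }
        }
        return current.ToArray();
    }
}
```

The outer loop then reduces to calling `output.Add(RecordSplitter.SplitRecord(record))` for each record, and the `first` flag is no longer needed because the field position is tracked by the loop counter.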