Parsing a multi-line email into variables

Date: 2019-03-31 22:06:50

Tags: c# .net regex parsing

I am trying to parse a multi-line email so that I can get the data that sits on its own line under each heading in the email body. It looks like this:

EMAIL STARTING IN APRIL

Marketing ID                                     Local Number
-------------------                              ----------------------
GR332230                                         0000232323

Dispatch Code                                    Logic code
-----------------                                -------------------
GX3472                                           1

Destination ID                                   Destination details
-----------------                                -------------------
3411144

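When I use a StringReader with ReadLine, every message box seems to show the whole body, although all I want is the data under each ------ as shown above.

Here is my code: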

foreach (MailItem mail in publicFolder.Items)
{
    if (mail != null)                  
    {

        if (mail is MailItem)
        {

            MessageBox.Show(mail.Body, "MailItem body");
            // Creates new StringReader instance from System.IO
            using (StringReader reader = new StringReader(mail.Body))
            {
                string line;
                while ((line = reader.ReadLine()) !=null) 
                //Loop over the lines in the string.
                if (mail.Body.Contains("Marketing ID"))
                {
                    // var localno = mail.Body.Substring(247,15);//not correct approach

                    // MessageBox.Show(localrefno);
                    //MessageBox.Show("found");
                    //var conexid = mail.Body.Replace(Environment.NewLine);

                    var regex = new Regex("<br/>", RegexOptions.Singleline);


                    MessageBox.Show(line.ToString());
                }
            }


            //var stringBuilder = new StringBuilder();

            //foreach (var s in mail.Body.Split(' '))
            //{
            //    stringBuilder.Append(s).AppendLine();
            //}
            //MessageBox.Show(stringBuilder.ToString());



        }
        else
        {
            MessageBox.Show("Nothing found for MailItem");
        }
    }
}    

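As you can see, I have made many attempts, including substring positions and regular expressions. Please help me get the data from each line under the ---.

4 Answers:

Answer 0: (score: 1)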

  var dict = new Dictionary<string, string>();
            try
            {
                var lines = email.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
                int starts = 0, end = 0, length = 0;
                while (!lines[starts + 1].StartsWith("-")) starts++;
                for (int i = starts + 1; i < lines.Length; i += 3)
                {
                    var mc = Regex.Matches(lines[i], @"(?:^| )-");
                    foreach (Match m in mc)
                    {
                        int start = m.Value.StartsWith(" ") ? m.Index + 1 : m.Index;
                        end = start;
                        while (lines[i][end++] == '-' && end < lines[i].Length - 1) ;
                        length = Math.Min(end - start, lines[i - 1].Length - start);
                        string key = length > 0 ? lines[i - 1].Substring(start, length).Trim() : "";
                        end = start;
                        while (lines[i][end++] == '-' && end < lines[i].Length) ;
                        length = Math.Min(end - start, lines[i + 1].Length - start);
                        string value = length > 0 ? lines[i + 1].Substring(start, length).Trim() : "";
                        dict.Add(key, value);
                    }
                }
            }
            catch (Exception ex)
            {
                throw new Exception("Email is not in correct format");
            }

Live Demo

Using regular expressions:

     var dict = new Dictionary<string, string>();
        try
        {
            var lines = email.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);
            int starts = 0;
            while (!lines[starts + 1].StartsWith("-")) starts++;
            for (int i = starts + 1; i < lines.Length; i += 3)
            {
                var keys = Regex.Matches(lines[i - 1], @"(?:^| )(\w+\s?)+");
                var values = Regex.Matches(lines[i + 1], @"(?:^| )(\w+\s?)+");
                if (keys.Count == values.Count)
                    for (int j = 0; j < keys.Count; j++)

                        dict.Add(keys[j].Value.Trim(), values[j].Value.Trim());
                else // remove bug if value of first key in a line has no value
                {
                    if (lines[i + 1].StartsWith(" "))
                    {
                        dict.Add(keys[0].Value.Trim(), "");
                        dict.Add(keys[1].Value.Trim(), values[0].Value.Trim());
                    }
                    else
                    {
                        dict.Add(keys[0].Value.Trim(), values[0].Value.Trim());
                        dict.Add(keys[1].Value.Trim(), "");
                    }
                }

            }
        }
        catch (Exception ex)
        {
            throw new Exception("Email is not in correct format");
        }

Live Demo

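Answer 1: (score: 0)

Here is my attempt. I don't know whether the email format can change (rows, columns, etc.).

I couldn't think of a simple way to separate the columns other than checking for double spaces (my solution).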

The output looks like this:

Marketing ID,GR332230,Local Number,0000232323 Dispatch Code,GX3472,Logic code,1 Destination ID,3411144,Destination details,
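
The double-space idea described in this answer could be sketched roughly as follows. This is not the answerer's original code: `DoubleSpaceSplitter`, `Parse` and `SplitColumns` are hypothetical names, and the sketch assumes that columns are separated by at least two spaces and that the value row sits directly below the dash row.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class DoubleSpaceSplitter
{
    // Hypothetical helper: pairs each header (line above a dash row)
    // with the value at the same position in the line below the dash row.
    public static List<KeyValuePair<string, string>> Parse(string body)
    {
        var result = new List<KeyValuePair<string, string>>();
        string[] lines = Regex.Split(body, @"\r?\n|\r");

        for (int i = 1; i < lines.Length; i++)
        {
            // A separator row consists only of dashes and spaces.
            if (!Regex.IsMatch(lines[i], @"^-+[- ]*$")) continue;

            string[] keys = SplitColumns(lines[i - 1]);
            string[] values = i + 1 < lines.Length ? SplitColumns(lines[i + 1]) : new string[0];

            for (int j = 0; j < keys.Length; j++)
            {
                // Values are matched by position only, so an empty FIRST column
                // shifts the later values to the left (a known limitation).
                string value = j < values.Length ? values[j] : "";
                result.Add(new KeyValuePair<string, string>(keys[j], value));
            }
        }
        return result;
    }

    // Splits a line on runs of two or more whitespace characters.
    private static string[] SplitColumns(string line)
    {
        string trimmed = line.Trim();
        return trimmed.Length == 0 ? new string[0] : Regex.Split(trimmed, @"\s{2,}");
    }
}
```

Calling `DoubleSpaceSplitter.Parse(mail.Body)` would give one key/value pair per header. Because values are matched by position, a row whose first value is missing will be paired with the wrong header.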

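Answer 2: (score: 0)

Here is an approach that assumes you don't need the headers, and that the information comes in order and is mandatory. It will not work for data that contains spaces or for optional fields.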

foreach (MailItem mail in publicFolder.Items)
{
  MessageBox.Show(mail.Body, "MailItem body");
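  // Split by line; drop blank lines and dash lines, and skip the title line
  // (assumes the first non-blank line is a title such as "EMAIL STARTING IN APRIL").
  var data = Regex.Split(mail.Body, @"\r?\n|\r")
    .Where(l => !string.IsNullOrWhiteSpace(l) && !l.StartsWith("-"))
    .Skip(1)
    .ToList();
  // Remove headers: lines now alternate header, value, so walk backwards in steps of 2.
  for (var i = data.Count - 2; i >= 0; i -= 2)
  {
    data.RemoveAt(i);
  }
  // Now data contains only the info you want, in the order it was presented.
  // Assuming the info doesn't contain spaces.
  var result = data.SelectMany(d => d.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));
  // WARNING: Missing info will not be present.
  // {"GR332230", "0000232323", "GX3472", "1", "3411144"}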
}

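Answer 3: (score: 0)

Doing this with Regex is not a good idea: it is very easy to forget edge cases, and the result is hard to understand and hard to debug. It is also easy to end up with a Regex that hangs the CPU and times out. (I cannot comment on the other answers, so please at least check my two additional cases below before choosing a final solution.)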
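
If you do use a Regex on untrusted email bodies, one way to guard against the hang mentioned above is the match timeout that .NET supports (since .NET Framework 4.5) through the `Regex` constructor. A minimal sketch; `SafeRegex` and `TryIsMatch` are hypothetical names used only for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class SafeRegex
{
    // Returns true or false for a normal match attempt,
    // or null if the regex engine gave up after the timeout.
    public static bool? TryIsMatch(string input, string pattern, TimeSpan timeout)
    {
        try
        {
            var regex = new Regex(pattern, RegexOptions.None, timeout);
            return regex.IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            // The pattern backtracked for too long on this input.
            return null;
        }
    }
}
```

For example, `SafeRegex.TryIsMatch(mail.Body, pattern, TimeSpan.FromSeconds(1))` lets the caller treat a `null` result as "could not parse" instead of blocking the UI thread.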

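In your case, the following Regex solution works for the sample you provided. However, it has some additional limitations: you need to make sure there are no empty values in any column that is neither the first nor the last. In other words, if there are more than two columns and one of the middle ones is empty, the names and values of that row will be mismatched.

Unfortunately, I cannot give you a non-Regex solution because I don't know the specification. For example: Will there be blank values? Will there be tabs? Does each field have a fixed number of characters, or is it flexible? If it is flexible and can hold empty values, what rule tells you which columns are empty? I think it is quite likely that the columns are defined by the length of the column name, with only spaces as the delimiter. If that is the case, there are two ways to solve this: a two-pass Regex, or writing your own parser. If all fields have a fixed length, it is even easier: just cut the lines with Substring and then trim them.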

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public class Program
{
    public class Record{
        public string Name {get;set;}
        public string Value {get;set;}
    }

    public static void Main()
    {
        var regex = new Regex(@"(?<name>((?!-)[\w]+[ ]?)*)(?>(?>[ \t]+)?(?<name>((?!-)[\w]+[ ]?)+)?)+(?:\r\n|\r|\n)(?>(?<splitters>(-+))(?>[ \t]+)?)+(?:\r\n|\r|\n)(?<value>((?!-)[\w]+[ ]?)*)(?>(?>[ \t]+)?(?<value>((?!-)[\w]+[ ]?)+)?)+", RegexOptions.Compiled);
        var testingValue =
@"EMAIL STARTING IN APRIL

Marketing ID                                     Local Number
-------------------                              ----------------------
GR332230                                         0000232323

Dispatch Code                                    Logic code
-----------------                                -------------------
GX3472                                           1

Destination ID                                   Destination details
-----------------                                -------------------
3411144";
        var matches = regex.Matches(testingValue);

        var rows = (
            from match in matches.OfType<Match>()
            let row = (
                from grp in match.Groups.OfType<Group>()
                select new {grp.Name, Captures = grp.Captures.OfType<Capture>().ToList()}
            ).ToDictionary(item=>item.Name, item=>item.Captures.OfType<Capture>().ToList())
            let names = row.ContainsKey("name")? row["name"] : null
            let splitters = row.ContainsKey("splitters")? row["splitters"] : null
            let values = row.ContainsKey("value")? row["value"] : null
            where names != null && splitters != null &&
                names.Count == splitters.Count &&
                (values==null || values.Count <= splitters.Count)
            select new {Names = names, Values = values}
            );

        var records = new List<Record>();
        foreach(var row in rows)
        {
            for(int i=0; i< row.Names.Count; i++)
            {
                records.Add(new Record{Name=row.Names[i].Value, Value=i < row.Values.Count ? row.Values[i].Value : ""});
            }
        }

        foreach(var record in records)
        {
            Console.WriteLine(record.Name + " = " + record.Value);
        }
    }
}

Output:

Marketing ID  = GR332230 
Local Number = 0000232323
Dispatch Code  = GX3472 
Logic code = 1
Destination ID  = 3411144
Destination details =

Note that this also works for messages like this:

EMAIL STARTING IN APRIL

Marketing ID                                     Local Number
-------------------                              ----------------------
GR332230                                         0000232323

Dispatch Code                                    Logic code
-----------------                                -------------------
GX3472                                           1

Destination ID                                   Destination details
-----------------                                -------------------
                                                 3411144

Output:

Marketing ID  = GR332230 
Local Number = 0000232323
Dispatch Code  = GX3472 
Logic code = 1
Destination ID  = 
Destination details = 3411144

Or this:

EMAIL STARTING IN APRIL

Marketing ID                                     Local Number
-------------------                              ----------------------


Dispatch Code                                    Logic code
-----------------                                -------------------
GX3472                                           1

Destination ID                                   Destination details
-----------------                                -------------------
                                                 3411144               

Output:

Marketing ID  = 
Local Number = 
Dispatch Code  = GX3472 
Logic code = 1
Destination ID  = 
Destination details = 3411144
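
For comparison, the "write your own parser" option mentioned earlier in this answer could look roughly like the sketch below. It rests on an assumption that the question does not confirm: each run of dashes starts at the same character position as its header and its value, and only spaces (no tabs) are used for padding. `ColumnParser`, `Parse` and `Slice` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ColumnParser
{
    // Hypothetical helper: uses the start index of every dash run as the
    // column start for the header line above and the value line below.
    public static List<KeyValuePair<string, string>> Parse(string body)
    {
        var records = new List<KeyValuePair<string, string>>();
        string[] lines = Regex.Split(body, @"\r?\n|\r");

        for (int i = 1; i < lines.Length; i++)
        {
            MatchCollection runs = Regex.Matches(lines[i], "-+");
            // Treat the line as a separator only if it holds nothing but dashes and whitespace.
            if (runs.Count == 0 || lines[i].Trim(' ', '\t', '-').Length > 0) continue;

            string valueLine = i + 1 < lines.Length ? lines[i + 1] : "";
            for (int c = 0; c < runs.Count; c++)
            {
                int start = runs[c].Index;
                // A column extends to the start of the next dash run, or to the end of the line.
                int end = c + 1 < runs.Count ? runs[c + 1].Index : int.MaxValue;
                records.Add(new KeyValuePair<string, string>(
                    Slice(lines[i - 1], start, end),
                    Slice(valueLine, start, end)));
            }
        }
        return records;
    }

    // Substring that tolerates lines shorter than the requested range, then trims.
    private static string Slice(string line, int start, int end)
    {
        if (start >= line.Length) return "";
        int length = Math.Min(end, line.Length) - start;
        return line.Substring(start, length).Trim();
    }
}
```

Because each value is cut by position rather than by counting tokens, an empty column in the middle of a row stays attached to the right header, which is the case the Regex version above cannot handle. The trade-off is that it breaks as soon as the sender's columns stop lining up with the dashes.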