Regex to extract "From" from a mail archive, including the name and line breaks

Time: 2014-09-22 16:10:01

Tags: c# regex

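I receive mail archives in the format below, and my goal is to parse them and store them in a database. The example below uses several samples to demonstrate the data. The only line that needs attention is the "From" line.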

    From:         FirstName LastName <FirstName.MiddleName.LastName@someemail.com>
    In-Reply-To:  <fc7b93ca4dab.531f4e68@my.bcit.ca>
    -------------------------------------------------    
    From:         "FirstName. MiddleName =?iso-8859-1?b?TWFydO1uZXo=?= LastName"
                  <somemeail@something.otherthing.es> 
    Subject:      Re: Some Randome Data 
    In-Reply-To: <42043F8EC804DB48A3C4AF477195328F272CB9@exchange.something.local>
    -------------------------------------------------   
    From:         "FirstName MiddleName LastName" <LastName@someemail.com>
    Subject:      Some Randome Subject 
    -------------------------------------------------
    From:         "FirstName. MiddleName =?iso-8859-1?b?TWFydO1uZXo=?= LastName"
                  <somemeail@something.otherthing.es
                  > 
    Subject:      Re: Some Randome Data 
    In-Reply-To: <42043F8EC804DB48A3C4AF477195328F272CB9@exchange.something.local>
    -------------------------------------------------   
    From:         "FirstName. MiddleName =?iso-8859-1?b?TWFydO1uZXo=?= LastName"
                  <
                  somemeail@something.otherthing.es
                  > 
    Subject:      Re: Some Randome Data 
    In-Reply-To: <42043F8EC804DB48A3C4AF477195328F272CB9@exchange.something.local>

So far I have noticed that every header except "From" is consistent and always fits on a single line. "From" is the one giving me a hard time.

I am using the following regular expression in my C# code to extract "From".

match = Regex.Match(msg, @"(?<=From:)", RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

I also tried the following expression, but it messed up the other records.

match = Regex.Match(msg, @"(?<=From:).*.\s*.*\s*(>)", RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

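What I want to do is the following: find the line that starts with From: without capturing it, i.e. (?<=From:), and then keep going until ">" is reached, including everything in between such as spaces and line breaks.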

I am struggling to come up with this expression.

I have gone through regex-that-matches-a-newline-n-in-c-sharp and c-sharp-regex-match-any-text-between-tags-including-new-lines, but could not get either to work in my code.

Complete sample code

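    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    class Program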
        {
            static void Main(string[] args)
            {
                foreach (var demoText in TestData())
                {
                    var match = Regex.Match(demoText, @"(?<=From:).*", RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
                    if (match.Success)
                    {
                        string fromField = match.Value.Replace(System.Environment.NewLine, " ");

                        // Found From - extract the email address
                        match = Regex.Match(fromField, @"(?<=<)+[^<>]+(?=>)+", RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
                        Console.WriteLine("Email Address:" + match.Value);

                        // Extract the name
                        match = Regex.Match(fromField, @".*(?=<)", RegexOptions.Multiline | RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
                        Console.WriteLine("Name:" + match.Value);
                    }
                    else
                    {
                        Console.WriteLine("*** Match not found in data: " + demoText);
                    }
                }
                Console.WriteLine("All done, press any key to close.");
                Console.ReadLine();
            }

        static IEnumerable<string> TestData()
        {
            return @"
From:         FirstName LastName <FirstName.MiddleName.LastName@someemail.com>
In-Reply-To:  <fc7b93ca4dab.531f4e68@my.bcit.ca>ñ


From:         ""FirstName. MiddleName =?iso-8859-1?b?TWFydO1uZXo=?= LastName""
                <somemeail@something.otherthing.es> 
Subject:      Re: Some Randome Data 
In-Reply-To: <42043F8EC804DB48A3C4AF477195328F272CB9@exchange.something.local>ñ


From:         ""FirstName MiddleName LastName"" <LastName@someemail.com>
Subject:      Some Randome Subject ñ

From:         ""FirstName. MiddleName =?iso-8859-1?b?TWFydO1uZXo=?= LastName""
                <somemeail@something.otherthing.es
                > 
Subject:      Re: Some Randome Data 
In-Reply-To: <42043F8EC804DB48A3C4AF477195328F272CB9@exchange.something.local>ñ


From:         ""FirstName. MiddleName =?iso-8859-1?b?TWFydO1uZXo=?= LastName""
                <
                somemeail@something.otherthing.es
                > 
Subject:      Re: Some Randome Data 
In-Reply-To: <42043F8EC804DB48A3C4AF477195328F272CB9@exchange.something.local>
".Split('ñ').Select(item => item.Trim());
        }
    }

2 Answers:

Answer 0: (score: 3)

(?<=From:)((?:(?!>).)*)>

Try this. Don't forget to set the s (DOTALL) flag, which in .NET is RegexOptions.Singleline. See the demo.

http://regex101.com/r/kM7rT8/14
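
A minimal sketch of how this pattern might be applied from C#. The class name `FromFieldDemo`, the method name `ExtractFromField`, and the sample text are made up for illustration, and collapsing whitespace afterwards is an assumption about what the stored value should look like.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FromFieldDemo
{
    // Returns the text between "From:" and the first '>' (inclusive),
    // with line breaks and runs of whitespace collapsed to single spaces.
    // Returns null when no "From:" field is found.
    public static string ExtractFromField(string message)
    {
        // Singleline makes '.' match '\n' as well, so the match can span lines.
        Match match = Regex.Match(
            message,
            @"(?<=From:)((?:(?!>).)*)>",
            RegexOptions.Singleline);

        if (!match.Success)
        {
            return null;
        }

        return Regex.Replace(match.Value, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        // Hypothetical record with the address spread over several lines.
        string sample =
            "From:         \"FirstName LastName\"\n" +
            "              <\n" +
            "              someone@example.com\n" +
            "              >\n" +
            "Subject:      Re: Some Subject\n";

        Console.WriteLine(ExtractFromField(sample));
    }
}
```

Without `RegexOptions.Singleline` the `.` stops at the first line break, so the multi-line records would not match.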

Answer 1: (score: 2)

Assuming the name part cannot contain any angle brackets, you can use:

(?<=\bFrom:)[^>]+>

Note: apart from the case-insensitive option (if you need it), no specific options are required for this to work.
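
The reason no options are needed is that the negated character class `[^>]` matches line breaks as well. A rough sketch that pulls every "From" field out of an archive with this pattern; `FromLinesDemo`, `ExtractAllFromFields`, and the archive text are hypothetical names and data for illustration only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class FromLinesDemo
{
    // Collects every "From" field in the archive, one entry per record,
    // with internal whitespace collapsed to single spaces.
    public static List<string> ExtractAllFromFields(string archive)
    {
        var results = new List<string>();

        // [^>] also matches '\r' and '\n', so no regex options are passed.
        foreach (Match match in Regex.Matches(archive, @"(?<=\bFrom:)[^>]+>"))
        {
            results.Add(Regex.Replace(match.Value, @"\s+", " ").Trim());
        }

        return results;
    }

    public static void Main()
    {
        // Hypothetical archive with one single-line and one wrapped record.
        string archive =
            "From:         First Last <first@example.com>\n" +
            "In-Reply-To:  <abc@example.org>\n" +
            "-----\n" +
            "From:         \"Other Person\"\n" +
            "              <other@example.com\n" +
            "              >\n" +
            "Subject:      Hello\n";

        foreach (string field in ExtractAllFromFields(archive))
        {
            Console.WriteLine(field);
        }
    }
}
```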

If you want to extract the name and the email in one go, you can use:

\bFrom:\s*(?:"(?<name>[^"]+)"|(?<name>[^<]+?))\s+<\s*(?<email>[^>]+?)\s*>
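
A sketch of reading the two named groups from C#, assuming the pattern is used exactly as written. `SenderParser` and `TryParse` are illustrative names, not part of any library. Inside a verbatim string the double quotes of the pattern have to be doubled.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SenderParser
{
    // .NET allows the same group name in both alternatives; only the
    // alternative that actually matched contributes a capture.
    private static readonly Regex FromPattern = new Regex(
        @"\bFrom:\s*(?:""(?<name>[^""]+)""|(?<name>[^<]+?))\s+<\s*(?<email>[^>]+?)\s*>");

    // Returns true and fills name/email when a "From" field is found.
    public static bool TryParse(string message, out string name, out string email)
    {
        Match match = FromPattern.Match(message);
        name = match.Success ? match.Groups["name"].Value : null;
        email = match.Success ? match.Groups["email"].Value : null;
        return match.Success;
    }

    public static void Main()
    {
        // Hypothetical record with the address wrapped across lines.
        string sample =
            "From:         \"FirstName MiddleName LastName\"\n" +
            "              <\n" +
            "              LastName@someemail.com\n" +
            "              >\n" +
            "Subject:      Some Subject\n";

        if (TryParse(sample, out string name, out string email))
        {
            Console.WriteLine("Name: " + name);
            Console.WriteLine("Email Address: " + email);
        }
    }
}
```

Encoded words such as `=?iso-8859-1?b?...?=` inside the name are returned as-is; decoding them would be a separate step.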