我需要支持解析电子邮件正文中的xml,但在开头和结尾都有额外的文本。
我已经尝试过HTML敏捷包,但这并没有删除非xml文本。
那么如何清理字符串w / c包含与其周围的其他文本混合的整个xml文本?
var bodyXmlPart= @"Hi please see below client <?xml version=""1.0"" encoding=""UTF-8""?>" +
"<ac_application>" +
" <primary_applicant_data>" +
" <first_name>Ross</first_name>" +
" <middle_name></middle_name>" +
" <last_name>Geller</last_name>" +
" <ssn>123456789</ssn>" +
" </primary_applicant_data>" +
"</ac_application> thank you, \n john ";
//How do I clean up the body xml part before loading into xml
//This will fail:
var xDoc = XDocument.Parse(bodyXmlPart);
答案 0 :(得分:1)
我可能会做这样的事情......
using System.Diagnostics;
using System.Text.RegularExpressions;
namespace Test {
class Program {
static void Main(string[] args) {
var bodyXmlPart = @"Hi please see below client <?xml version=""1.0"" encoding=""UTF-8""?>" +
"<ac_application>" +
" <primary_applicant_data>" +
" <first_name>Ross</first_name>" +
" <middle_name></middle_name>" +
" <last_name>Geller</last_name>" +
" <ssn>123456789</ssn>" +
" </primary_applicant_data>" +
"</ac_application> thank you, \n john ";
Regex regex = new Regex(@"(?<pre>.*)(?<xml>\<\?xml.*</ac_application\>)(?<post>.*)", RegexOptions.Singleline);
var match = regex.Match(bodyXmlPart);
if (match.Success) {
Debug.WriteLine($"pre={match.Groups["pre"].Value}");
Debug.WriteLine($"xml={match.Groups["xml"].Value}");
Debug.WriteLine($"post={match.Groups["post"].Value}");
}
}
}
}
这输出......
pre=Hi please see below client
xml=<?xml version="1.0" encoding="UTF-8"?><ac_application> <primary_applicant_data> <first_name>Ross</first_name> <middle_name></middle_name> <last_name>Geller</last_name> <ssn>123456789</ssn> </primary_applicant_data></ac_application>
post= thank you,
john
答案 1 :(得分:1)
如果您的意思是该主体可以包含任何XML,而不仅仅是 ac_application 。您可以使用以下代码:
var bodyXmlPart = @"Hi please see below client " +
"<ac_application>" +
" <primary_applicant_data>" +
" <first_name>Ross</first_name>" +
" <middle_name></middle_name>" +
" <last_name>Geller</last_name>" +
" <ssn>123456789</ssn>" +
" </primary_applicant_data>" +
"</ac_application> thank you, \n john ";
StringBuilder pattern = new StringBuilder();
Regex regex = new Regex(@"<\?xml.*\?>", RegexOptions.Singleline);
var match = regex.Match(bodyXmlPart);
if (match.Success) // There is an xml declaration
{
pattern.Append(@"<\?xml.*");
}
Regex regexFirstTag = new Regex(@"\s*<(\w+:)?(\w+)>", RegexOptions.Singleline);
var match1 = regexFirstTag.Match(bodyXmlPart);
if (match1.Success) // xml has body and we got the first tag
{
pattern.Append(match1.Value.Trim().Replace(">",@"\>" + ".*"));
string firstTag = match1.Value.Trim();
Regex regexFullXmlBody = new Regex(pattern.ToString() + @"<\/" + firstTag.Trim('<','>') + @"\>", RegexOptions.None);
var matchBody = regexFullXmlBody.Match(bodyXmlPart);
if (matchBody.Success)
{
string xml = matchBody.Value;
}
}
此代码可以提取任何XML而不仅仅是ac_application。
假设是,正文将始终包含XML声明标记。 此代码将查找XML声明标记,然后在其后面找到第一个标记。第一个标记将被视为根标记以提取整个xml。