I have the following string:
"<a href=\"/formentries/formfile/13978\" target=\"_blank\">dog-00.jpg|image/jpeg</a> <a href='/FormEntries/Delete' class='btnDeleteAttachment' data-form-entry-id='366793' data-attachment-id='13978'> [remove]</a><br /><a href=\"/formentries/formfile/13979\" target=\"_blank\">dog-01.docx|application/vnd.openxmlformats-officedocument.wordprocessingml.document</a> <a href='/FormEntries/Delete' class='btnDeleteAttachment' data-form-entry-id='366793' data-attachment-id='13979'> [remove]</a><br /><a href=\"/formentries/formfile/13980\" target=\"_blank\">dog-02.png|image/png</a> <a href='/FormEntries/Delete' class='btnDeleteAttachment' data-form-entry-id='366793' data-attachment-id='13980'> [remove]</a>"
If you format it nicely, it looks something like this:
<a href=\"/formentries/formfile/13978\" target=\"_blank\">dog-00.jpg|image/jpeg</a>
<a href='/FormEntries/Delete' class='btnDeleteAttachment' data-form-entry-id='366793' data-attachment-id='13978'> [remove]</a>
<br />
<a href=\"/formentries/formfile/13979\" target=\"_blank\">dog-01.docx|application/vnd.openxmlformats-officedocument.wordprocessingml.document</a>
<a href='/FormEntries/Delete' class='btnDeleteAttachment' data-form-entry-id='366793' data-attachment-id='13979'> [remove]</a>
<br />
<a href=\"/formentries/formfile/13980\" target=\"_blank\">dog-02.png|image/png</a>
<a href='/FormEntries/Delete' class='btnDeleteAttachment' data-form-entry-id='366793' data-attachment-id='13980'> [remove]</a>
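So I have a bunch of anchor tags with line breaks between them. In the text of each anchor, I want to remove the pipe character and the file type:
dog-00.jpg|image/jpeg
becomes
dog-00.jpg
The regex should also work for any future file type, for example:
dog-01.docx|application/vnd.openxmlformats-officedocument.wordprocessingml.document
becomes
dog-01.docx
I still need the anchors intact, so after the file types are removed the text becomes: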
<a href=\"/formentries/formfile/13978\" target=\"_blank\">dog-00.jpg</a>
<a href='/FormEntries/Delete' class='btnDeleteAttachment' data-form-entry-id='366793' data-attachment-id='13978'> [remove]</a>
<br />
<a href=\"/formentries/formfile/13979\" target=\"_blank\">dog-01.docx</a>
<a href='/FormEntries/Delete' class='btnDeleteAttachment' data-form-entry-id='366793' data-attachment-id='13979'> [remove]</a>
<br />
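I'm not good with regex, and none of the combinations I tried would match.
Answer 0 (score: 1)
Don't use regex to parse complex HTML; you can use HtmlAgilityPack instead. I also used string methods such as Contains, IndexOf, and Remove in place of a regex: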
// assumes: using HtmlAgilityPack;
var doc = new HtmlDocument();
doc.LoadHtml(html); // pass in your HTML string
// SelectNodes returns null when nothing matches, so guard before iterating
var links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        string text = link.InnerText;
        if (text.Contains('|'))
            link.InnerHtml = text.Remove(text.IndexOf('|')); // you can't modify InnerText directly but this works
    }
}
string result = doc.DocumentNode.OuterHtml; // your desired result
Answer 1 (score: 0)
Input:
dog-00.jpg|image/jpeg
Regex that matches only the part before the pipe (|):
([^|]+)
Description:
Applied to this input, the regex above matches everything up to the first pipe character.
C# code:
// assumes: using System.Text.RegularExpressions;
var input = @"dog-00.jpg|image/jpeg";
var regex = new Regex(@"([^|]+)");
var m = regex.Match(input);
string name = null;
if (m.Success)
{
    name = m.Groups[1].Value; // "dog-00.jpg"
}
Edit:
If this is just a matter of splitting a string on the pipe character, then Dylan Nicholson's variant using input.Split (or .Substring + .IndexOf) is probably faster than a regex.
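That variant is not reproduced in this answer, so the following is only a minimal sketch of what the two string-method approaches might look like for a single anchor text. The class and method names (PipeSplitDemo, NameViaSplit, NameViaSubstring) are made up for illustration.

```csharp
using System;

public static class PipeSplitDemo
{
    // Variant 1: split on the pipe and keep the first piece.
    // Without a pipe, Split returns a single-element array, so the input comes back unchanged.
    public static string NameViaSplit(string input)
    {
        return input.Split('|')[0];
    }

    // Variant 2: IndexOf + Substring; returns the input unchanged when there is no pipe.
    public static string NameViaSubstring(string input)
    {
        int pipe = input.IndexOf('|');
        return pipe >= 0 ? input.Substring(0, pipe) : input;
    }

    public static void Main()
    {
        Console.WriteLine(NameViaSplit("dog-00.jpg|image/jpeg"));    // dog-00.jpg
        Console.WriteLine(NameViaSubstring("dog-02.png|image/png")); // dog-02.png
    }
}
```

Both helpers work on the anchor text only, so the text still has to be extracted from the HTML first.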
Edit 2:
Is a regex required? If not, try the following:
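// assumes: using System.Text;
public static string Clean(string input)
{
    var sb = new StringBuilder(input);
    int m1 = -1, m2 = -1; // m1: index of the latest '|', m2: index of the latest '<'
    for (var i = 0; i < sb.Length; i++)
    {
        if (sb[i] == '|')
            m1 = i;
        if (sb[i] == '<')
            m2 = i;
        // a '<' found after a '|': drop everything from the pipe up to (not including) the '<'
        if (m1 > -1 && m2 > -1 && m2 > m1)
        {
            sb.Remove(m1, m2 - m1);
            i = m1; // the '<' now sits at m1; the loop's i++ resumes right after it
            m1 = -1;
            m2 = -1;
        }
    }
    return sb.ToString();
}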
Answer 2 (score: 0)
Update:
You can use this regex:
(?<=<a[^>]*>[^|]+?)\|.*?(?=</a>)
For C#:
your_string = Regex.Replace(your_string, "(?<=<a[^>]*>[^|]+?)\\|.*?(?=</a>)", "",
RegexOptions.IgnoreCase | RegexOptions.Multiline);
Simply run a replace on the string with this regex.
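To see the replacement end to end, here is a small self-contained sketch that runs the same pattern over a shortened version of the string from the question. The class and method names (AttachmentLinkDemo, StripFileTypes) are made up for illustration, and the pattern relies on .NET's support for variable-length lookbehind.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AttachmentLinkDemo
{
    // Removes "|<file type>" from anchor texts; the pattern is the one given in this answer.
    public static string StripFileTypes(string html)
    {
        return Regex.Replace(
            html,
            @"(?<=<a[^>]*>[^|]+?)\|.*?(?=</a>)",
            "",
            RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        // Shortened sample based on the string in the question.
        string html =
            "<a href=\"/formentries/formfile/13978\" target=\"_blank\">dog-00.jpg|image/jpeg</a> " +
            "<a href='/FormEntries/Delete' class='btnDeleteAttachment'> [remove]</a><br />" +
            "<a href=\"/formentries/formfile/13980\" target=\"_blank\">dog-02.png|image/png</a>";

        // The anchors stay intact; only the pipe and file type are dropped from the link text.
        Console.WriteLine(StripFileTypes(html));
    }
}
```

Anchors without a pipe in their text, such as the [remove] links, are left untouched because the pattern only matches starting at a pipe character.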