How do I extract the src attribute of an HTML tag?

Time: 2015-02-16 12:22:42

Tags: c# asp.net .net

I have a string containing HTML:

<div class="ExternalClass6FC23FEAF7454B3A8006CF7E1D2257B8">
<audio src="/sites/audioblogs/Group2Doc/0.021950338035821915.wav"   controls="controls"></audio><br/><img   src="/sites/audioblogs/Group2Doc/20140103_152938.jpg" alt=""/></div>

I only need the source (src) attributes. I am currently trying Regex.Match.

Are there any other options?

Thanks, Sachin

2 answers:

Answer 0 (score: 2)

I would use HtmlAgilityPack to parse the HTML rather than a regular expression:

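// Requires the HtmlAgilityPack NuGet package and "using System.Linq;" for FirstOrDefault.
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);  // html is your string
// Find the first <audio> element that has a src attribute.
var audio = doc.DocumentNode.Descendants("audio")
    .FirstOrDefault(n => n.Attributes["src"] != null);
string src = null;
if (audio != null)
    src = audio.Attributes["src"].Value;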

Result: /sites/audioblogs/Group2Doc/0.021950338035821915.wav
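
The snippet above returns only the src of the first audio element, while the question asks for the src attributes in general. A minimal sketch of one way to collect them from every element, assuming the HtmlAgilityPack package is referenced; the class and method names (SrcCollector, GetAllSrc) are hypothetical and chosen only for illustration:

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class SrcCollector
{
    // Hypothetical helper: returns the src value of every node that has one,
    // in the order the nodes appear in the document.
    public static List<string> GetAllSrc(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode.Descendants()
            // The attribute indexer returns null when the attribute is absent.
            .Where(n => n.Attributes["src"] != null)
            .Select(n => n.Attributes["src"].Value)
            .ToList();
    }
}
```

For the string in the question, calling SrcCollector.GetAllSrc(html) should give the audio path followed by the image path.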

Answer 1 (score: 0)

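string yourFullHtmlstring = ".....";
// Normalize quoting: turn every double quote into a single quote.
// Note that this also alters any double quotes inside attribute values.
yourFullHtmlstring = yourFullHtmlstring.Replace("\"", "'");

// Split on the start of each src attribute.
string[] arr = yourFullHtmlstring.Split(new string[] { "src='" }, StringSplitOptions.None);

// Trim each piece down to the src value only.
// Start from 1 because arr[0] is the text before the first src; it is left unchanged and should be ignored.
for (int i = 1; i < arr.Length; i++)
{
    int end = arr[i].IndexOf('\'');
    // Guard against an unterminated attribute, where IndexOf returns -1 and Substring would throw.
    if (end >= 0)
        arr[i] = arr[i].Substring(0, end);
}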
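
For comparison, since the question mentions Regex.Match, here is a rough sketch of the same extraction with Regex.Matches, using only the .NET base class library. The names SrcExtractor and ExtractSrcValues are hypothetical. The pattern assumes quoted attribute values and does not handle unquoted values or src-like text inside comments or scripts, so an HTML parser is likely the more robust choice for arbitrary input.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SrcExtractor
{
    // Matches src="..." or src='...' preceded by whitespace, so names such as
    // data-src are skipped. The backreference \1 requires the closing quote
    // to be the same character as the opening quote.
    private static readonly Regex SrcPattern = new Regex(
        @"\ssrc\s*=\s*([""'])(.*?)\1",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Hypothetical helper: returns every quoted src value found in the input.
    public static List<string> ExtractSrcValues(string html)
    {
        var values = new List<string>();
        foreach (Match m in SrcPattern.Matches(html))
        {
            // Group 1 is the quote character, group 2 is the attribute value.
            values.Add(m.Groups[2].Value);
        }
        return values;
    }
}
```

Unlike the split-based loop, this leaves the original string untouched and returns only the attribute values, with no leading entry to skip.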