I am creating documents that contain an arbitrary number of characters (personae/voices) (see this), marked up like so:
<span class="sam" title="This is Sam speaking">
<span class="higbie" title="This is Calvin Higbie speaking">
<span class="ballou" title="This is Mr. Ballou speaking">
For some context, here is a fragment of one document:
<p><span class="others" title="This is 'an elderly pilgrim' speaking">"Jack, do you see that range of mountains over yonder that bounds the Jordan valley? The mountains of Moab, Jack! Think of it, my
boy--the actual mountains of Moab--renowned in Scripture history!
We are actually standing face to face with those illustrious crags
and peaks--and for all we know" [dropping his voice impressively],
"our eyes may be resting at this very moment upon the spot WHERE
LIES THE MYSTERIOUS GRAVE OF MOSES! Think of it, Jack!"</span></p>
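When a document is finished, I want to generate a distinct list of this markup pattern. In other words, I want to examine every piece of HTML that follows the pattern, but return only one instance for each distinct person/speaker. I don't want 400 of these: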
<span class="sam" title="This is Sam speaking">
...(just one).
In pseudo-SQL terms, what I want is:
SELECT DISTINCT SOMETHING FROM FILE WHERE SLIDING_WINDOW_OF_TEXT STARTSWITH("<span class=\"") AND SLIDING_WINDOW_OF_TEXT ENDSWITH(" speaking\">")
I don't know whether a regular expression is the best way to attack this, or whether there is something like a "LinqToText" or similar...
Answer 0 (score: 1)
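I suggest you look at the Html Agility Pack, which lets you query HTML. Here is an example: (Write query to parse HTML DOCUMENT with HtmlAgilityPack.)
You can also use LinqToXml to query the HTML elements as XML nodes, provided the markup is well-formed XML.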
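As a rough illustration of the LinqToXml route, here is a minimal sketch. It assumes the document is well-formed XML (every tag closed, no HTML-only entities such as &nbsp;, no default namespace); XDocument.Parse throws otherwise. The names SpeakerXml and DistinctSpeakers are hypothetical, chosen only for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class SpeakerXml
{
    // Returns one "class|title" string per distinct speaker span.
    // Assumption: the input is well-formed XML without a default namespace.
    public static List<string> DistinctSpeakers(string xhtml)
    {
        // Throws System.Xml.XmlException if the markup is not well-formed.
        XDocument doc = XDocument.Parse(xhtml);

        return doc.Descendants("span")
                  // Keep only spans that carry both attributes.
                  .Where(e => e.Attribute("class") != null && e.Attribute("title") != null)
                  .Select(e => (string)e.Attribute("class") + "|" + (string)e.Attribute("title"))
                  .Distinct()
                  .ToList();
    }
}

class Program
{
    static void Main()
    {
        var xhtml = @"<div>
  <p><span class=""sam"" title=""This is Sam speaking"">Hello.</span></p>
  <p><span class=""higbie"" title=""This is Calvin Higbie speaking"">Hi.</span></p>
  <p><span class=""sam"" title=""This is Sam speaking"">Goodbye.</span></p>
</div>";

        foreach (var speaker in SpeakerXml.DistinctSpeakers(xhtml))
            Console.WriteLine(speaker);
    }
}
```

For real-world HTML that is not valid XML, a tolerant parser such as the Html Agility Pack is likely the safer choice.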
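Answer 1 (score: 1)
This is not hard. You can use LINQ to get the Distinct() values. Add a reference and the directives using System.Linq; and using System.Text.RegularExpressions; (using System.Xml.Linq; is only needed if you take the LINQ to XML route). Here is a working sample (tested in VS2012):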
var MyRegex = new Regex(@"(?i)<span class=([""']).+?\1 title=([""']).+?\2>", RegexOptions.CultureInvariant | RegexOptions.Compiled);
var str = @"<p><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""higbie"" title=""This is Calvin Higbie speaking""><span class=""ballou"" title=""This is Mr. Ballou speaking""><span class=""ballou"" title=""This is Mr. Ballou speaking""><span class=""higbie"" title=""This is Calvin Higbie speaking""></p>";
var distinct_values = MyRegex.Matches(str).Cast<Match>().Select(p => p.Value).Distinct().ToList();
This returns 3 matches (not 8), one per distinct tag.
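If what you ultimately need is the list of speakers rather than the raw tags, a variation with named groups may be handy. The sketch below is not part of the tested sample above: the pattern, the group names cls and who, and the class SpeakerRegex are assumptions made for illustration. It also assumes double-quoted attributes, titles of the form "This is ... speaking", and that each class value identifies exactly one speaker.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SpeakerRegex
{
    // Hypothetical pattern: captures the class value and the speaker name from
    // <span class="..." title="This is ... speaking">.
    static readonly Regex SpeakerTag = new Regex(
        @"<span\s+class=""(?<cls>[^""]+)""\s+title=""This is (?<who>[^""]+?) speaking"">",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Maps each class value to the speaker name found in its first occurrence.
    public static Dictionary<string, string> Speakers(string html)
    {
        var result = new Dictionary<string, string>();
        foreach (Match m in SpeakerTag.Matches(html))
        {
            string cls = m.Groups["cls"].Value;
            if (!result.ContainsKey(cls))          // keep the first occurrence only
                result.Add(cls, m.Groups["who"].Value);
        }
        return result;
    }
}

class Program
{
    static void Main()
    {
        var html = @"<p><span class=""sam"" title=""This is Sam speaking"">Hello.</span>
<span class=""ballou"" title=""This is Mr. Ballou speaking"">Hi.</span>
<span class=""sam"" title=""This is Sam speaking"">Goodbye.</span></p>";

        foreach (var pair in SpeakerRegex.Speakers(html))
            Console.WriteLine(pair.Key + ": " + pair.Value);
    }
}
```

Note that this deduplicates on the class attribute alone, whereas the sample above deduplicates on the whole tag text; the two agree only if a class is never reused with a different title.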
If you cannot use LINQ (for example, in a restricted Mono environment), you can use the following code, which relies on List<string> from System.Collections.Generic:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        var MyRegex = new Regex(@"(?i)<span class=([""']).+?\1 title=([""']).+?\2>", RegexOptions.CultureInvariant | RegexOptions.Compiled);
        var str = @"<p><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""higbie"" title=""This is Calvin Higbie speaking""><span class=""ballou"" title=""This is Mr. Ballou speaking""><span class=""ballou"" title=""This is Mr. Ballou speaking""><span class=""higbie"" title=""This is Calvin Higbie speaking""></p>";

        // LINQ equivalent, kept for comparison:
        // var distinct_values = MyRegex.Matches(str).
        //     Cast<Match>().Select(p => p.Value).Distinct().ToList();

        // Collect each matched tag once, in order of first appearance.
        var new_arr = new List<string>();
        var matches = MyRegex.Matches(str);
        for (int i = 0; i < matches.Count; i++)
        {
            if (!new_arr.Contains(matches[i].Value))
                new_arr.Add(matches[i].Value);
        }

        Console.WriteLine(string.Join("\n", new_arr));
    }
}
Output:
<span class="others" title="This is 'an elderly pilgrim' speaking">
<span class="higbie" title="This is Calvin Higbie speaking">
<span class="ballou" title="This is Mr. Ballou speaking">
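With hundreds of matches, List<string>.Contains rescans the list on every check. One possible variation, sketched here under the assumption that HashSet<T> is available on your target framework, tracks the tags already seen in a set while a list preserves first-appearance order. The names DistinctTags and Find are hypothetical; the regex is the same one used above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class DistinctTags
{
    static readonly Regex TagRegex = new Regex(
        @"(?i)<span class=([""']).+?\1 title=([""']).+?\2>",
        RegexOptions.CultureInvariant);

    // Returns each matched tag once, in order of first appearance.
    public static List<string> Find(string html)
    {
        var seen = new HashSet<string>();
        var ordered = new List<string>();
        foreach (Match m in TagRegex.Matches(html))
        {
            // HashSet<T>.Add returns false if the value is already present.
            if (seen.Add(m.Value))
                ordered.Add(m.Value);
        }
        return ordered;
    }
}

class Program
{
    static void Main()
    {
        var html = @"<p><span class=""sam"" title=""This is Sam speaking"">Hello.</span>
<span class=""higbie"" title=""This is Calvin Higbie speaking"">Hi.</span>
<span class=""sam"" title=""This is Sam speaking"">Goodbye.</span></p>";

        Console.WriteLine(string.Join("\n", DistinctTags.Find(html)));
    }
}
```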