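How can I query a text file for distinct instances of a pattern?

Asked: 2015-04-01 22:01:58

Tags: c# regex

I am creating documents that contain any number of characters (personas/voices) (see this), marked up like so: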

<span class="sam" title="This is Sam speaking">
<span class="higbie" title="This is Calvin Higbie speaking">
<span class="ballou" title="This is Mr. Ballou speaking">

For some context, here is a fragment of a document:

  <p><span class="others" title="This is 'an elderly pilgrim' speaking">"Jack, do you see that range of mountains over yonder that bounds the Jordan valley?  The mountains of Moab, Jack!  Think of it, my
  boy--the actual mountains of Moab--renowned in Scripture history!
  We are actually standing face to face with those illustrious crags
  and peaks--and for all we know" [dropping his voice impressively],
  "our eyes may be resting at this very moment upon the spot WHERE
  LIES THE MYSTERIOUS GRAVE OF MOSES!  Think of it, Jack!"</span></p>

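When the document is finished, I want to generate a distinct list of this markup pattern. In other words, I want to examine every piece of HTML that follows the pattern, but get back only one instance for each distinct persona/speaker. I don't want 400 of these: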

<span class="sam" title="This is Sam speaking">

...(just one).

In pseudo-SQL terms, what I want is:

SELECT DISTINCT SOMETHING FROM FILE WHERE SLIDING_WINDOW_OF_TEXT STARTSWITH("<span class=\"") AND SLIDING_WINDOW_OF_TEXT ENDSWITH(" speaking\">")

I don't know whether a regular expression is the best way to attack this, or whether there is something like a "LinqToText" or something else...

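2 Answers:

Answer 0: (score: 1)

I suggest you take a look at the Html Agility Pack, which lets you query HTML. Here is an example: (Write query to parse HTML DOCUMENT with HtmlAgilityPack.)

You can also use LinqToXml to query the HTML elements as XML nodes, provided the markup is well-formed XML.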
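To make the LinqToXml suggestion concrete, here is a minimal sketch. It assumes the document parses as well-formed XML with a single root element, which real-world HTML often does not. The `SpeakerTags` class, the `DistinctSpeakers` method, and the sample markup are hypothetical names made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

static class SpeakerTags
{
    // Returns one "class | title" string per distinct speaker.
    // Assumption: the input is well-formed XML with a single root element.
    public static List<string> DistinctSpeakers(string xhtml)
    {
        return XElement.Parse(xhtml)
            .DescendantsAndSelf("span")
            // Only spans that carry both attributes are speaker markup.
            .Where(e => e.Attribute("class") != null && e.Attribute("title") != null)
            .Select(e => (string)e.Attribute("class") + " | " + (string)e.Attribute("title"))
            .Distinct()
            .ToList();
    }
}

class Program
{
    static void Main()
    {
        // Hypothetical sample, wrapped in a <div> so there is a single root.
        var xhtml = @"<div>
<p><span class=""sam"" title=""This is Sam speaking"">Hello.</span></p>
<p><span class=""higbie"" title=""This is Calvin Higbie speaking"">Hi.</span></p>
<p><span class=""sam"" title=""This is Sam speaking"">Hello again.</span></p>
</div>";

        // Three spans go in, two distinct speakers come out.
        foreach (var speaker in SpeakerTags.DistinctSpeakers(xhtml))
            Console.WriteLine(speaker);
    }
}
```

If the file is not valid XML (unclosed tags, bare ampersands), `XElement.Parse` throws an `XmlException`; in that case an HTML parser such as the Html Agility Pack, or a regex, is the more forgiving route.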

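Answer 1: (score: 1)

This is not hard. You can use LINQ to get the Distinct() values. Add the reference plus using System.Linq; and using System.Text.RegularExpressions; (using System.Xml.Linq; is only needed if you also query the markup as XML). Here is a working sample (tested in VS2012):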

// Matches an opening <span> tag with a quoted class followed by a quoted title.
// \1 and \2 are backreferences: each value must close with the quote character that opened it.
var MyRegex = new Regex(@"(?i)<span class=([""']).+?\1 title=([""']).+?\2>", RegexOptions.CultureInvariant | RegexOptions.Compiled);
// Sample input: 8 span tags, only 3 of them distinct.
var str = @"<p><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""higbie"" title=""This is Calvin Higbie speaking""><span class=""ballou"" title=""This is Mr. Ballou speaking""><span class=""ballou"" title=""This is Mr. Ballou speaking""><span class=""higbie"" title=""This is Calvin Higbie speaking""></p>";
// Take the text of every match, then drop the duplicates.
var distinct_values = MyRegex.Matches(str).Cast<Match>().Select(p => p.Value).Distinct().ToList();

This returns 3 matches (not 8).
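Since the question is about a file rather than an in-memory string, here is a hedged sketch of how the same regex idea might be applied to a file's contents. The named groups (`q1`, `cls`, `q2`, `title`), the `SpeakerScan` class, and the command-line path argument are assumptions added for illustration; they are not part of the sample above. This variant is distinct by class name only, so two tags with the same class but different titles collapse into one entry.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

static class SpeakerScan
{
    // Same shape as the pattern above, but with named groups so that the
    // class and the title can be read separately. \k<q1> and \k<q2> require
    // each value to end with the quote character that opened it.
    static readonly Regex SpanTag = new Regex(
        @"<span\s+class=(?<q1>[""'])(?<cls>.+?)\k<q1>\s+title=(?<q2>[""'])(?<title>.+?)\k<q2>\s*>",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Maps each distinct class name to the first title seen for it.
    public static Dictionary<string, string> DistinctSpeakers(string text)
    {
        var byClass = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (Match m in SpanTag.Matches(text))
        {
            var cls = m.Groups["cls"].Value;
            if (!byClass.ContainsKey(cls))
                byClass.Add(cls, m.Groups["title"].Value);
        }
        return byClass;
    }
}

class Program
{
    static void Main(string[] args)
    {
        // Hypothetical usage: pass the document path as the first argument;
        // without one, fall back to a tiny inline sample.
        string text = args.Length > 0
            ? File.ReadAllText(args[0])
            : @"<span class=""sam"" title=""This is Sam speaking""><span class=""sam"" title=""This is Sam speaking"">";

        foreach (var pair in SpeakerScan.DistinctSpeakers(text))
            Console.WriteLine(pair.Key + ": " + pair.Value);
    }
}
```

`File.ReadAllText` loads the whole file into memory, which is fine for documents of ordinary size; reading line by line would miss any tag that wraps across a line break.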

NO-LINQ SOLUTION

If you cannot use LINQ (for example, in Mono), you can use the following code instead, which relies on List<string> from System.Collections.Generic:

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Matches an opening <span> tag with a quoted class followed by a quoted title.
        var MyRegex = new Regex(@"(?i)<span class=([""']).+?\1 title=([""']).+?\2>", RegexOptions.CultureInvariant | RegexOptions.Compiled);
        var str = @"<p><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""others"" title=""This is 'an elderly pilgrim' speaking""><span class=""higbie"" title=""This is Calvin Higbie speaking""><span class=""ballou"" title=""This is Mr. Ballou speaking""><span class=""ballou"" title=""This is Mr. Ballou speaking""><span class=""higbie"" title=""This is Calvin Higbie speaking""></p>";

        // Manual replacement for Distinct(): add a match only if it is not in the list yet.
        var new_arr = new List<string>();
        var matches = MyRegex.Matches(str);
        for (int i = 0; i < matches.Count; i++)
        {
            if (!new_arr.Contains(matches[i].Value))
                new_arr.Add(matches[i].Value);
        }

        Console.WriteLine(string.Join("\n", new_arr));
    }
}

Output:

<span class="others" title="This is 'an elderly pilgrim' speaking">
<span class="higbie" title="This is Calvin Higbie speaking">
<span class="ballou" title="This is Mr. Ballou speaking">
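A side note on the loop above: `List<string>.Contains` rescans the list for every match, which is harmless for a handful of speakers but grows quadratically with the number of distinct values. A possible variation, sketched here under the assumption that `HashSet<T>` is available (it ships with .NET 3.5 and later), tracks seen values in a set and keeps a list only to preserve first-seen order. The `DistinctMatches` class and its `Find` method are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class DistinctMatches
{
    // Returns the distinct match values in the order they first appear.
    public static List<string> Find(Regex regex, string input)
    {
        var seen = new HashSet<string>();   // fast membership checks
        var ordered = new List<string>();   // preserves first-seen order
        foreach (Match m in regex.Matches(input))
        {
            // HashSet.Add returns false when the value is already present.
            if (seen.Add(m.Value))
                ordered.Add(m.Value);
        }
        return ordered;
    }
}

class Program
{
    static void Main()
    {
        var regex = new Regex(@"(?i)<span class=([""']).+?\1 title=([""']).+?\2>", RegexOptions.CultureInvariant);
        var str = @"<p><span class=""higbie"" title=""This is Calvin Higbie speaking""><span class=""ballou"" title=""This is Mr. Ballou speaking""><span class=""higbie"" title=""This is Calvin Higbie speaking""></p>";

        // Prints the higbie tag and the ballou tag, one line each.
        Console.WriteLine(string.Join("\n", DistinctMatches.Find(regex, str).ToArray()));
    }
}
```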