Highlighting words in regex matches

Date: 2018-10-29 18:31:02

Tags: c# regex

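I'm trying to use Regex to search a paragraph for some text. I want each result to include X words before and after the match, with highlighting added around every occurrence of the search text.

For example: consider the following paragraph. Each result should include at least 10 characters before and after the match, with no words cut off. The search term is "dog".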

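
The Dog is a pet animal. It is one of the most obedient animals. There are many kinds of dogs in the world. Some of them are very friendly while some of them are dangerous. Dogs are of different color like black, red, white and brown. Some of them have slippery shiny skin and some have rough skin. Dogs are carnivorous animals. They like eating meat. They have four legs, two ears and a tail. Dogs are trained to perform different tasks. They protect us from thieves by guarding our house. They are loving animals. A dog is called man's best friend. They are used by the police to find hidden things. They are one of the most useful animals in the world. Doggonit!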

The result I want is an array like the following:

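  • The Dog is a pet
  • many kinds of dogs in the world
  • dangerous. Dogs are of different
  • rough skin. Dogs are carnivorous
  • and a tail. Dogs are trained
  • animals. A dog is called
  • the world. Doggonit!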

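What I have:

I searched around and found the following regex, which returns exactly the results I need but adds no extra formatting. I created a couple of methods to keep each piece of functionality simple: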

private List<List<string>> Search(string text, string searchTerm, int sizeOfResult, bool searchEntireWord) {
    var result = new List<List<string>>();
    // Each space-separated word in searchTerm is searched for separately.
    var searchTerms = searchTerm.Split(' ');
    foreach (var word in searchTerms) {
        var searchResults = ExtractParagraph(text, word, sizeOfResult, searchEntireWord);
        result.Add(searchResults);
        if (searchResults.Count > 0) {
            foreach (var searchResult in searchResults) {
                Response.Write("<strong>Result:</strong> " + searchResult + "<br>");
            }
        }
    }
    return result;
}

private List<string> ExtractParagraph(string text, string searchTerm, int sizeOfResult, bool searchEntireWord) {
    var result = new List<string>();
    searchTerm = searchEntireWord ? @"\b" + searchTerm + @"\b" : searchTerm;
    //var expression = @"((^.{0,30}|\w*.{30})\b" + searchTerm + @"\b(.{30}\w*|.{0,30}$))";
    var expression = @"((^.{0," + sizeOfResult + @"}|\w*.{" + sizeOfResult + @"})" + searchTerm + @"(.{" + sizeOfResult + @"}\w*|.{0," + sizeOfResult + @"}$))";
    var wordMatch = new Regex(expression, RegexOptions.IgnoreCase | RegexOptions.Singleline);

    foreach (Match m in wordMatch.Matches(text)) {
        result.Add(m.Value);
    }
    return result;
}

I can call it like this:

var text = "The Dog is a pet animal. It is one of...";
var searchResults = Search(text, "dog", 10, false);
if (searchResults.Count > 0) {
    foreach (var searchResult in searchResults) {
        foreach (var result in searchResult) {
            Response.Write("<strong>Result:</strong> " + result + "<br>");
        }
    }
}

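I don't yet know what the result should be, or how to handle it, when the word occurs several times within 10 characters, e.g. if a sentence contains "A dog is a dog, of course!". I think I can sort that out later.

Tests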

var searchResults = Search(text, "dog", 0, false); // should include only the matched word
var searchResults = Search(text, "dog", 1, false); // should include the matched word and only one word preceding and following the matched word (if any)
var searchResults = Search(text, "dog", 10, false); // should include the matched word and up to 10 characters (but not cutting off words in the middle) preceding and following it (if any)
var searchResults = Search(text, "dog", 50, false); // should include the matched word and up to 50 characters (but not cutting off words in the middle) preceding and following it (if any)

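The problem:

The function I created can search for searchTerm either as a whole word or as part of a word.

When displaying the results, I currently run a simple Replace(word, "<strong>" + word + "</strong>") on each one. That works fine when I'm searching for part of a word. But when I'm searching for whole words and a result also contains searchTerm as part of another word, that part of the word gets highlighted too.

For example: if I'm searching for "dog" and the result is "All dogs go to dog heaven", the highlighting comes out as "All <strong>dog</strong>s go to <strong>dog</strong> heaven", but I want "All dogs go to <strong>dog</strong> heaven".

The question:

How can I get the matched word wrapped in HTML such as <strong>, or whatever else I want?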

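2 answers:

Answer 0: (score: 1)

Your solution should be able to do two things: 1) extract the matches, i.e. the keywords/phrases together with the left- and right-hand context around them, and 2) wrap the search terms in tags.

The extraction regex (with, for example, 10 characters of context on each side) is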

(?si)(?<!\S).{0,10}(?<!\S)\S*dog\S*(?!\S).{0,10}(?!\S)

See the regex demo.

Details

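  • (?si) - enables the Singleline and IgnoreCase modifiers (. matches any character, including newlines, and the pattern is case-insensitive)
  • (?<!\S) - a left-hand whitespace boundary
  • .{0,10} - 0 to 10 characters
  • (?<!\S) - a left-hand whitespace boundary
  • \S*dog\S* - dog with 0+ non-whitespace characters around it (note: if searchEntireWord is true, you need to remove \S* from this part of the pattern)
  • (?!\S) - a right-hand whitespace boundary
  • .{0,10} - 0 to 10 characters
  • (?!\S) - a right-hand whitespace boundary.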
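
To see how these parts work together, here is a minimal sketch that runs the pattern exactly as quoted above (context size fixed at 10) against a short sample sentence. The class and method names (ContextDemo, Extract) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ContextDemo
{
    // Returns every match of the extraction pattern quoted above.
    // The context size is hard-coded to 10 characters on each side.
    public static string[] Extract(string text)
    {
        const string pattern = @"(?si)(?<!\S).{0,10}(?<!\S)\S*dog\S*(?!\S).{0,10}(?!\S)";
        return Regex.Matches(text, pattern)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToArray();
    }

    public static void Main()
    {
        // The left context shrinks from 10 characters to "The " so that it ends
        // on a whitespace boundary; the right context shrinks to " is a pet".
        foreach (var snippet in Extract("The Dog is a pet animal."))
            Console.WriteLine(snippet); // prints: The Dog is a pet
    }
}
```

Because each .{0,10} is followed by a whitespace boundary, the engine backtracks to the longest context that does not end in the middle of a word, which is why "animal." is left out here.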

In C#, it would be defined as

var expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S)\S*{1}\S*(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm)); 
if (searchEntireWord) { 
    expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S){1}(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm)); 
} 

Note that {{ is a literal { in a format string, and }} is a literal }.

The second regex, which wraps the key term in strong tags, is much simpler:
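
As a small illustration of the brace escaping, the sketch below builds only the bounded quantifier part of the pattern. BraceEscapeDemo and BoundedAny are hypothetical names used just for this example.

```csharp
using System;

public static class BraceEscapeDemo
{
    // Builds a bounded quantifier such as ".{0,10}" from a numeric size.
    public static string BoundedAny(int size)
    {
        // "{{" and "}}" come out as literal braces; "{0}" is replaced by size.
        return string.Format(@".{{0,{0}}}", size);
    }

    public static void Main()
    {
        Console.WriteLine(BoundedAny(10)); // prints: .{0,10}
    }
}
```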

Regex.Replace(x.Value, 
            searchEntireWord ? 
                string.Format(@"(?i)(?<!\S){0}(?!\S)", Regex.Escape(searchTerm)) : 
                string.Format(@"(?i){0}", Regex.Escape(searchTerm)), 
            "<strong>$&</strong>")

Note that $& in the replacement pattern refers to the whole match value.

C# code:
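
For instance, applied to the sentence from the question, the wrapping step could look like the sketch below. WrapDemo and Wrap are hypothetical names, and the patterns are the two from the snippet above, built by concatenation instead of string.Format.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WrapDemo
{
    // Wraps every occurrence of searchTerm in <strong> tags.
    // With searchEntireWord set, only whitespace-delimited occurrences are wrapped.
    public static string Wrap(string input, string searchTerm, bool searchEntireWord)
    {
        var term = Regex.Escape(searchTerm);
        var pattern = searchEntireWord
            ? @"(?i)(?<!\S)" + term + @"(?!\S)"
            : "(?i)" + term;
        // $& inserts the text that was matched, so the original casing is kept.
        return Regex.Replace(input, pattern, "<strong>$&</strong>");
    }

    public static void Main()
    {
        var input = "All dogs go to dog heaven";
        Console.WriteLine(Wrap(input, "dog", true));
        // prints: All dogs go to <strong>dog</strong> heaven
        Console.WriteLine(Wrap(input, "dog", false));
        // prints: All <strong>dog</strong>s go to <strong>dog</strong> heaven
    }
}
```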

public static List<string> ExtractTexts(string text, string searchTerm, int sizeOfResult, bool searchEntireWord) 
{
    var expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S)\S*{1}\S*(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm)); 
    if (searchEntireWord) { 
        expression = string.Format(@"(?si)(?<!\S).{{0,{0}}}(?<!\S){1}(?!\S).{{0,{0}}}(?!\S)", sizeOfResult, Regex.Escape(searchTerm)); 
    } 
    return Regex.Matches(text, expression) 
        .Cast<Match>() 
        .Select(x => Regex.Replace(x.Value, 
            searchEntireWord ? 
                string.Format(@"(?i)(?<!\S){0}(?!\S)", Regex.Escape(searchTerm)) : 
                string.Format(@"(?i){0}", Regex.Escape(searchTerm)), 
            "<strong>$&</strong>"))
        .ToList();
}

Sample usage (see demo)

var text = "The Dog is a real-pet animal. There's an undogging dog that only undogs non-dogs. It is one of the most obedient animals. There are many kinds of dogs in the world. Some of the are very friendly while some of them a dangerous. Dogs are of different color like black, red, white and brown. Some old them have slippery shiny skin and some have rough skin. Dogs are carnivorous animals. They like eating meat. They have four legs, two ears and a tail. Dogs are trained to perform different tasks. They protect us from thieves b) guarding our house. They are loving animals. A dog is called man's best friend. They are used by the police to find hidden things. They are one of the most useful animals in the world. Doggonit!";
var searchTerm = "dog";
var searchEntireWord = false;
Console.WriteLine("======= 10 ========");
var results = ExtractTexts(text, searchTerm, 10, searchEntireWord);
foreach (var result in results)
    Console.WriteLine(result);

Output:

======= 10 ========
The <strong>Dog</strong> is a
an un<strong>dog</strong>ging <strong>dog</strong> that
only un<strong>dog</strong>s non-<strong>dog</strong>s.
kinds of <strong>dog</strong>s in the
<strong>Dog</strong>s are of
skin. <strong>Dog</strong>s are
a tail. <strong>Dog</strong>s are
A <strong>dog</strong> is called
world. <strong>Dog</strong>gonit!

Another example:

Console.WriteLine("======= 15 ========");
results = ExtractTexts(text, searchTerm, 15, searchEntireWord);
foreach (var result in results)
    Console.WriteLine(result);

Output:

======= 15 ========
The <strong>Dog</strong> is a real-pet
There's an un<strong>dog</strong>ging <strong>dog</strong> that only
un<strong>dog</strong>s non-<strong>dog</strong>s. It is one of
many kinds of <strong>dog</strong>s in the world.
a dangerous. <strong>Dog</strong>s are of
rough skin. <strong>Dog</strong>s are
and a tail. <strong>Dog</strong>s are trained to
animals. A <strong>dog</strong> is called
in the world. <strong>Dog</strong>gonit!

Answer 1: (score: 0)

A simple solution using Regex.Replace:

public bool HighlightExactMatchOnly(string input, string textToHighlight, string expected)
{
    // given
    var escapedHighlight = Regex.Escape(textToHighlight);

    // when
    var result = Regex.Replace(input, @"\b" + escapedHighlight + @"\b", "<strong>$0</strong>");

    return expected == result;
}

Test:

var text = "My test dogs with a single dog and some text behind";
var expected = "My test dogs with a single <strong>dog</strong> and some text behind";
HighlightExactMatchOnly(text , "dog", expected);

Note that this is not the fastest solution.
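
One behavioral difference worth checking before choosing between the two answers: \b treats punctuation such as a hyphen as a word boundary, while the (?<!\S) and (?!\S) lookarounds from the other answer only accept whitespace or the start/end of the string. The sketch below compares the two on a made-up sentence; BoundaryDemo and its method names are illustrative only.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryDemo
{
    // Word-boundary variant, as in this answer (case-sensitive).
    public static string WithWordBoundary(string input, string term)
    {
        return Regex.Replace(input, @"\b" + Regex.Escape(term) + @"\b", "<strong>$0</strong>");
    }

    // Whitespace-boundary variant, as in the other answer (kept case-sensitive here).
    public static string WithWhitespaceBoundary(string input, string term)
    {
        return Regex.Replace(input, @"(?<!\S)" + Regex.Escape(term) + @"(?!\S)", "<strong>$0</strong>");
    }

    public static void Main()
    {
        var input = "A dog chased a non-dog.";
        Console.WriteLine(WithWordBoundary(input, "dog"));
        // prints: A <strong>dog</strong> chased a non-<strong>dog</strong>.
        Console.WriteLine(WithWhitespaceBoundary(input, "dog"));
        // prints: A <strong>dog</strong> chased a non-dog.
    }
}
```

Which behavior is preferable depends on whether hyphenated or punctuated occurrences should count as whole-word matches.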