Extract a paragraph using a search string

Time: 2011-04-01 08:00:04

Tags: c# asp.net search-engine

I am using the following code to extract the part of a paragraph that matches a search string.

int charBeforeAndAfter = 100;
string matchParagraphs = string.Empty;
Regex wordMatch = new Regex(@"\b" + word + @"\b", RegexOptions.IgnoreCase);
foreach (string paragraph in text.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
{
    int startIdx = -1;
    int length = -1;
    foreach (Match match in wordMatch.Matches(paragraph))
    {
        int wordIdx = match.Index;
        if (wordIdx >= startIdx && wordIdx <= startIdx + length)
            continue;
        startIdx = wordIdx > charBeforeAndAfter ? wordIdx - charBeforeAndAfter : 0;
        length = wordIdx + match.Length + charBeforeAndAfter < paragraph.Length ? match.Length + charBeforeAndAfter : paragraph.Length - startIdx;
        string extract = wordMatch.Replace(paragraph.Substring(startIdx, length), "<b>" + match.Value + "</b>");
        matchParagraphs = "..." + extract + "...";
        return matchParagraphs;
    }
}

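I get the correct result, but the extract starts and ends with broken words, for example "...region use and Boolean connectors to specify the region so narrow...".

How can I avoid those broken words? Please help me.

Thanks in advance.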

2 answers:

Answer 0: (score: 4)

You could try something like this:

using System;
using System.Text.RegularExpressions;

static class Program {

    static void Main(params string[] args) {

        string text = @"Lorem ipsum dolor sit amet, consectetur adipisicing 
elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim 
ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquid ex ea 
commodo consequat.";

        ExtractParagraph(text, "magna");
        ExtractParagraph(text, "ipsum");
        ExtractParagraph(text, "ut");

    }

    static void ExtractParagraph(string text, string word) {
        Console.WriteLine("Matches for: {0}", word);
        string expression = @"((^.{0,30}|\w*.{30})\b" + word + @"\b(.{30}\w*|.{0,30}$))";
        Regex wordMatch = new Regex(expression, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        foreach (Match m in wordMatch.Matches(text)) {
            Console.WriteLine("  {0}", m.Value);
        }
    }

}

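The basic idea is to match some extra content around the word: .{30}\bword\b.{30}, and then add some "word characters" at both ends so that words are not cut in half: \w*.{30}\bword\b.{30}\w*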
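To see what the extra \w* changes, here is a small sketch that runs both forms of the pattern on a sample sentence. It uses a context of 10 characters instead of 30 to keep the output short; the sample text and the names ContextDemo, FirstMatch and Run are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class ContextDemo
{
    // Returns the first match of `pattern` in `text`, or "" if there is none.
    public static string FirstMatch(string text, string pattern)
    {
        return Regex.Match(text, pattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline).Value;
    }

    public static void Run()
    {
        string text = "The quick brown fox jumps over the lazy dog";

        // Fixed-width context: exactly 10 characters on each side,
        // cut wherever that happens to fall.
        Console.WriteLine(FirstMatch(text, @".{10}\bfox\b.{10}"));
        // prints: ick brown fox jumps ove

        // \w* at both ends widens the match to the nearest word boundaries.
        Console.WriteLine(FirstMatch(text, @"\w*.{10}\bfox\b.{10}\w*"));
        // prints: quick brown fox jumps over
    }
}
```

The regex engine returns the leftmost match, so the leading \w* ends up covering the start of the word that the fixed-width part would otherwise have split.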

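The ^.{0,30} and .{0,30}$ parts are there so that the expression still matches when there are fewer than 30 characters before or after the word, that is, near the beginning or the end of the text.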

As usual with regular expressions, this is unlikely to win any readability contest, but it seems to work...
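If you also want the "..." and <b> highlighting from the question, the same pattern can be wrapped in a helper along these lines. This is only a sketch under a few assumptions: SnippetExtractor, Extract and the context parameter are hypothetical names, the search term is escaped with Regex.Escape so it is treated literally, and the output is not HTML-encoded.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class SnippetExtractor
{
    // Hypothetical helper: returns one highlighted snippet per match of `word`,
    // keeping roughly `context` characters on each side without splitting words.
    public static List<string> Extract(string text, string word, int context)
    {
        string w = Regex.Escape(word); // treat the search term as literal text
        string pattern =
            @"(^.{0," + context + @"}|\w*.{" + context + @"})\b" + w +
            @"\b(.{" + context + @"}\w*|.{0," + context + @"}$)";
        var options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
        var highlight = new Regex(@"\b" + w + @"\b", RegexOptions.IgnoreCase);

        var snippets = new List<string>();
        foreach (Match m in Regex.Matches(text, pattern, options))
        {
            // "$0" re-inserts the matched text itself, so each occurrence
            // keeps its original casing.
            snippets.Add("..." + highlight.Replace(m.Value, "<b>$0</b>") + "...");
        }
        return snippets;
    }
}
```

For example, Extract("The quick brown fox jumps over the lazy dog", "fox", 10) returns a single snippet, "...quick brown <b>fox</b> jumps over...". Note that \b only behaves as intended when the search term starts and ends with a word character, so a term such as "c++" would need different handling.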

Answer 1: (score: 0)

I had a similar problem. I solved it with this:

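// JavaScript. The function wrapper, its name, and the value assigned to `out`
// are assumptions: the original snippet left the output as a placeholder.
function truncateAtSpace(text) {
    var length = 50;
    // Fallback: hard cut at 50 characters (the whole text if it is shorter).
    var out = text.substring(0, length);
    while (text.substring(0, length).length == length)
    {
        if (text.substring(0, length).endsWith(" "))
        {
            // The prefix ends at a space, so no word is split.
            out = text.substring(0, length);
            break;
        }
        else
        {
            length--;
            if (length < 10) break; // give up rather than cut too short
        }
    }
    return out;
}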

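It's not the best solution, but it works well enough for my needs. Basically it just walks backwards through the text from 50 characters and outputs the longest piece that is at most 50 characters long and ends with a space.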
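Since the question is about C#, the same idea can also be sketched there without the loop by using String.LastIndexOf. This is a variation rather than a line-by-line port: it drops the trailing space and has no lower limit of 10, and the names Truncation and TruncateAtWord are hypothetical.

```csharp
using System;

static class Truncation
{
    // Hypothetical helper: cut `text` to at most `maxLength` characters,
    // backing up to the last space so the final word stays whole.
    // Assumes maxLength is not negative.
    public static string TruncateAtWord(string text, int maxLength)
    {
        if (text.Length <= maxLength)
            return text; // nothing to cut

        // Search backwards starting at index maxLength. Checking that index
        // too keeps a word that ends exactly at the limit.
        int cut = text.LastIndexOf(' ', maxLength);
        if (cut <= 0)
            return text.Substring(0, maxLength); // no usable space: hard cut

        return text.Substring(0, cut);
    }
}
```

For example, TruncateAtWord("The quick brown fox jumps", 12) returns "The quick" instead of cutting "brown" in half.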