Getting dtSearch to highlight one hit per phrase, rather than one hit per word in the phrase

Date: 2010-04-26 20:01:34

Tags: dtsearch hit-highlighting

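I'm using dtSearch to highlight text search matches within a document. The code that does this, minus some details and cleanup, is roughly along these lines: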

    SearchJob sj = new SearchJob();
    sj.Request = "\"audit trail\""; // the user query
    sj.FoldersToSearch.Add(path_to_src_document);
    sj.Execute();
    // Convert the first search result to HTML with hit markup
    FileConverter fileConverter = new FileConverter();
    fileConverter.SetInputItem(sj.Results, 0);
    fileConverter.BeforeHit = "<a name=\"HH_%%ThisHit%%\"/><b>";
    fileConverter.AfterHit = "</b>";
    fileConverter.Execute();
    string myHighlightedDoc = fileConverter.OutputString;

If I give dtSearch a quoted phrase query, like

    "audit trail"

then dtSearch will do hit highlighting like this:

    An <a name="HH_0"/><b>audit</b> <a name="HH_1"/><b>trail</b> is a fun thing to have an <a name="HH_2"/><b>audit</b> <a name="HH_last"/><b>trail</b> about!

Note that each word of the phrase is highlighted separately. Instead, I would like phrases to be highlighted as whole units, like this:

    An <a name="HH_0"/><b>audit trail</b> is a fun thing to have an <a name="HH_last"/><b>audit trail</b> about!

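This would A) make the highlighting look better, B) improve the behavior of my JavaScript that helps users navigate from hit to hit, and C) give a more accurate count of the total number of hits.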

Is there a good way to make dtSearch highlight phrases this way?

1 answer:

Answer 0 (score: 2)

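Note: I think the text and code here could use some more work. If people want to help revise the answer or the code, this can probably become community wiki.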

I asked dtSearch about this (2010-04-26). Their response had two parts:

First, it is not possible to get the desired highlighting behavior merely by, say, changing a flag.

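Second, it is possible to get some lower-level hit information in which phrase matches are treated as wholes. In particular, if you set both the dtsSearchWantHitsByWord and the dtsSearchWantHitsArray flags in your SearchJob, then your search results will be annotated with the word offsets at which each word or phrase in your query matches. For example, if your input document is

    An audit trail is a fun thing to have an audit trail about!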

and your query is

    "audit trail"

then (in the .NET API) sj.Results.CurrentItem.HitsByWord[0] will contain a string like this:

    audit trail (2 11)

indicating that the phrase "audit trail" is found starting at the 2nd word and at the 11th word of the document.
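To make the offset convention concrete, here is a small sketch that looks up the words at the reported offsets in the sample sentence. It assumes the offsets are 1-based word positions (which is what the "2nd word" reading suggests) and it approximates dtSearch's word breaking with a plain whitespace split, which will not agree with dtSearch on every document. OffsetDemo and PhraseAt are illustrative names and not part of the dtSearch API.

```csharp
using System;

public static class OffsetDemo
{
    // Returns the phrase of lengthInWords words that starts at the 1-based
    // word position given by offset.
    // Assumption: words are separated by whitespace only, which merely
    // approximates dtSearch's internal word breaker.
    public static string PhraseAt(string text, int offset, int lengthInWords)
    {
        // A null separator array makes Split break on any whitespace.
        string[] words = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words, offset - 1, lengthInWords);
    }
}
```

With the sample sentence, offsets 2 and 11 and a phrase length of 2 both yield "audit trail", which matches the two phrase matches described above.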

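One thing you can do with this information is build a "skip list" indicating which of the dtSearch highlights are insignificant (i.e. which ones are phrase continuations, rather than the start of a word or phrase). For example, if your skip list were [4, 7, 9], that might mean that the 4th, 7th, and 9th hits are insignificant, while the other hits are legitimate. A skip list of this sort can be used in at least two ways: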

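  1. You can change the code that navigates from hit to hit so that it skips over hit number i if skipList.contains(i).
  2. Depending on your requirements, you may also be able to rewrite the HTML generated by the dtSearch FileConverter. In my case I have dtSearch annotate hits with something like <a name="HH_1"/><span class="highlight">hitword</span>, and I use the A tags (and the fact that they are numbered sequentially: HH_1, HH_2, HH_3, etc.) as the basis of hit navigation. So what I have tried, with some success, is to walk the HTML and strip out all the A tags where the i in HH_i appears in the skip list. Depending on your hit navigation code, you would probably have to renumber the A tags so that there are no gaps between, say, HH_1 and HH_3.

Assuming these skip lists are indeed useful, how would you generate them? Well, here is some code that mostly works: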

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text;
    using System.Text.RegularExpressions;
    using NUnit.Framework;
    
    public class DtSearchUtil
    {
        /// <summary>
        /// Makes a "skip list" for the dtSearch result document with the specified
        /// WordArray data. The skip list indicates which hits in the dtSearch markup
        /// should be skipped during hit navigation. The reason to skip some hits
        /// is to allow navigation to be phrase aware, rather than forcing the user
        /// to visit each word in the phrase as if it were an independent hit.
        /// The skip list consists of 0-indexed hit numbers. 1, for example, would
        /// mean that the second hit should be skipped during hit navigation.
        /// </summary>
        /// <param name="dtsHitsByWordArray">dtSearch HitsByWord data. You'll get this from SearchResultItem.HitsByWord
        /// if you did your search with the dtsSearchWantHitsByWord and dtsSearchWantHitsArray
        /// SearchFlags.</param>
        /// <param name="userHitCount">How many total hits there are, if phrases are counted
        /// as one hit each.</param>
        /// <returns>The 0-indexed numbers of the hits to skip, in ascending order.</returns>
        public static List<int> MakeHitSkipList(string[] dtsHitsByWordArray, out int userHitCount)
        {
            List<int> skipList = new List<int>();
            userHitCount = 0;
    
            int curHitNum = 0; // like the dtSearch doc-level highlights, this counts hits word-by-word, rather than phrase by phrase
            List<PhraseRecord> hitRecords = new List<PhraseRecord>();
            foreach (string dtsHitsByWordString in dtsHitsByWordArray)
            {
                hitRecords.Add(PhraseRecord.ParseHitsByWordString(dtsHitsByWordString));
            }
            int prevEndOffset = -1;
    
            while (true)
            {
                int nextOffset = int.MaxValue;
                foreach (PhraseRecord rec in hitRecords)
                {
                    if (rec.CurOffset >= rec.OffsetList.Count)
                        continue;
    
                    nextOffset = Math.Min(nextOffset, rec.OffsetList[rec.CurOffset]);
                }
                if (nextOffset == int.MaxValue)
                    break;
    
                userHitCount++;
    
                PhraseRecord longestMatch = null;
                for (int i = 0; i < hitRecords.Count; i++)
                {
                    PhraseRecord rec = hitRecords[i];
                    if (rec.CurOffset >= rec.OffsetList.Count)
                        continue;
                    if (nextOffset == rec.OffsetList[rec.CurOffset])
                    {
                        if (longestMatch == null ||
                            longestMatch.LengthInWords < rec.LengthInWords)
                        {
                            longestMatch = rec;
                        }
                    }
                }
    
                // skip subsequent words in the phrase
                for (int i = 1; i < longestMatch.LengthInWords; i++)
                {
                    skipList.Add(curHitNum + i);
                }
    
                prevEndOffset = longestMatch.OffsetList[longestMatch.CurOffset] +
                    (longestMatch.LengthInWords - 1);
    
                longestMatch.CurOffset++;
    
                curHitNum += longestMatch.LengthInWords;
    
                // skip over any unneeded, overlapping matches (i.e. at the same offset)
                for (int i = 0; i < hitRecords.Count; i++)
                {
                    while (hitRecords[i].CurOffset < hitRecords[i].OffsetList.Count &&
                        hitRecords[i].OffsetList[hitRecords[i].CurOffset] <= prevEndOffset)
                    {
                        hitRecords[i].CurOffset++;
                    }
                }
            }
    
            return skipList;
        }
    
        // Parsed form of the phrase-aware hit offset stuff that dtSearch can give you 
        private class PhraseRecord
        {
            public string PhraseText;
    
            /// <summary>
            /// Offsets into the source text at which this phrase matches. For example,
            /// offset 300 would mean that one of the places the phrase matches is
            /// starting at the 300th word in the document. (Words are counted according
            /// to dtSearch's internal word breaking algorithm.)
            /// See also:
            /// http://support.dtsearch.com/webhelp/dtSearchNetApi2/frames.html?frmname=topic&frmfile=dtSearch__Engine__SearchFlags.html
            /// </summary>
            public List<int> OffsetList;
    
            // BUG: We calculate this with a whitespace tokenizer. This will probably
            // cause bad results in some places. (Better to figure out how to count
            // the way dtSearch would.)
            public int LengthInWords
            {
                get
                {
                    return Regex.Matches(PhraseText, @"[^\s]+").Count;
                }
            }
    
            public int CurOffset = 0;
    
            public static PhraseRecord ParseHitsByWordString(string dtsHitsByWordString)
            {
                Match m = Regex.Match(dtsHitsByWordString, @"^([^,]*),\s*\d*\s*\(([^)]*)\).*");
                if (!m.Success)
                    throw new ArgumentException("Bad dtsHitsByWordString. Did you forget to set the dtsSearchWantHitsByWord flag in dtSearch?");
    
                string phraseText = m.Groups[1].Value;
                string parenStuff = m.Groups[2].Value;
    
                PhraseRecord hitRecord = new PhraseRecord();
                hitRecord.PhraseText = phraseText;
                hitRecord.OffsetList = GetMatchOffsetsFromParenGroupString(parenStuff);
                return hitRecord;
            }
    
            static List<int> GetMatchOffsetsFromParenGroupString(string parenGroupString)
            {
                List<int> res = new List<int>();
                MatchCollection matchCollection = Regex.Matches(parenGroupString, @"\d+");
                foreach (Match match in matchCollection)
                {
                    string digitString = match.Groups[0].Value;
                    res.Add(int.Parse(digitString));
                }
                return res;
            }
        }
    }
    
    
    [TestFixture]
    public class DtSearchUtilTests
    {
        [Test]
        public void TestMultiPhrasesWithoutFieldName()
        {
            string[] foo = { @"apple pie, 7 (482 499 552 578 589 683 706 );",
                @"bana*, 4 (490 505 689 713 )"
                };
    
            // expected dtSearch hit order:
            // 0: apple@482
            // 1: pie@483 [should skip]
            // 2: banana-something@490
            // 3: apple@499
            // 4: pie@500 [should skip]
            // 5: banana-something@505
            // 6: apple@552
            // 7: pie@553 [should skip]
            // 8: apple@578
            // 9: pie@579 [should skip]
            // 10: apple@589
            // 11: pie@590 [should skip]
            // 12: apple@683
            // 13: pie@684 [skip]
            // 14: banana-something@689
            // 15: apple@706
            // 16: pie@707 [skip]
            // 17: banana-something@713
    
            int userHitCount;
            List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);
    
            Assert.AreEqual(11, userHitCount);
    
            Assert.AreEqual(1, skipList[0]);
            Assert.AreEqual(4, skipList[1]);
            Assert.AreEqual(7, skipList[2]);
            Assert.AreEqual(9, skipList[3]);
            Assert.AreEqual(11, skipList[4]);
            Assert.AreEqual(13, skipList[5]);
            Assert.AreEqual(16, skipList[6]);
            Assert.AreEqual(7, skipList.Count);
        }
    
        [Test]
        public void TestPhraseOverlap1()
        {
            string[] foo = { @"apple pie, 7 (482 499 552 );",
                @"apple, 4 (482 490 499 552)"
                };
    
            // expected dtSearch hit order:
            // 0: apple@482
            // 1: pie@483 [should skip]
            // 2: apple@490
            // 3: apple@499
            // 4: pie@500 [should skip]
            // 5: apple@552
            // 6: pie@553 [should skip]
    
            int userHitCount;
            List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);
    
            Assert.AreEqual(4, userHitCount);
    
            Assert.AreEqual(1, skipList[0]);
            Assert.AreEqual(4, skipList[1]);
            Assert.AreEqual(6, skipList[2]);
            Assert.AreEqual(3, skipList.Count);
        }
    
        [Test]
        public void TestPhraseOverlap2()
        {
            string[] foo = { @"apple pie, 7 (482 499 552 );",
                @"pie, 4 (483 490 500 553)"
                };
    
            // expected dtSearch hit order:
            // 0: apple@482
            // 1: pie@483 [should skip]
            // 2: pie@490
            // 3: apple@499
            // 4: pie@500 [should skip]
            // 5: apple@552
            // 6: pie@553 [should skip]
    
            int userHitCount;
            List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);
    
            Assert.AreEqual(4, userHitCount);
    
            Assert.AreEqual(1, skipList[0]);
            Assert.AreEqual(4, skipList[1]);
            Assert.AreEqual(6, skipList[2]);
            Assert.AreEqual(3, skipList.Count);
        }
    
        // TODO: test "apple pie" and "apple", plus "apple pie" and "pie"
    
        // "subject" should not freak it out
        [Test]
        public void TestSinglePhraseWithFieldName()
        {
            string[] foo = { @"apple pie, 7 (482 499 552 578 589 683 706 ), subject" };
    
            int userHitCount;
            List<int> skipList = DtSearchUtil.MakeHitSkipList(foo, out userHitCount);
    
            Assert.AreEqual(7, userHitCount);
    
            Assert.AreEqual(7, skipList.Count);
            Assert.AreEqual(1, skipList[0]);
            Assert.AreEqual(3, skipList[1]);
            Assert.AreEqual(5, skipList[2]);
            Assert.AreEqual(7, skipList[3]);
            Assert.AreEqual(9, skipList[4]);
            Assert.AreEqual(11, skipList[5]);
            Assert.AreEqual(13, skipList[6]);
        }
    }
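Once a skip list exists, the two uses described earlier (skipping hits during navigation, and stripping and renumbering the A tags) might be sketched as follows. This is only an illustration under stated assumptions: SkipListDemo, NextHit and RewriteAnchors are made-up names, hit numbers are taken to be 0-based as produced by MakeHitSkipList, and the anchors are assumed to look exactly like <a name="HH_i"/> with a numeric i. An anchor such as HH_last would not match the pattern and would need separate handling.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class SkipListDemo
{
    // Use 1: returns the next hit number after current that is not in the
    // skip list, or -1 if no such hit remains. Hit numbers are 0-based.
    public static int NextHit(int current, int totalHits, ICollection<int> skipList)
    {
        for (int i = current + 1; i < totalHits; i++)
        {
            if (!skipList.Contains(i))
                return i;
        }
        return -1;
    }

    // Use 2: removes every <a name="HH_i"/> anchor whose i is in the skip
    // list and renumbers the surviving anchors consecutively from 0, so the
    // navigation code sees no gaps.
    // Assumption: anchors have exactly this form; other markup is untouched.
    public static string RewriteAnchors(string html, ICollection<int> skipList)
    {
        int next = 0; // next number to hand out to a surviving anchor
        return Regex.Replace(html, "<a name=\"HH_(\\d+)\"/>", m =>
        {
            int hit = int.Parse(m.Groups[1].Value);
            if (skipList.Contains(hit))
                return string.Empty; // phrase continuation: drop the anchor
            return "<a name=\"HH_" + (next++) + "\"/>";
        });
    }
}
```

Only the anchors are touched here. The highlight tags around each word stay in place, so the words of a phrase are still highlighted, but navigation lands once per phrase.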