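Removing noise words in SQL Server 2005 full-text search

Time: 2009-01-22 15:14:27

Tags: sql-server-2005 full-text-search

In a very typical scenario, I have a "Search" text box in my web application whose user input is passed directly to a stored procedure. The procedure then uses full-text indexing to search two fields in two tables, which are joined on the appropriate keys.

I am using the CONTAINS predicate to search the fields. Before passing in the search string, I do the following: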

SET @ftQuery = '"' + REPLACE(@query,' ', '*" OR "') + '*"'

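This changes the castle to "the*" OR "castle*", for example. This is necessary because I want people to be able to search for cas and get results for castle.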

WHERE CONTAINS(Building.Name, @ftQuery) OR CONTAINS(Road.Name, @ftQuery)
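To make the transformation concrete, here is a minimal Python sketch that mirrors the REPLACE expression above. The function name build_ft_query is hypothetical; the sketch only illustrates the string handling and does not talk to SQL Server.

```python
def build_ft_query(query: str) -> str:
    """Mirror the T-SQL: '"' + REPLACE(@query, ' ', '*" OR "') + '*"'."""
    return '"' + query.replace(' ', '*" OR "') + '*"'


if __name__ == "__main__":
    print(build_ft_query("the castle"))  # "the*" OR "castle*"
    print(build_ft_query("cas"))         # "cas*"
```

Note that every word, including a short common one, ends up as a prefix term with a trailing wildcard.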

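The problem is that, now that I append a wildcard to the end of each word, a noise word (e.g. the) also gets a wildcard and therefore no longer seems to be dropped. This means a search for the castle returns items containing words such as theatre.

Changing OR to AND was my first thought, but that seems to simply return no matches if a noise word is used in the query.

All I am trying to achieve is to let the user enter multiple space-separated words, in any order, each of which is either a whole word or a prefix of a word to search for, and to drop noise words such as the from their input. Otherwise, when they search for the castle, they get a big list of items with the result they need somewhere in the middle of the list.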
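One way to read this requirement is sketched below in Python: drop noise words first, then turn each remaining word into a prefix term and combine the terms with a single operator, so that word order does not matter. The names NOISE_WORDS and build_prefix_query are hypothetical, the noise set is a tiny illustrative sample rather than the real noise file, and whether AND or OR is the right operator depends on the matching behaviour you want.

```python
from typing import Optional

# Tiny illustrative sample; a real list would come from the server's noise word file.
NOISE_WORDS = {"the", "a", "an", "of", "and"}


def build_prefix_query(user_input: str, operator: str = "AND") -> Optional[str]:
    """Return a CONTAINS search condition, or None if nothing searchable remains."""
    # split() with no argument also swallows repeated spaces.
    words = [w for w in user_input.split() if w.lower() not in NOISE_WORDS]
    # Strip double quotes so a word cannot break out of its quoted term.
    terms = ['"' + w.replace('"', '') + '*"' for w in words]
    # Discard terms that became empty after quote stripping.
    terms = [t for t in terms if t != '"*"']
    if not terms:
        return None
    return (" " + operator + " ").join(terms)
```

With this sketch, the input the castle yields "castle*", and an input consisting only of noise words yields None, which the caller would have to handle before running the query.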

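I could go ahead and implement my own noise word removal procedure, but it seems like something full-text indexing should be able to handle.

Any help appreciated!

Jamie

5 Answers:

Answer 0: (Score: 5)

Noise words are removed before the index is stored, so it is not possible to write a query that searches for a stop word. If you really want to enable this behaviour, you need to edit the stop word list (http://msdn.microsoft.com/en-us/library/ms142551.aspx) and then rebuild the index.

Answer 1: (Score: 1)

I had the same problem and, after a thorough search, came to the conclusion that there is no good solution.

As a compromise, I am implementing a brute-force solution:

1) Open C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData\noiseENU.txt and copy all of the text in it.

2) Paste it into a code file in your application and replace the line breaks with ", " to get a List initializer like this: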

public static List<string> _noiseWords = new List<string>{ "about", "1", "after", "2", "all", "also", "3", "an", "4", "and", "5", "another", "6", "any", "7", "are", "8", "as", "9", "at", "0", "be", "$", "because", "been", "before", "being", "between", "both", "but", "by", "came", "can", "come", "could", "did", "do", "does", "each", "else", "for", "from", "get", "got", "has", "had", "he", "have", "her", "here", "him", "himself", "his", "how", "if", "in", "into", "is", "it", "its", "just", "like", "make", "many", "me", "might", "more", "most", "much", "must", "my", "never", "no", "now", "of", "on", "only", "or", "other", "our", "out", "over", "re", "said", "same", "see", "should", "since", "so", "some", "still", "such", "take", "than", "that", "the", "their", "them", "then", "there", "these", "they", "this", "those", "through", "to", "too", "under", "up", "use", "very", "want", "was", "way", "we", "well", "were", "what", "when", "where", "which", "while", "who", "will", "with", "would", "you", "your", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z" };

3) Before submitting the search string, break it into words and remove any word that is in the noise word list, like this:

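List<string> goodWords = new List<string>();
// Split on spaces, skipping the empty entries that repeated spaces would produce.
string[] words = searchString.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string word in words)
{
   // The noise list is all lowercase, so lowercase the word before the lookup.
   if (!_noiseWords.Contains(word.ToLowerInvariant()))
      goodWords.Add(word);
}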

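Not an ideal solution, but it should work as long as the noise word file doesn't change. Multi-language support would use a dictionary of lists keyed by language.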
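The multi-language idea can be sketched as a dictionary of noise word sets keyed by language. The following Python sketch is only an assumption about how that might look: the language keys, the shortened word samples, and the function name remove_noise_words are all hypothetical, and a real version would load each set from the matching noise file.

```python
from typing import Dict, List, Set

# Hypothetical, heavily shortened samples keyed by language code.
NOISE_BY_LANGUAGE: Dict[str, Set[str]] = {
    "en": {"the", "and", "of", "a"},
    "fr": {"le", "la", "les", "et", "de"},
}


def remove_noise_words(search: str, language: str = "en") -> List[str]:
    """Split on whitespace and keep the words not in that language's noise set."""
    # An unknown language falls back to an empty set, so nothing is filtered.
    noise = NOISE_BY_LANGUAGE.get(language, set())
    return [w for w in search.split() if w.lower() not in noise]
```

Sets are used instead of lists here because membership tests on a set do not scan every entry.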

Answer 2: (Score: 1)

Here is a working function. The file noiseENU.txt is copied as-is from \Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData.

    Public Function StripNoiseWords(ByVal s As String) As String
        Dim NoiseWords As String = ReadFile("/Standard/Core/Config/noiseENU.txt").Trim
        ' Escape each entry so that symbols in the file (such as $) are matched literally
        Dim Escaped As String() = Array.ConvertAll(Of String, String)(Regex.Split(NoiseWords, "\s+"), AddressOf Regex.Escape)
        Dim NoiseWordsRegex As String = String.Join("|", Escaped) ' about|after|all|also etc.
        NoiseWordsRegex = String.Format("\s?\b(?:{0})\b\s?", NoiseWordsRegex)
        Dim Result As String = Regex.Replace(s, NoiseWordsRegex, " ", RegexOptions.IgnoreCase) ' replace each noise word with a space
        Result = Regex.Replace(Result, "\s+", " ").Trim ' collapse repeated spaces and trim the ends
        Return Result
    End Function

Answer 3: (Score: 1)

You can also remove the noise words before running the query. List of language IDs: http://msdn.microsoft.com/en-us/library/ms190303.aspx

Dim queryTextWithoutNoise As String = removeNoiseWords(queryText,ConnectionString,1033)

Public Function removeNoiseWords(ByVal inputText As String, _
                                 ByVal cnStr As String, _
                                 ByVal languageID As Integer) As String

    Dim r As New System.Text.StringBuilder
    Try
        If inputText.Contains(CChar("""")) Then
            r.Append(inputText)
        Else
            Using cn As New SqlConnection(cnStr)

                Const q As String = "SELECT display_term,special_term FROM sys.dm_fts_parser(@q,@l,0,0)"
                cn.Open()
                Dim cmd As New SqlCommand(q, cn)
                With cmd.Parameters
                    .Add(New SqlParameter("@q", """" & inputText & """"))
                    .Add(New SqlParameter("@l", languageID))
                End With
                Dim dr As SqlDataReader = cmd.ExecuteReader
                While dr.Read
                    If Not (dr.Item("special_term").ToString.Contains("Noise")) Then
                        r.Append(dr.Item("display_term").ToString)
                        r.Append(" ")
                    End If
                End While
            End Using
        End If
    Catch ex As Exception
        ' ...        
    End Try
    Return r.ToString

End Function
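The core of the function above is the filter over the rows returned by sys.dm_fts_parser: keep display_term unless special_term marks the row as a noise word. Note that sys.dm_fts_parser is available from SQL Server 2008 onward, so this approach assumes a newer server than the 2005 instance in the question. A minimal Python sketch of just the filtering step follows; the function name and the sample rows are hypothetical stand-ins for a real result set, and no database call is made.

```python
from typing import Iterable, List, Tuple


def keep_non_noise(rows: Iterable[Tuple[str, str]]) -> str:
    """rows are (display_term, special_term) pairs; return the kept terms joined by spaces."""
    kept: List[str] = [term for term, special in rows if "Noise" not in special]
    return " ".join(kept)


if __name__ == "__main__":
    # Hypothetical rows shaped like parser output for the input "the castle".
    sample = [("the", "Noise Word"), ("castle", "Exact Match")]
    print(keep_non_noise(sample))  # castle
```

Unlike the StringBuilder loop above, joining the kept terms leaves no trailing space, which matters if the result is later split on spaces to build a CONTAINS query.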

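Answer 4: (Score: 0)

Similar to my approach.

Although I want to use full-text indexing for its stemming, its speed, multi-word searching and so on, I am actually only indexing a couple of nvarchar(100) fields in two tables. Each table will easily stay below 50,000 rows.

My solution was to remove all the noise words from the text file and let the indexer build an index that includes every word. It still contains only a few thousand entries.

I then replace the spaces in the search string as described in the original post, so that CONTAINS works for multiple words and treats each word individually.

It seems to be working very well, but I'll keep a close eye on performance.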