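In a pretty typical scenario, I have a "Search" text box on my web application whose user input is passed directly to a stored procedure. The procedure then uses full-text indexing to search two fields in two tables, which are joined on the appropriate keys.
I am using the CONTAINS predicate to search the fields. Before passing the search string on, I do the following: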
SET @ftQuery = '"' + REPLACE(@query,' ', '*" OR "') + '*"'
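For example, this changes the castle to "the*" OR "castle*". This is necessary because I want people to be able to search for cas and get results for castle.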
WHERE CONTAINS(Building.Name, @ftQuery) OR CONTAINS(Road.Name, @ftQuery)
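To make the transformation concrete, here is a minimal Python sketch that mirrors the T-SQL REPLACE expression above. The function name build_ft_query is made up for illustration; it only reproduces the string manipulation and does not talk to SQL Server.

```python
def build_ft_query(query: str) -> str:
    """Mirror the T-SQL: '"' + REPLACE(@query, ' ', '*" OR "') + '*"'."""
    # Every space becomes the end of one quoted prefix term and the start of the next
    return '"' + query.replace(" ", '*" OR "') + '*"'


print(build_ft_query("the castle"))  # "the*" OR "castle*"
print(build_ft_query("cas"))         # "cas*"
```

Note that two consecutive spaces in the input produce an empty prefix term ("*") with this approach, which is probably not intended, so trimming and collapsing whitespace first seems advisable.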
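The problem is that now that I have appended a wildcard to the end of each word, noise words (e.g. the) get a wildcard appended too and therefore no longer appear to be dropped. This means that a search for the castle returns items containing words such as theatre.
Changing OR to AND was my first thought, but that appears to simply return no matches whenever a noise word is used in the query.
All I am trying to achieve is to let the user enter multiple space-separated words, in any order, each representing either the whole of or a prefix of a word they are searching for, and to drop noise words such as the from their input. (Otherwise, when they search for the castle, they get a big list of items with the result they need somewhere in the middle of the list.)
I could go ahead and implement my own noise word removal procedure, but it seems like something that full-text indexing ought to be able to handle.
Grateful for any help!
Jamie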
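Answer 0 (score: 5)
Noise words are stripped out before the index is stored, so it is impossible to write a query that searches on a stop word. If you really want to enable this behavior, you need to edit the stop word list (http://msdn.microsoft.com/en-us/library/ms142551.aspx) and then rebuild your index.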
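Answer 1 (score: 1)
I had the same question, and after a thorough search I came to the conclusion that there is no good solution.
As a compromise, I am implementing the brute-force solution:
1) Open C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData\noiseENU.txt and copy all the text in it.
2) Paste it into a code file in your application, replacing line breaks with "," to get a List initializer like this: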
public static List<string> _noiseWords = new List<string>{ "about", "1", "after", "2", "all", "also", "3", "an", "4", "and", "5", "another", "6", "any", "7", "are", "8", "as", "9", "at", "0", "be", "$", "because", "been", "before", "being", "between", "both", "but", "by", "came", "can", "come", "could", "did", "do", "does", "each", "else", "for", "from", "get", "got", "has", "had", "he", "have", "her", "here", "him", "himself", "his", "how", "if", "in", "into", "is", "it", "its", "just", "like", "make", "many", "me", "might", "more", "most", "much", "must", "my", "never", "no", "now", "of", "on", "only", "or", "other", "our", "out", "over", "re", "said", "same", "see", "should", "since", "so", "some", "still", "such", "take", "than", "that", "the", "their", "them", "then", "there", "these", "they", "this", "those", "through", "to", "too", "under", "up", "use", "very", "want", "was", "way", "we", "well", "were", "what", "when", "where", "which", "while", "who", "will", "with", "would", "you", "your", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z" };
3) Before you submit your search string, break it into words and remove any word that appears in the noise word list, like this:
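// Requires: using System; using System.Collections.Generic; using System.Linq;
List<string> goodWords = new List<string>();
// RemoveEmptyEntries skips the empty strings produced by consecutive spaces
string[] words = searchString.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
foreach (string word in words)
{
    // Compare case-insensitively so "The" is dropped as well as "the"
    if (!_noiseWords.Contains(word, StringComparer.OrdinalIgnoreCase))
        goodWords.Add(word);
}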
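It is not an ideal solution, but it should work as long as the noise word file does not change. Multiple-language support would use a dictionary of lists keyed by language.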
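The steps above, together with the per-language dictionary idea, could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: NOISE_WORDS holds a tiny hypothetical sample rather than the full contents of noiseENU.txt, the "en" key is an arbitrary choice, and build_prefix_query is a made-up name.

```python
# Hypothetical, abbreviated sample; in practice load the full noise file per language.
NOISE_WORDS = {
    "en": {"the", "a", "an", "and", "of", "to"},
}


def build_prefix_query(search: str, language: str = "en", operator: str = "OR") -> str:
    """Drop noise words, then turn each remaining word into a quoted prefix term."""
    noise = NOISE_WORDS.get(language, set())
    # split() with no argument collapses runs of whitespace, so no empty words appear.
    # Double quotes are stripped because they would break out of the quoted term.
    cleaned = [w.replace('"', "") for w in search.split()]
    terms = [w for w in cleaned if w and w.lower() not in noise]
    return f" {operator} ".join(f'"{w}*"' for w in terms)


print(build_prefix_query("the castle"))                     # "castle*"
print(build_prefix_query("The old castle", operator="AND"))  # "old*" AND "castle*"
```

If the input consists only of noise words, the function returns an empty string, so the caller would need to handle that case instead of passing it to CONTAINS.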
Answer 2 (score: 1)
Here is a working function. The file noiseENU.txt is copied as-is from \Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\FTData.
Public Function StripNoiseWords(ByVal s As String) As String
Dim NoiseWords As String = ReadFile("/Standard/Core/Config/noiseENU.txt").Trim
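Dim NoiseWordsRegex As String = String.Join("|", Regex.Split(NoiseWords, "\s+").Select(Function(w) Regex.Escape(w)).ToArray()) ' about|after|all|also etc.; each entry is escaped so that $ is matched literally (needs Imports System.Linq)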
NoiseWordsRegex = String.Format("\s?\b(?:{0})\b\s?", NoiseWordsRegex)
Dim Result As String = Regex.Replace(s, NoiseWordsRegex, " ", RegexOptions.IgnoreCase) ' replace each noise word with a space
Result = Regex.Replace(Result, "\s+", " ") ' eliminate any multiple spaces
Return Result
End Function
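For comparison, a regex-based strip can be sketched in Python as below. It is not a translation of the VB function: it swaps the \b word boundaries for whitespace lookarounds, so that non-word entries in the noise file such as "$" can also match as whole tokens, and it escapes each entry. The name strip_noise_words and the sample list are assumptions for illustration.

```python
import re


def strip_noise_words(text, noise_words):
    """Remove whole-token occurrences of noise words, ignoring case."""
    if not noise_words:
        return re.sub(r"\s+", " ", text).strip()
    # Escape each entry so characters such as "$" are matched literally
    alternation = "|".join(re.escape(w) for w in noise_words)
    # A token must be delimited by whitespace or the string edges on both sides
    pattern = re.compile(rf"(?<!\S)(?:{alternation})(?!\S)", re.IGNORECASE)
    result = pattern.sub(" ", text)
    # Collapse the gaps left behind and trim the ends
    return re.sub(r"\s+", " ", result).strip()


print(strip_noise_words("The castle of the king", ["the", "of", "$"]))  # castle king
print(strip_noise_words("theatre", ["the"]))                            # theatre
```

Because tokens are delimited by whitespace only, a noise word directly followed by punctuation (for example "the,") is left in place by this sketch.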
Answer 3 (score: 1)
You could also remove the noise words before making the query. List of language IDs: http://msdn.microsoft.com/en-us/library/ms190303.aspx
Dim queryTextWithoutNoise As String = removeNoiseWords(queryText,ConnectionString,1033)
Public Function removeNoiseWords(ByVal inputText As String, ByVal cnStr As String, ByVal languageID As Integer) As String
Dim r As New System.Text.StringBuilder
Try
If inputText.Contains(CChar("""")) Then
r.Append(inputText)
Else
Using cn As New SqlConnection(cnStr)
Const q As String = "SELECT display_term,special_term FROM sys.dm_fts_parser(@q,@l,0,0)"
cn.Open()
Dim cmd As New SqlCommand(q, cn)
With cmd.Parameters
.Add(New SqlParameter("@q", """" & inputText & """"))
.Add(New SqlParameter("@l", languageID))
End With
Dim dr As SqlDataReader = cmd.ExecuteReader
While dr.Read
If Not (dr.Item("special_term").ToString.Contains("Noise")) Then
r.Append(dr.Item("display_term").ToString)
r.Append(" ")
End If
End While
End Using
End If
Catch ex As Exception
' ...
End Try
Return r.ToString
End Function
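Answer 4 (score: 0)
Similar to my approach.
While I want to use full-text indexing for its ability to perform stemming, its speed, multi-word searching and so on, I am actually only indexing a couple of nvarchar(100) fields in two tables. Each table can easily remain below 50,000 rows.
My solution was to remove all the noise words from the noise word text files and let the indexer build an index that includes all words. It still only consists of a few thousand entries.
I then perform a replace on the spaces in the search string, as described in my original post, to get CONTAINS to work for multiple words and to treat each word individually.
It seems to work well, but I will keep an eye on performance.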