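I recently found out about n-grams and the possibility of using them to compare phrase frequencies in a body of text. Now I'm trying to write a VB.NET application that simply takes a body of text and returns a list of the most frequently used phrases (where n >= 2).
I found a C# example of how to generate n-grams from a body of text, so I started converting the code to VB. The problem is that this code creates one gram per character instead of one per word. The delimiters I want to use for words are: vbCrLf (new line), vbTab (tab) and the following characters: !@#$%^&*()_+-={}|\:"'?¿/.,<>’¡º×÷‘;«»[]
Does anyone have an idea how to rewrite the following function for this purpose: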
Friend Shared Function GenerateNGrams(ByVal text As String, ByVal gramLength As Integer) As String()
    If text Is Nothing OrElse text.Length = 0 Then
        Return Nothing
    End If
    Dim grams As New ArrayList()
    Dim length As Integer = text.Length
    If length < gramLength Then
        Dim gram As String
        For i As Integer = 1 To length
            gram = text.Substring(0, (i) - (0))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)
            End If
        Next
        gram = text.Substring(length - 1, (length) - (length - 1))
        If grams.IndexOf(gram) = -1 Then
            grams.Add(gram)
        End If
    Else
        For i As Integer = 1 To gramLength - 1
            Dim gram As String = text.Substring(0, (i) - (0))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)
            End If
        Next
        For i As Integer = 0 To (length - gramLength)
            Dim gram As String = text.Substring(i, (i + gramLength) - (i))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)
            End If
        Next
        For i As Integer = (length - gramLength) + 1 To length - 1
            Dim gram As String = text.Substring(i, (length) - (i))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)
            End If
        Next
    End If
    Return Tokeniser.ArrayListToArray(grams)
End Function
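Answer 0 (score: 2)
An n-gram of words is just a list of length n that stores those words. A list of n-grams is then simply a list of lists of words. If you want to store frequencies, you need a dictionary that is indexed by these n-grams. For your special case of 2-grams, you can imagine something like this: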
Dim frequencies As New Dictionary(Of String(), Integer)(New ArrayComparer(Of String)())
Const separators as String = "!@#$%^&*()_+-={}|\:""'?¿/.,<>’¡º×÷‘;«»[] " & _
    ControlChars.CrLf & ControlChars.Tab
Dim words = text.Split(separators.ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
For i As Integer = 0 To words.Length - 2
    Dim ngram = New String() { words(i), words(i + 1) }
    Dim oldValue As Integer = 0
    frequencies.TryGetValue(ngram, oldValue)
    frequencies(ngram) = oldValue + 1
Next
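frequencies should now contain a dictionary with every pair of consecutive words in the text, together with the number of times each pair appears (as a consecutive pair).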
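The same idea extends from pairs to any n. As a minimal sketch (in Python rather than VB.NET, for brevity), the following counts word n-grams of a given length. The names `tokenize` and `count_ngrams` are invented for this illustration, and treating all whitespace as a separator is an assumption. Python tuples compare and hash by value, so they can be used as dictionary keys directly, which is the role the custom comparer plays in the VB code.

```python
from collections import Counter

# Separator characters taken from the question, plus whitespace (an assumption).
SEPARATORS = set("!@#$%^&*()_+-={}|\\:\"'?¿/.,<>’¡º×÷‘;«»[] \r\n\t")

def tokenize(text):
    """Split text into words, dropping separators and empty entries."""
    words, current = [], []
    for ch in text:
        if ch in SEPARATORS:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words

def count_ngrams(text, n=2):
    """Count how often each run of n consecutive words occurs."""
    words = tokenize(text)
    # If the text has fewer than n words, the range is empty and so is the result.
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
```

For example, `count_ngrams("to be, or not to be")[("to", "be")]` evaluates to 2, because the pair occurs at the start and at the end of the text.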
This code requires the ArrayComparer class:
Public Class ArrayComparer(Of T)
    Implements IEqualityComparer(Of T())

    ' Comparer used for the individual array elements.
    Private ReadOnly comparer As IEqualityComparer(Of T)

    Public Sub New()
        Me.New(EqualityComparer(Of T).Default)
    End Sub

    Public Sub New(ByVal comparer As IEqualityComparer(Of T))
        Me.comparer = comparer
    End Sub

    ' Two arrays are equal if they have the same length and equal elements.
    Public Overloads Function Equals(ByVal a As T(), ByVal b As T()) As Boolean _
            Implements IEqualityComparer(Of T()).Equals
        If a.Length <> b.Length Then Return False
        For i As Integer = 0 to a.Length - 1
            If Not comparer.Equals(a(i), b(i)) Then Return False
        Next
        Return True
    End Function

    ' Combine the element hash codes so that equal arrays hash alike.
    Public Overloads Function GetHashCode(ByVal arr As T()) As Integer _
            Implements IEqualityComparer(Of T()).GetHashCode
        Dim hashCode As Integer = 17
        For Each obj As T In arr
            hashCode = ((hashCode << 5) - 1) Xor comparer.GetHashCode(obj)
        Next
        Return hashCode
    End Function
End Class
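Unfortunately, this code doesn't compile on Mono, because the VB compiler has problems finding the generic EqualityComparer class. I therefore can't test whether the GetHashCode implementation works as expected, but it should be fine.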
Answer 1 (score: 0)
Thank you very much, Konrad, for this start of a solution!
I tried your code and got the following result:
Text = "Hello I am a test Also I am a test"
(I also included whitespace as a separator)
frequencies now has 9 items:
---------------------
Keys: "Hello", "I"
Value: 1
---------------------
Keys: "I", "am"
Value: 1
---------------------
Keys: "am", "a"
Value: 1
---------------------
Keys: "a", "test"
Value: 1
---------------------
Keys: "test", "Also"
Value: 1
---------------------
Keys: "Also", "I"
Value: 1
---------------------
Keys: "I", "am"
Value: 1
---------------------
Keys: "am", "a"
Value: 1
---------------------
Keys: "a", "test"
Value: 1
---------------------
My first question: shouldn't the last three key pairs get a value of 2, since they are found twice in the text?
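One possible explanation, offered as an assumption because the exact code that produced this output is not shown: in .NET, arrays used as dictionary keys are compared by reference unless a value-based comparer is passed to the Dictionary constructor, so every freshly created pair counts as a new key. The Python sketch below (not VB.NET) imitates that behaviour with a hypothetical `WordPair` class whose equality is identity-based, and contrasts it with tuples, which compare by value.

```python
from collections import Counter

class WordPair:
    """Key type with default, identity-based equality (stand-in for a .NET array)."""
    def __init__(self, first, second):
        self.first, self.second = first, second

words = "Hello I am a test Also I am a test".split()
pairs = list(zip(words, words[1:]))  # the 9 consecutive word pairs

by_identity = Counter(WordPair(a, b) for a, b in pairs)
by_value = Counter(pairs)  # tuples compare and hash by value

print(len(by_identity), max(by_identity.values()))  # 9 1
print(len(by_value), by_value[("I", "am")])         # 6 2
```

With identity-based keys the result has nine entries that each occur once, matching the output above; with value-based keys it has six entries, and the three repeated pairs are counted twice.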
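Second: the reason I got into the n-gram approach is that I don't want to limit the word count (n) to a specific length. Is there a way to build a dynamic approach that tries to find the longest phrase match first and then steps down to a word count of 2?
My target result for the sample query above is: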
---------------------
Match: "I am a test"
Frequency: 2
---------------------
Match: "I am a"
Frequency: 2
---------------------
Match: "am a test"
Frequency: 2
---------------------
Match: "I am"
Frequency: 2
---------------------
Match: "am a"
Frequency: 2
---------------------
Match: "a test"
Frequency: 2
---------------------
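Hatem Mostafa wrote a similar approach in C++ on codeproject.com: N-gram and Fast Pattern Extraction Algorithm
Sadly, I'm no C++ expert and have no idea how to convert that code, since it involves a lot of memory handling that .NET doesn't have. The only problem with that example is that you have to specify the minimum word pattern length, whereas I want it to be dynamic, from 2 up to the longest pattern found.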
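A straightforward, unoptimised way to get a result of the shape listed above is to count word n-grams for every length, from the full text down to 2, and keep those that occur at least twice. The sketch below is in Python rather than VB.NET for brevity. `repeated_phrases` and `min_count` are hypothetical names, words are split on whitespace only, and this is not the algorithm from the CodeProject article. Its work grows quadratically with the number of words, so it is only meant for short texts.

```python
from collections import Counter

def repeated_phrases(text, min_count=2):
    """Return (phrase, count) pairs for every word n-gram with n >= 2
    that occurs at least min_count times, longest phrases first."""
    words = text.split()  # assumption: whitespace-only tokenisation
    results = []
    for n in range(len(words), 1, -1):  # longest length first, down to 2
        counts = Counter(
            tuple(words[i:i + n]) for i in range(len(words) - n + 1)
        )
        # Counter keeps first-seen order, so equal-length phrases stay in text order.
        results.extend(
            (" ".join(gram), c) for gram, c in counts.items() if c >= min_count
        )
    return results

for phrase, count in repeated_phrases("Hello I am a test Also I am a test"):
    print(f"{phrase!r}: {count}")
```

For the sample text this prints the six phrases from the target result, each with a frequency of 2, in the same order. Overlapping occurrences are counted separately, so "a a a" would report "a a" twice.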