N-gram function in vb.net -> create grams for words instead of characters

Time: 2010-03-10 11:46:23

Tags: vb.net text-mining n-gram

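I recently found out about n-grams and the possibility of using them to compare the frequency of phrases in a text body. Now I'm trying to build a vb.net application that simply takes a text body and returns a list of the most frequently used phrases (where n >= 2).

I found a C# example of how to generate n-grams from a text body, so I started converting the code to VB. The problem is that this code creates one gram per character instead of one per word. The delimiters I want to use for the words are: VbCrLf (new line), vbTab (tab) and the following characters: !@#$%^&*()_+-={}|\:"'?¿/.,<>’¡º×÷‘;«»[]

Does anyone have an idea how to rewrite the following function for this purpose?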

Friend Shared Function GenerateNGrams(ByVal text As String, ByVal gramLength As Integer) As String()
    If text Is Nothing OrElse text.Length = 0 Then
        Return Nothing
    End If

    Dim grams As New ArrayList()
    Dim length As Integer = text.Length
    If length < gramLength Then
        Dim gram As String
        For i As Integer = 1 To length
            gram = text.Substring(0, (i) - (0))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)
            End If
        Next

        gram = text.Substring(length - 1, (length) - (length - 1))
        If grams.IndexOf(gram) = -1 Then
            grams.Add(gram)

        End If
    Else
        For i As Integer = 1 To gramLength - 1
            Dim gram As String = text.Substring(0, (i) - (0))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)

            End If
        Next

        For i As Integer = 0 To (length - gramLength)
            Dim gram As String = text.Substring(i, (i + gramLength) - (i))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)
            End If
        Next

        For i As Integer = (length - gramLength) + 1 To length - 1
            Dim gram As String = text.Substring(i, (length) - (i))
            If grams.IndexOf(gram) = -1 Then
                grams.Add(gram)
            End If
        Next
    End If
    Return Tokeniser.ArrayListToArray(grams)
End Function

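2 Answers:

Answer 0 (score: 2)

An n-gram for words is just a list of length n that stores these words. A list of n-grams is then simply a list of lists of words. If you want to store frequencies, you need a dictionary that is indexed by these n-grams. For the special case of 2-grams, you can imagine something like this: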

Dim frequencies As New Dictionary(Of String(), Integer)(New ArrayComparer(Of String)())
Const separators as String = "!@#$%^&*()_+-={}|\:""'?¿/.,<>’¡º×÷‘;«»[] " & _
                             ControlChars.CrLf & ControlChars.Tab
Dim words = text.Split(separators.ToCharArray(), StringSplitOptions.RemoveEmptyEntries)

For i As Integer = 0 To words.Length - 2
    Dim ngram = New String() { words(i), words(i + 1) }
    Dim oldValue As Integer = 0
    frequencies.TryGetValue(ngram, oldValue)
    frequencies(ngram) = oldValue + 1
Next

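frequencies should now contain a dictionary with all pairs of two consecutive words found in the text, together with the frequency with which they appear (as a consecutive pair).

This code requires the ArrayComparer class: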

Public Class ArrayComparer(Of T)
    Implements IEqualityComparer(Of T())

    Private ReadOnly comparer As IEqualityComparer(Of T)

    Public Sub New()
        Me.New(EqualityComparer(Of T).Default)
    End Sub

    Public Sub New(ByVal comparer As IEqualityComparer(Of T))
        Me.comparer = comparer
    End Sub

    Public Overloads Function Equals(ByVal a As T(), ByVal b As T()) As Boolean _
            Implements IEqualityComparer(Of T()).Equals
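        ' Arrays of different lengths can never be equal; checking here also
        ' keeps the loop below from indexing past the end of b.
        If a.Length <> b.Length Then Return False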
        For i As Integer = 0 to a.Length - 1
            If Not comparer.Equals(a(i), b(i)) Then Return False
        Next

        Return True
    End Function

    Public Overloads Function GetHashCode(ByVal arr As T()) As Integer _
            Implements IEqualityComparer(Of T()).GetHashCode
        Dim hashCode As Integer = 17
        For Each obj As T In arr
            hashCode = ((hashCode << 5) - 1) Xor comparer.GetHashCode(obj)
        Next

        Return hashCode
    End Function
End Class

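Unfortunately, this code doesn't compile on Mono because the VB compiler has problems finding the generic EqualityComparer class. I was therefore unable to test whether the GetHashCode implementation works as expected, but it should be fine.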
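The 2-gram loop in this answer can be generalised to any fixed n. The following is only a sketch under stated assumptions, not the answerer's code: the module and the helper `CountWordNGrams` are hypothetical names, and each n-gram is stored as its words joined by a single space instead of as a `String()` key. Because the space is itself one of the separators, no word can contain it, so the joined key is unambiguous and the default string comparer can be used in place of ArrayComparer.

```vbnet
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module WordNGramSketch
    ' Hypothetical helper (not part of the original answer): counts word
    ' n-grams of one fixed length n, keyed by the space-joined words.
    Function CountWordNGrams(ByVal text As String, ByVal n As Integer) As Dictionary(Of String, Integer)
        If n < 1 Then Throw New ArgumentOutOfRangeException("n")

        ' Same separator set as in the answer, including the space.
        Dim separators As String = "!@#$%^&*()_+-={}|\:""'?¿/.,<>’¡º×÷‘;«»[] " & _
                                   ControlChars.CrLf & ControlChars.Tab
        Dim words As String() = text.Split(separators.ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
        Dim frequencies As New Dictionary(Of String, Integer)()

        ' The loop body does not run at all when the text has fewer than n words.
        For i As Integer = 0 To words.Length - n
            ' Join n consecutive words, starting at index i, into one key.
            Dim ngram As String = String.Join(" ", words, i, n)
            Dim oldValue As Integer = 0
            frequencies.TryGetValue(ngram, oldValue)
            frequencies(ngram) = oldValue + 1
        Next

        Return frequencies
    End Function

    Sub Main()
        For Each entry As KeyValuePair(Of String, Integer) In CountWordNGrams("Hello I am a test Also I am a test", 3)
            Console.WriteLine("{0}: {1}", entry.Key, entry.Value)
        Next
    End Sub
End Module
```

String keys trade the structured `String()` representation for simplicity; if the individual words are needed later, the key can be split on the space again.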

Answer 1 (score: 0)

Thank you very much, Konrad, for this start of a solution!

I tried your code and got the following result:

Text = "Hello I am a test Also I am a test"
(I also included whitespace as a separator)

frequencies now has 9 items:
---------------------
Keys: "Hello", "I"
Value: 1
---------------------
Keys: "I", "am"
Value: 1
---------------------
Keys: "am", "a"
Value: 1
---------------------
Keys: "a", "test"
Value: 1
---------------------
Keys: "test", "Also"
Value: 1
---------------------
Keys: "Also", "I"
Value: 1
---------------------
Keys: "I", "am"
Value: 1
---------------------
Keys: "am", "a"
Value: 1
---------------------
Keys: "a", "test"
Value: 1
---------------------

My first question: shouldn't the last 3 key pairs get a value of 2, since they are found twice in the text?
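One possible explanation, stated here as an assumption because the exact code that was run is not shown: the dictionary may have been created without the ArrayComparer argument. In .NET, arrays used as dictionary keys are compared by reference unless a comparer is supplied, so two arrays holding the same words count as two different keys. A minimal sketch of that effect follows; the module and variable names are illustrative only.

```vbnet
Imports System
Imports System.Collections.Generic

Module ArrayKeyDemo
    Sub Main()
        ' Two distinct array objects with identical contents.
        Dim first As String() = New String() {"I", "am"}
        Dim second As String() = New String() {"I", "am"}

        ' No comparer supplied: array keys are compared by reference,
        ' so each array becomes its own entry.
        Dim byReference As New Dictionary(Of String(), Integer)()
        byReference(first) = 1
        byReference(second) = 1
        Console.WriteLine(byReference.Count) ' prints 2

        ' Keys built by joining the words are compared by content,
        ' so the second pair increments the first pair's counter.
        Dim byContent As New Dictionary(Of String, Integer)()
        For Each pair As String() In New String()() {first, second}
            Dim key As String = String.Join(" ", pair)
            Dim oldValue As Integer = 0
            byContent.TryGetValue(key, oldValue)
            byContent(key) = oldValue + 1
        Next
        Console.WriteLine(byContent("I am")) ' prints 2
    End Sub
End Module
```

If this is the cause, passing `New ArrayComparer(Of String)()` to the dictionary constructor, as in Konrad's answer, should merge the duplicate entries.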

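Second: the reason I got into the n-gram approach is that I don't want to limit the word count (n) to a specific length. Is there a way to make a dynamic approach that tries to find the longest phrase match first and then steps down to a final word count of 2?

My target result for the sample query above is: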

---------------------
Match: "I am a test"
Frequency: 2
---------------------
Match: "I am a"
Frequency: 2
---------------------
Match: "am a test"
Frequency: 2
---------------------
Match: "I am"
Frequency: 2
---------------------
Match: "am a"
Frequency: 2
---------------------
Match: "a test"
Frequency: 2
---------------------
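A result of this shape can be produced by counting n-grams for every length from 2 upwards and keeping only the phrases that occur at least twice. The sketch below is one possible approach under stated assumptions, not a port of the C++ article mentioned further down: `FindRepeatedPhrases` is a hypothetical name, phrases are stored as space-joined strings, and the "at least twice" threshold is inferred from the target result above.

```vbnet
Imports System
Imports System.Collections.Generic
Imports Microsoft.VisualBasic

Module RepeatedPhraseSketch
    ' Hypothetical helper: returns every phrase of two or more words that
    ' occurs at least twice, with longer phrases listed before shorter ones.
    Function FindRepeatedPhrases(ByVal text As String) As List(Of KeyValuePair(Of String, Integer))
        Dim separators As String = "!@#$%^&*()_+-={}|\:""'?¿/.,<>’¡º×÷‘;«»[] " & _
                                   ControlChars.CrLf & ControlChars.Tab
        Dim words As String() = text.Split(separators.ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
        Dim repeated As New List(Of KeyValuePair(Of String, Integer))()

        For n As Integer = 2 To words.Length - 1
            ' Count all phrases of exactly n words.
            Dim counts As New Dictionary(Of String, Integer)()
            For i As Integer = 0 To words.Length - n
                Dim phrase As String = String.Join(" ", words, i, n)
                Dim oldValue As Integer = 0
                counts.TryGetValue(phrase, oldValue)
                counts(phrase) = oldValue + 1
            Next

            Dim found As New List(Of KeyValuePair(Of String, Integer))()
            For Each entry As KeyValuePair(Of String, Integer) In counts
                If entry.Value >= 2 Then found.Add(entry)
            Next

            ' A repeated phrase of n + 1 words always contains a repeated
            ' phrase of n words, so no longer phrase can repeat either.
            If found.Count = 0 Then Exit For

            ' Put this length ahead of the shorter phrases found earlier.
            repeated.InsertRange(0, found)
        Next

        Return repeated
    End Function

    Sub Main()
        For Each entry As KeyValuePair(Of String, Integer) In FindRepeatedPhrases("Hello I am a test Also I am a test")
            Console.WriteLine("Match: ""{0}""", entry.Key)
            Console.WriteLine("Frequency: {0}", entry.Value)
        Next
    End Sub
End Module
```

For the sample text this finds the six phrases listed above, each with frequency 2. The order of phrases that share the same length depends on dictionary enumeration order, which .NET does not guarantee.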

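Hatem Mostafa wrote a similar approach in C++ over at codeproject.com: N-gram and Fast Pattern Extraction Algorithm

Sadly I'm no C++ expert and have no idea how to convert this code, as it includes a lot of memory handling that .NET does not have. The only problem with this example is that you have to specify the minimum word pattern length, and I want it to be dynamic, from 2 up to the maximum found.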