Splitting lines of text into words and deciding which word is correct by voting

Time: 2018-02-28 12:39:16

Tags: vb.net text-parsing text-extraction

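The following code splits each line into words, then groups the first word of every line into one list, the second word of every line into another list, and so on. From each list it picks the most frequent word as the correct one.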

Module Module1

Sub Main()
    Dim correctLine As String = ""
    Dim line1 As String = "Canda has more than ones official language"
    Dim line2 As String = "Canada has more than one oficial languages"
    Dim line3 As String = "Canada has nore than one official lnguage"
    Dim line4 As String = "Canada has nore than one offical language"

    Dim wordsOfLine1() As String = line1.Split(" ")
    Dim wordsOfLine2() As String = line2.Split(" ")
    Dim wordsOfLine3() As String = line3.Split(" ")
    Dim wordsOfLine4() As String = line4.Split(" ")


    For i As Integer = 0 To wordsOfLine1.Length - 1
        ' Collect the i-th word of every line.
        Dim wordAllLinesTemp As New List(Of String)(New String() {wordsOfLine1(i), wordsOfLine2(i), wordsOfLine3(i), wordsOfLine4(i)})
        ' Order the distinct words by how often they occur, most frequent first.
        Dim counts = From n In wordAllLinesTemp
                     Group n By n Into Group
                     Order By Group.Count() Descending
                     Select Group.First
        correctLine = correctLine & counts.First & " "
    Next
    correctLine = correctLine.Remove(correctLine.Length - 1)
    Console.WriteLine(correctLine)
    Console.ReadKey()

End Sub
End Module

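My question: how can I make this work with lines that contain different numbers of words? Here every line is 7 words long, and the For loop relies on that length (Length - 1). Suppose line 3 contained only 5 words.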

1 answer:

Answer 0 (score: 0)

Edit: I accidentally had this wrong; the correct index should be the shortest one.

As far as I understand, you are trying to find out which line is closest to the correct line.

You can compute the Levenshtein distance between two strings with the following function, which can then be used to determine which line is closest:

Public Function LevDist(ByVal s As String,
                        ByVal t As String) As Integer
    Dim n As Integer = s.Length
    Dim m As Integer = t.Length
    ' VB array bounds are inclusive upper bounds, so d(n, m) already holds (n + 1) x (m + 1) cells.
    Dim d(n, m) As Integer

    If n = 0 Then
        Return m
    End If

    If m = 0 Then
        Return n
    End If

    Dim i As Integer
    Dim j As Integer

    For i = 0 To n
        d(i, 0) = i
    Next

    For j = 0 To m
        d(0, j) = j
    Next

    For i = 1 To n
        For j = 1 To m

            Dim cost As Integer
            If t(j - 1) = s(i - 1) Then
                cost = 0
            Else
                cost = 1
            End If

            d(i, j) = Math.Min(Math.Min(d(i - 1, j) + 1, d(i, j - 1) + 1),
                               d(i - 1, j - 1) + cost)
        Next
    Next

    Return d(n, m)
End Function
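
The function above only measures the distance between two strings; it does not pick a line by itself. A minimal sketch of one way to do that follows, assuming that "closest" means the smallest total distance to all of the other candidate lines. `ClosestLine` and the `LineSelection` module are hypothetical names, not part of the original answer, and the distance function is passed in as a parameter so that `LevDist` can be plugged in. Because whole lines are compared, the lines do not need to contain the same number of words.

```vb
Imports System
Imports System.Collections.Generic

Public Module LineSelection
    ' Hypothetical helper (not from the original answer): returns the line
    ' whose summed distance to every other line is smallest.
    ' "distance" can be any string metric, for example AddressOf LevDist.
    Public Function ClosestLine(ByVal lines As IList(Of String),
                                ByVal distance As Func(Of String, String, Integer)) As String
        ' Nothing to choose from.
        If lines Is Nothing OrElse lines.Count = 0 Then
            Return Nothing
        End If

        Dim bestIndex As Integer = 0
        Dim bestTotal As Integer = Integer.MaxValue

        For i As Integer = 0 To lines.Count - 1
            ' Sum the distances from line i to all other lines.
            Dim total As Integer = 0
            For j As Integer = 0 To lines.Count - 1
                If i <> j Then
                    total += distance(lines(i), lines(j))
                End If
            Next

            ' Keep the first line with the smallest total (ties go to the earlier line).
            If total < bestTotal Then
                bestTotal = total
                bestIndex = i
            End If
        Next

        Return lines(bestIndex)
    End Function
End Module
```

With the variables from the question, a call could look like `ClosestLine(New String() {line1, line2, line3, line4}, AddressOf LevDist)`. Note that this returns one of the existing lines unchanged, so if every line contains a typo, the result will still contain one.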