Identify the language of the characters in a String in VB.net

Time: 2014-10-13 11:55:11

Tags: vb.net

A function in my web service receives data in different languages:

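  1. Arabic

I want to write a function that identifies which language the characters in a received string belong to.

I have already found one for Arabic: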

    ' Returns True if Msg contains at least one character in the
    ' Arabic letter range U+0621..U+064A.
    Public Function IsGenericArabic(ByVal Msg As String) As Boolean
        For Each ch As Char In Msg
            Dim codePoint As Integer = AscW(ch)
            If codePoint >= &H621 AndAlso codePoint <= &H64A Then
                Return True
            End If
        Next
        Return False
    End Function
    

However, I could not find anything on how to identify Russian or Romanian.

Any help would be greatly appreciated.

1 answer:

Answer 0: (score: 2)

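For many languages, comparing single characters will not help. Think of all the languages that use the Latin alphabet. For these languages you will have to detect words of the language. The problem is to find a list of the words that are most likely to occur in the input text. Full-text search algorithms usually exclude words that occur in most sentences, because they are too frequent and therefore not selective enough. These are words like "and", "the", "a" and "of". Lists of such words are called stop word lists. But these words are exactly what we need here. Look for stop word lists for all the languages you want to detect (a Google search helps).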
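To make the first point concrete, here is a minimal sketch that generalises the range check from the question. The names ContainsCharInRange and IsGenericCyrillic are hypothetical, and the range U+0400..U+04FF is the basic Unicode Cyrillic block. A check like this can tell that a text contains Cyrillic characters, but Cyrillic is shared by Russian, Ukrainian, Bulgarian and other languages, and Romanian written in the Latin alphabet cannot be told apart from other Latin-script languages at this level.

```vbnet
Imports Microsoft.VisualBasic

Module ScriptChecks
    ' Hypothetical helper: True if msg contains at least one UTF-16 code unit
    ' in the inclusive range first..last.
    Public Function ContainsCharInRange(ByVal msg As String,
                                        ByVal first As Integer,
                                        ByVal last As Integer) As Boolean
        If msg Is Nothing Then Return False
        For Each ch As Char In msg
            Dim code As Integer = AscW(ch)
            If code >= first AndAlso code <= last Then Return True
        Next
        Return False
    End Function

    ' Basic Cyrillic block, U+0400..U+04FF. This detects the script,
    ' not the language: several languages share these characters.
    Public Function IsGenericCyrillic(ByVal msg As String) As Boolean
        Return ContainsCharInRange(msg, &H400, &H4FF)
    End Function
End Module
```

With this sketch, IsGenericCyrillic("Привет") returns True and IsGenericCyrillic("Bună ziua") returns False, which is why the Latin-script case needs a word-based approach.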

The stop word detection algorithm then looks roughly like this (in pseudocode, i.e. some details are missing):

    'Requires: Imports System.Collections.Generic, Imports System.IO

    Class LanguageInfo
        Public Property LanguageCode As String
        Public Property Words As HashSet(Of String)
    End Class

    Dim infoList = New List(Of LanguageInfo)()

    'Prepare the language information
    For Each language In { "rus", "rom", ... }
        'Assuming one stop word per line in <language>.txt
        Dim stopWords() As String = File.ReadAllLines(language & ".txt")
        Dim info = New LanguageInfo()
        info.LanguageCode = language
        'Case-insensitive set, so "The" and "the" count as the same word
        info.Words = New HashSet(Of String)(stopWords, StringComparer.OrdinalIgnoreCase)
        infoList.Add(info)
    Next

    'Detect language of input
    '(input is the received string; SplitIntoSingleWords is a helper you still have to write)
    Dim bestLanguageGuess As String = ""
    Dim maxWeight As Integer = 0
    Dim inputWords() As String = SplitIntoSingleWords(input)
    For Each info In infoList
        'weight = number of input words found in this language's stop word list
        Dim weight As Integer = 0
        For Each w In inputWords
            If info.Words.Contains(w) Then
                weight += 1
            End If
        Next
        If weight > maxWeight Then
            bestLanguageGuess = info.LanguageCode
            maxWeight = weight
        End If
    Next
    If maxWeight > 0 Then
        'bestLanguageGuess is the language we are looking for
    End If
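
The pseudocode above calls SplitIntoSingleWords without defining it. One possible way to fill in that detail is sketched below. It is only a sketch under some assumptions: a "word" is taken to be any run of Unicode letters and combining marks, everything else is treated as a separator, and words are lower-cased so that they match stop word lists stored in lower case. The module name WordSplitter is made up for this example.

```vbnet
Imports System.Linq
Imports System.Text.RegularExpressions

Module WordSplitter
    ' Hypothetical helper: returns the lower-cased runs of Unicode letters
    ' (plus combining marks) found in source. Digits, punctuation and
    ' whitespace act as separators.
    Public Function SplitIntoSingleWords(ByVal source As String) As String()
        If String.IsNullOrEmpty(source) Then
            Return New String() {}
        End If
        Return Regex.Matches(source, "[\p{L}\p{M}]+") _
                    .Cast(Of Match)() _
                    .Select(Function(m) m.Value.ToLowerInvariant()) _
                    .ToArray()
    End Function
End Module
```

For example, SplitIntoSingleWords("Это и то, 123!") yields the three words "это", "и" and "то". Splitting on letter runs instead of on spaces keeps punctuation such as commas from sticking to the words, which would otherwise prevent them from matching entries in the stop word sets.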