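One feature of my web service receives data in different languages.
I want to write a function that identifies which language the characters in a received string belong to.
I have already found one for Arabic: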
Public Function IsGenericArabic(ByVal Msg As String) As Boolean
    ' True as soon as one character lies in the basic Arabic letter range U+0621..U+064A
    For Each ch As Char In Msg
        Dim codePoint As Integer = AscW(ch)
        If codePoint >= &H621 AndAlso codePoint <= &H64A Then
            Return True
        End If
    Next
    Return False
End Function
But I cannot find anything on how to identify Russian or Romanian.
Any help would be greatly appreciated.
Answer 0 (score: 2)
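For many languages, comparing single characters will not help. Think of all the languages that use Latin letters. For these you have to detect words of the language instead. The problem is finding a list of words that are most likely to occur in the input text. Full-text search algorithms usually exclude words that occur so frequently in most sentences that they are not selective enough. These are words like "and", "the", "a", and "of". Lists of such words are called stop word lists. But they are exactly what we need here. Find stop word lists for all the languages you want to detect (a Google search helps).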
The algorithm would then look like this (in pseudocode, i.e. some details are missing):
Class LanguageInfo
    Public Property LanguageCode As String
    Public Property Words As HashSet(Of String)
End Class
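Dim infoList = New List(Of LanguageInfo)()
'Prepare the language information (File.ReadAllLines needs Imports System.IO)
For Each language In { "rus", "rom", ... }
    'Assuming one stop word per line in a file named after the language code
    Dim stopWords() As String = File.ReadAllLines(language & ".txt")
    Dim info = New LanguageInfo()
    info.LanguageCode = language
    'Ignore case so that "The" at the start of a sentence still matches "the"
    info.Words = New HashSet(Of String)(stopWords, StringComparer.OrdinalIgnoreCase)
    infoList.Add(info)
Next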
'Detect language of input
Dim bestLanguageGuess As String = ""
Dim maxWeight As Integer = 0
'SplitIntoSingleWords is one of the missing details: it must break the text into words
Dim inputWords() As String = SplitIntoSingleWords(input)
For Each info In infoList
    'Count how many input words are stop words of this language
    Dim weight As Integer = 0
    For Each w In inputWords
        If info.Words.Contains(w) Then
            weight = weight + 1
        End If
    Next
    If weight > maxWeight Then
        bestLanguageGuess = info.LanguageCode
        maxWeight = weight
    End If
Next
If maxWeight > 0 Then
    'bestLanguageGuess is the language we are looking for
End If
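To try the idea without any stop word files, the pseudocode can be filled in roughly as follows. This is only a minimal sketch under several assumptions: `GuessLanguage` and `SampleStopWords` are hypothetical names, `SplitIntoSingleWords` is just one possible implementation (every run of Unicode letters counts as a word), and the in-memory word lists are tiny illustrative samples, not real stop word lists.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LanguageGuessDemo

    ' One possible word splitter (an assumption): each run of Unicode letters is a word.
    Function SplitIntoSingleWords(text As String) As List(Of String)
        Dim words As New List(Of String)()
        For Each m As Match In Regex.Matches(text, "\p{L}+")
            words.Add(m.Value)
        Next
        Return words
    End Function

    ' Returns the language code with the most stop word hits, or "" if nothing matched.
    Function GuessLanguage(text As String,
                           stopWords As Dictionary(Of String, HashSet(Of String))) As String
        Dim inputWords = SplitIntoSingleWords(text)
        Dim bestLanguageGuess As String = ""
        Dim maxWeight As Integer = 0
        For Each entry In stopWords
            Dim weight As Integer = 0
            For Each w In inputWords
                If entry.Value.Contains(w) Then weight += 1
            Next
            If weight > maxWeight Then
                bestLanguageGuess = entry.Key
                maxWeight = weight
            End If
        Next
        Return bestLanguageGuess
    End Function

    ' Tiny illustrative samples only; real lists would be loaded from files.
    Function SampleStopWords() As Dictionary(Of String, HashSet(Of String))
        Dim ignoreCase = StringComparer.OrdinalIgnoreCase
        Dim table As New Dictionary(Of String, HashSet(Of String))()
        table("eng") = New HashSet(Of String)(New String() {"the", "and", "of", "a"}, ignoreCase)
        table("rus") = New HashSet(Of String)(New String() {"и", "в", "не", "на"}, ignoreCase)
        table("rom") = New HashSet(Of String)(New String() {"și", "de", "la", "în", "un"}, ignoreCase)
        Return table
    End Function

    Sub Main()
        Dim table = SampleStopWords()
        Console.WriteLine(GuessLanguage("The cat and the dog", table)) ' prints "eng"
    End Sub

End Module
```

With lists this short, a tie or a zero score is common, so treat an empty result as "unknown" rather than forcing a guess. Longer input texts and fuller stop word lists should make the counts more reliable.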