Question

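I have a string that contains both English and Arabic. It comes from an API, so I can't put a marker or delimiter in it.

What I want is to split the Arabic and the English into two parts. Here is a sample string: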

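"بِاسْمِكَ رَبِّي وَضَعْتُ جَنْبِي، وَبِكَ أَرْفَعُهُ، فَإِنْ أَمْسَكْتَ نَفْسِي فَارْحَمْهَا، وَإِنْ أَرْسَلْتَهَا فَاحْفَظْهَا، بِمَا تَحْفَظُ بِهِ عِبَادَكَ الصَّالِحِينَ.Bismika rabbee wadaAAtu janbee wabika arfaAAuh, fa-in amsakta nafsee farhamha, wa-in arsaltaha fahfathha bima tahfathu bihi AAibadakas-saliheen. In Your name my Lord, I lie down and in Your name I rise, so if You should take my soul then have mercy upon it, and if You should return my soul then protect it in the manner You do so with Your righteous servants."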

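I can't figure out how to split it into two parts, one Arabic and one English.

What I want:

The string could contain any language. My problem is to extract only the English or only the Arabic and show each in its own field.

How can I achieve this?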

Answer 1

You can use the Natural Language tagger (NLTagger), which works even when several scripts are mixed together:

import NaturalLanguage

let str = "¿como? بداية start وسط middle начать средний конец نهاية end. 從中間開始. "

let tagger = NLTagger(tagSchemes: [.script])

tagger.string = str

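var index = str.startIndex
var dictionary = [String: String]()   // script code (e.g. "Arab", "Latn") -> collected text
var lastScript = "other"              // bucket for text seen before any script is identified


while index < str.endIndex {
    // Tag the token at `index`; the result is (tag, range of that token).
    let res = tagger.tag(at: index, unit: .word, scheme: .script)
    let range = res.1

    let script = res.0?.rawValue

    switch script {
    case .some(let s):
        // A script was identified: move any pending untagged text into its bucket.
        lastScript = s
        dictionary[s, default: ""] += dictionary["other", default: ""] + str[range]
        dictionary.removeValue(forKey: "other")
    default:
        // No tag (typically whitespace or punctuation): attach to the most recent script.
        dictionary[lastScript, default: ""] += str[range]
    }

    index = range.upperBound
}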

print(dictionary)

Then print the results as needed:

for entry in dictionary {
    print(entry.key, ":", entry.value)
}

This yields:

Hant : 從中間開始. 
Cyrl : начать средний конец 
Arab : بداية وسط نهاية 
Latn : ¿como? start middle end.
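
To fill the two fields the question asks for, one option is to read the Arab and Latn entries out of that dictionary. The following is only a minimal sketch and not part of the original answer: `buckets` is a hypothetical stand-in holding two of the entries printed above, and `field(for:in:)` is an illustrative helper name. Note that Latn covers every Latin-script language, which is why the Spanish ¿como? ends up next to the English words.

```swift
import Foundation

// Hypothetical stand-in for the `dictionary` built by the tagger loop above,
// filled with two of the entries from the printed output.
let buckets: [String: String] = [
    "Arab": "بداية وسط نهاية ",
    "Latn": "¿como? start middle end. "
]

// Illustrative helper: returns the text collected for one script code,
// or an empty string when that script never appeared.
func field(for scriptCode: String, in buckets: [String: String]) -> String {
    return buckets[scriptCode, default: ""]
        .trimmingCharacters(in: .whitespacesAndNewlines)
}

let arabicField = field(for: "Arab", in: buckets)
let latinField = field(for: "Latn", in: buckets)

print(arabicField)
print(latinField)
```

Looking the script up with a default avoids force-unwrapping when the API returns text that has no Arabic or no Latin part at all.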

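This is still not perfect, because the tagger only checks which script most of the letters in a word belong to. In the string you are using, for example, the tagger treats الصَّالِحِينَ.Bismika as a single word. To get around this, we can walk the original string with two indices and check the script of each word separately. Here a word is defined as a run of consecutive letters: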

let str = "بِاسْمِكَ رَبِّي وَضَعْتُ جَنْبِي، وَبِكَ أَرْفَعُهُ، فَإِنْ أَمْسَكْتَ نَفْسِي فَارْحَمْهَا، وَإِنْ أَرْسَلْتَهَا فَاحْفَظْهَا، بِمَا تَحْفَظُ بِهِ عِبَادَكَ الصَّالِحِينَ.Bismika rabbee wadaAAtu janbee wabika arfaAAuh, fa-in amsakta nafsee farhamha, wa-in arsaltaha fahfathha bima tahfathu bihi AAibadakas-saliheen. In Your name my Lord, I lie down and in Your name I rise, so if You should take my soul then have mercy upon it, and if You should return my soul then protect it in the manner You do so with Your righteous servants."

let tagger = NLTagger(tagSchemes: [.script])
var i = str.startIndex
var dictionary = [String: String]()
var lastScript = "glyphs"

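while i < str.endIndex {
    // Skip the run of non-letter characters that starts at i.
    var j = i
    while j < str.endIndex,
        CharacterSet.letters.inverted.isSuperset(of: CharacterSet(charactersIn: String(str[j]))) {
        j = str.index(after: j)
    }
    // Append that run (spaces, punctuation) to the bucket of the most recent script.
    if i != j { dictionary[lastScript, default: ""] += str[i..<j] }
    if j < str.endIndex { i = j } else { break }

    // Advance j to the end of the following run of letters: one "word".
    while j < str.endIndex,
        CharacterSet.letters.isSuperset(of: CharacterSet(charactersIn: String(str[j]))) {
        j = str.index(after: j)
    }

    // Tag the word on its own, so that adjacent text in another script cannot affect it.
    let tempo = String(str[i..<j])
    tagger.string = tempo
    let res = tagger.tag(at: tempo.startIndex, unit: .word, scheme: .script)

    if let s = res.0?.rawValue {
        lastScript = s
        // Move any leading non-letter text held under "glyphs" into the first script found.
        dictionary[s, default: ""] += dictionary["glyphs", default: ""] + tempo
        dictionary.removeValue(forKey: "glyphs")
    }
    else { dictionary["other", default: ""] += tempo }

    i = j
}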
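
The "run of consecutive letters" definition can be looked at in isolation, without the tagger. The sketch below is an illustration only: `letterRuns(in:)` is a hypothetical helper name, it discards the non-letter characters instead of bucketing them, and it treats a character as a letter when all of its Unicode scalars are in `CharacterSet.letters`.

```swift
import Foundation

// Hypothetical helper: splits `text` into maximal runs of letters and
// drops everything else (spaces, punctuation, digits).
func letterRuns(in text: String) -> [String] {
    var runs: [String] = []
    var current = ""
    for character in text {
        // A character counts as a letter when every scalar in it is a letter or mark.
        let isLetter = character.unicodeScalars.allSatisfy { CharacterSet.letters.contains($0) }
        if isLetter {
            current.append(character)
        } else if !current.isEmpty {
            // A non-letter closes the run collected so far.
            runs.append(current)
            current = ""
        }
    }
    if !current.isEmpty { runs.append(current) }
    return runs
}

print(letterRuns(in: "الصالحين.Bismika rabbee"))
```

Because the period is not a letter, the Arabic word and Bismika come out as separate runs, so each one can be passed to the tagger separately.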

Answer 2

Step 1: Split the whole string into an array on ".", since the sentences in your text are separated by periods.

Step 2: Detect the language of each sentence and append it to the string for that language.

Final Code

// Add this to your view controller (requires `import UIKit`).

enum Language: String {
    case arabic = "ar"
    case english = "en"
}

override func viewDidLoad() {
    super.viewDidLoad()
    // Split the string into an array of sentences.
    let kalmaArray = "بِاسْمِكَ رَبِّي وَضَعْتُ جَنْبِي، وَبِكَ أَرْفَعُهُ، فَإِنْ أَمْسَكْتَ نَفْسِي فَارْحَمْهَا، وَإِنْ أَرْسَلْتَهَا فَاحْفَظْهَا، بِمَا تَحْفَظُ بِهِ عِبَادَكَ الصَّالِحِينَ.Bismika rabbee wadaAAtu janbee wabika arfaAAuh, fa-in amsakta nafsee farhamha, wa-in arsaltaha fahfathha bima tahfathu bihi AAibadakas-saliheen. In Your name my Lord, I lie down and in Your name I rise, so if You should take my soul then have mercy upon it, and if You should return my soul then protect it in the manner You do so with Your righteous servants.".components(separatedBy: ".")

    splitInLanguages(kalmaArray: kalmaArray)

}



private func splitInLanguages(kalmaArray: [String]) {
    var englishText = ""
    var arabicText = ""

    for kalma in kalmaArray {

        // Splitting on "." leaves empty pieces (for example after the final period); skip them.
        if !kalma.isEmpty {

            if let language = NSLinguisticTagger.dominantLanguage(for: kalma) {
                switch language {
                case Language.arabic.rawValue:
                    arabicText.append(kalma)
                    arabicText.append(".")
                default: // Anything that is not Arabic is treated as English.
                    englishText.append(kalma)
                    englishText.append(".")
                }
            } else {
                print("Unknown language")
            }
        }
    }

    debugPrint("Arabic: ", arabicText)
    debugPrint("English: ", englishText)
}
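
Step 1 can be tried on its own to see what the pieces look like before language detection. This is a small sketch on a shortened, made-up sample string, not the full text from the question. It relies on the periods separating sentences, as they do in this text.

```swift
import Foundation

// Shortened sample: the Arabic sentence runs straight into the Latin one.
let sample = "الصالحين.Bismika rabbee. In Your name."

// Splitting on "." removes the periods and yields one piece per sentence,
// plus an empty trailing piece because the text ends with ".".
let pieces = sample.components(separatedBy: ".")

// Trim stray whitespace and drop the empty pieces before detecting languages.
let sentences = pieces
    .map { $0.trimmingCharacters(in: .whitespaces) }
    .filter { !$0.isEmpty }

print(pieces.count)   // 4
print(sentences)
```

The empty trailing piece is why the function above checks for empty strings before asking for the dominant language.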

I hope this helps you split the string into the two languages. Let me know if you still have any issues.

Answer 3

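You can use the NaturalLanguage tagger from @ielyamani's answer, but its one limitation is that it requires iOS 12+.

If you need to do this on earlier iOS versions, you can look at NSCharacterSet.

You can create your own character set to check whether a string consists of English letters and digits: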

import Foundation

extension String {

    // Returns true only when every character is an ASCII letter or digit
    // (an empty string also returns true).
    func containsLatinCharacters() -> Bool {

        let charSet = CharacterSet(charactersIn: "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890")

        // Look for the first character outside the allowed set.
        let range = rangeOfCharacter(from: charSet.inverted)

        // Finding one means the string is not purely Latin.
        return range == nil
    }
}

Another option is to use the character sets that are already available:

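// Note: trimmingCharacters(in:) only strips matching characters from the two ends of the string,
// and .alphanumerics covers letters and digits of every script, Arabic included.
let nonLatinString = string.trimmingCharacters(in: .alphanumerics) // leading/trailing letters and digits are removed; symbols remain
let latinString = string.trimmingCharacters(in: CharacterSet.alphanumerics.inverted) // leading/trailing symbols and whitespace are removed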

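These can get you part of the way to the strings you want. If they are not good enough, you can build your own character set, using union, intersection and so on to filter the characters you do and do not want.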
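
One way to build such a custom set is sketched below. It filters Unicode scalars one by one instead of trimming the ends of the string. The assumptions are mine and not from the answer above: "Latin" is taken to mean ASCII letters and digits only, whitespace is kept on both sides, and `keepScalars(of:where:)` is a hypothetical helper name.

```swift
import Foundation

// Assumption: "Latin" here means ASCII letters and digits only.
let asciiAlphanumerics = CharacterSet(charactersIn:
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")

// union(_:) adds whitespace so that the Latin words stay separated.
let latinAllowed = asciiAlphanumerics.union(.whitespaces)

// Hypothetical helper: keeps only the scalars of `text` accepted by `isIncluded`.
func keepScalars(of text: String, where isIncluded: (Unicode.Scalar) -> Bool) -> String {
    var kept = String.UnicodeScalarView()
    for scalar in text.unicodeScalars where isIncluded(scalar) {
        kept.append(scalar)
    }
    return String(kept)
}

let mixed = "start بداية end"

// Latin side: ASCII letters, digits and whitespace.
let latinOnly = keepScalars(of: mixed) { latinAllowed.contains($0) }
    .trimmingCharacters(in: .whitespaces)

// Other side: everything that is not an ASCII letter or digit.
let nonLatinOnly = keepScalars(of: mixed) { !asciiAlphanumerics.contains($0) }
    .trimmingCharacters(in: .whitespaces)

print(latinOnly)
print(nonLatinOnly)
```

Runs of whitespace are not collapsed, and punctuation lands on the non-Latin side, so the sets would need adjusting for real API text.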

How do I split a string into English and non-English using Swift 4?
