How do I get an array of words from a paragraph in Swift?

Time: 2015-07-26 02:44:51

Tags: arrays string swift words

If I have a sentence like the following,

"Hello everyone. This is a sentence."

how can I use Swift to get an array like this?

var words = ["Hello", "everyone", "This", "is", "a", "sentence"]

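I also need a way to remember each word's position in the original string, as well as the positions of the full stops and commas, so that if I put this array of words back into a single string, it would read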

"Hello everyone. This is a sentence."

Thanks!

4 answers:

Answer 0: (score: 3)

Splitting a string into substrings by word is built into Cocoa:

let s = "Hello everyone. This is a sentence."
var arr = [String]()
s.enumerateSubstringsInRange(s.startIndex..<s.endIndex, options: .ByWords) { 
    ss, r, r2, stop in
    arr.append(ss)
}
// now arr is ["Hello", "everyone", "This", "is", "a", "sentence"]

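(The rest of your spec doesn't make sense to me, so I have omitted it. You can still work out where each word came from, because the two ranges r and r2 tell you everything you need to know.)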
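The snippet above uses Swift 1.x syntax. As a minimal sketch of how the word ranges could be used to remember each word's position, here is the same idea in Swift 5, assuming Apple's Foundation. The `WordSpan` alias and the `rebuild` helper are hypothetical names introduced for illustration; they are not part of the answer or of Foundation.

```swift
import Foundation

let s = "Hello everyone. This is a sentence."

// Each word paired with its range in the original string.
typealias WordSpan = (word: String, range: Range<String.Index>)
var spans: [WordSpan] = []

s.enumerateSubstrings(in: s.startIndex..<s.endIndex, options: .byWords) { word, wordRange, _, _ in
    if let word = word {
        spans.append((word: word, range: wordRange))
    }
}

let words = spans.map { $0.word }

// Hypothetical helper: copies the text between words verbatim and puts
// the (possibly edited) words back at their recorded positions.
func rebuild(_ original: String, spans: [WordSpan], words: [String]) -> String {
    var result = ""
    var cursor = original.startIndex
    for (span, word) in zip(spans, words) {
        result.append(contentsOf: original[cursor..<span.range.lowerBound])
        result.append(word)
        cursor = span.range.upperBound
    }
    result.append(contentsOf: original[cursor...])
    return result
}

// Edit one word, then rebuild the sentence with its punctuation intact.
var edited = words
if edited.count > 1 { edited[1] = "world" }
let sentence = rebuild(s, spans: spans, words: edited)
print(words)
print(sentence)
```

Because the punctuation and spaces are copied from the original string rather than stored separately, passing the unmodified `words` array back to `rebuild` should return the original sentence.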

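Answer 1: (score: 1)

This way you can parse the sentence in a structured manner, so that you can modify words and still rebuild the complete sentence at any time.

The way to do this is:

  1. Parse the sentence, in order, into a list of tokens that contains the words, punctuation marks, and spaces.

  2. Wrap each token in a "Token" class that makes access more convenient; specifically, it lets you quickly query whether a token is a word.

  3. Then, whenever you need a list containing only the words, filter with isWord and modify the words as needed.

  4. The modifications are reflected in the tokens array, so when you want to construct the complete sentence again, you only need to join the tokens array.

  5. Implementation: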

    let input = "Hello everyone. This is a sentence."
    
    class Token {
        var text: String
        var isWord: Bool {
            return !(text == " " || text == ".")
        }
        init(text: String) {
            self.text = text
        }
    }
    
    let options: NSLinguisticTaggerOptions = .OmitOther
    let schemes = [NSLinguisticTagSchemeLexicalClass]
    let tagger = NSLinguisticTagger(tagSchemes: schemes, options: Int(options.rawValue))
    let range = NSMakeRange(0, (input as NSString).length)
    tagger.string = input
    var parts : [String] = [] // here we put all the parts including spaces and punctuation signs, such that we can reconstruct sentence at any time
    tagger.enumerateTagsInRange(
        range,
        scheme: NSLinguisticTagSchemeLexicalClass,
        options: options) {
            (tag, tokenRange, _, _) in
    
            let token = (input as NSString).substringWithRange(tokenRange)
            parts.append(token)
    }
    
    println(parts) //"[Hello,  , everyone, .,  , This,  , is,  , a,  , sentence, .]"
    let tokens = parts.map{Token(text: $0)} // wrap the parts in a data structure to handle data more conveniently
    let words = tokens.filter{$0.isWord} // get only the word tokens, if you need them separately
    words[1].text = "world" // manipulate a word - this will reflect in the stored sentence
    let text = join("", tokens.map{$0.text}) // "Hello world. This is a sentence."
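The key point of the steps above is that `Token` is a class, so the filtered `words` array and the full `tokens` array share the same objects. The following Swift 5 sketch illustrates just that mechanism. It swaps the linguistic tagger for a simple hand-written tokenizer, under the assumption that a word is a maximal run of letters or digits; the `tokenize` function and the stored `isWord` flag are illustrative choices, not the answer's code.

```swift
// A token is either a word or a single non-word character
// (space, punctuation mark, and so on).
final class Token {
    var text: String
    let isWord: Bool

    init(text: String, isWord: Bool) {
        self.text = text
        self.isWord = isWord
    }
}

// Hypothetical tokenizer: assumes a word is a run of letters or digits.
// Contractions such as "don't" would be split at the apostrophe.
func tokenize(_ input: String) -> [Token] {
    var tokens: [Token] = []
    for ch in input {
        let isWordCharacter = ch.isLetter || ch.isNumber
        if isWordCharacter, let last = tokens.last, last.isWord {
            last.text.append(ch) // extend the word in progress
        } else {
            tokens.append(Token(text: String(ch), isWord: isWordCharacter))
        }
    }
    return tokens
}

let tokens = tokenize("Hello everyone. This is a sentence.")
let words = tokens.filter { $0.isWord }

// Token is a reference type, so this edit is visible through `tokens` too.
words[1].text = "world"

let text = tokens.map { $0.text }.joined()
print(text) // Hello world. This is a sentence.
```

If `Token` were a struct, `words` would hold copies and the edit would not show up in the joined sentence.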
    

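Answer 2: (score: 0)

Building on @matt's idea, you can use enumerateSubstringsInRange to build the array and also add the punctuation marks to it as separate items. Then create a function that uses a character set to determine whether a string is punctuation, for when you come to use the array. See below: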

func checkCharSet(part:String, cSet:NSCharacterSet) -> Bool{
    // true if part contains at least one character from cSet
    return part.rangeOfCharacterFromSet(cSet) != nil
}

func isPunctuation(part:String) -> Bool{
    let punctSet = NSCharacterSet.punctuationCharacterSet()
    return checkCharSet(part, punctSet)
}    

// Note: despite its name, this checks for alphanumeric characters, not hexadecimal digits.
func isHexadecimal(part:String) -> Bool{
    let alphanumeric = NSCharacterSet.alphanumericCharacterSet()
    return checkCharSet(part, alphanumeric)
}

Then, using these two functions, we can build our array accordingly.

let s = "Hello everyone. This is a sentence."

var arr = [String]()
var part:String = ""


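s.enumerateSubstringsInRange(s.startIndex..<s.endIndex, options: .ByComposedCharacterSequences) { (
    ss, r, r2, stop) -> () in

    if isHexadecimal(ss){
        part += ss
    }else{
        // flush the word collected so far (skip empty runs, e.g. between "." and " ")
        if !part.isEmpty {
            arr.append(part)
        }
        arr.append(ss)
        part = ""
    }
}
// keep a trailing word that is not followed by punctuation or whitespace
if !part.isEmpty {
    arr.append(part)
}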

Print and filter the results:

let wordsOnly = arr.filter({isHexadecimal($0)})
let punctOnly = arr.filter({isPunctuation($0)})

println("\(arr)")
println("\(wordsOnly)")
println("\(punctOnly)")

http://www.swiftstub.com/873988306/?v=gm
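For readers on current toolchains, here is a rough Swift 5 sketch of the same character-set approach, using `CharacterSet` and `rangeOfCharacter(from:)` from Foundation. The helper names `containsCharacter` and `isAlphanumeric` are introduced here for illustration, and the loop iterates over `Character` values directly instead of enumerating composed character sequences.

```swift
import Foundation

// True if `part` contains at least one character from `set`.
func containsCharacter(_ part: String, from set: CharacterSet) -> Bool {
    return part.rangeOfCharacter(from: set) != nil
}

func isPunctuation(_ part: String) -> Bool {
    return containsCharacter(part, from: .punctuationCharacters)
}

func isAlphanumeric(_ part: String) -> Bool {
    return containsCharacter(part, from: .alphanumerics)
}

let s = "Hello everyone. This is a sentence."

var arr: [String] = []
var part = ""

for ch in s {
    let piece = String(ch)
    if isAlphanumeric(piece) {
        part += piece // still inside a word
    } else {
        if !part.isEmpty { arr.append(part) } // flush the finished word
        arr.append(piece)                     // keep the separator as its own item
        part = ""
    }
}
if !part.isEmpty { arr.append(part) } // a word at the very end of the string

let wordsOnly = arr.filter(isAlphanumeric)
let punctOnly = arr.filter(isPunctuation)

print(arr)
print(wordsOnly)
print(punctOnly)
```

Since every separator is kept as its own array element, `arr.joined()` gives back the original string.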

Answer 3: (score: -2)

@Ixx, I don't think your Token class is very useful. Here are the modifications I would make:

import Cocoa

let input = "Hello everyone. This is a sentence."

let options: NSLinguisticTaggerOptions = .OmitOther
let schemes = [NSLinguisticTagSchemeTokenType]
let tagger = NSLinguisticTagger(tagSchemes: schemes, options: Int(options.rawValue))

let range = NSMakeRange(0, (input as NSString).length)
tagger.string = input

var parts: [String] = [] // here we put all the parts including spaces and punctuation signs, such that we can reconstruct sentence at any time
var words: [String] = []

tagger.enumerateTagsInRange(
    range,
    scheme: NSLinguisticTagSchemeTokenType,
    options: options) {

    (tag, tokenRange, _, _) in

        let part = (input as NSString).substringWithRange(tokenRange)
        parts.append(part)

        if tag == "Word" {
            words.append(part)
        }
}


println(parts) 
println(words)

--output:--
[Hello,  , everyone, .,  , This,  , is,  , a,  , sentence, .]
[Hello, everyone, This, is, a, sentence]

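Or, for maximum flexibility, you can add the token type to the Token class and filter the tokens array by the tokenType property. For the scheme NSLinguisticTagSchemeTokenType, the token types are Word, Punctuation, Whitespace, or Other, which are the only token types the OP needs to know about.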

class Token {
    var text: String
    var tokenType: String

    init(text: String, tokenType: String) {
        self.text = text
        self.tokenType = tokenType
    }
}

let input = "Hello everyone. This is a sentence."

let options: NSLinguisticTaggerOptions = .OmitOther
let schemes = [NSLinguisticTagSchemeTokenType]
let tagger = NSLinguisticTagger(tagSchemes: schemes, options: Int(options.rawValue))

let range = NSMakeRange(0, (input as NSString).length)
tagger.string = input

var parts: [String] = [] // here we put all the parts including spaces and punctuation signs, such that we can reconstruct sentence at any time
//var words: [String] = []
var tokens: [Token] = []

tagger.enumerateTagsInRange(
    range,
    scheme: NSLinguisticTagSchemeTokenType,
    options: options) {

    (tag, tokenRange, _, _) in

        let part = (input as NSString).substringWithRange(tokenRange)
        parts.append(part)

        tokens.append(
            Token(text: part, tokenType: tag)
        )
}


println(parts) //"[Hello,  , everyone, .,  , This,  , is,  , a,  , sentence, .]"

let words = tokens.filter({$0.tokenType == "Word"})

println(
    words.map({$0.text})  //[Hello, everyone, This, is, a, sentence]
)
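NSLinguisticTagger has since been deprecated in favor of the Natural Language framework. As a hedged sketch only, the token-type filtering above might look like this in Swift 5 with `NLTagger`, assuming an Apple platform where `NaturalLanguage` is available (macOS 10.14 or iOS 12 and later). The variable names mirror the answer's code.

```swift
import NaturalLanguage

let input = "Hello everyone. This is a sentence."

let tagger = NLTagger(tagSchemes: [.tokenType])
tagger.string = input

var parts: [String] = [] // every token, including whitespace and punctuation
var words: [String] = [] // only the tokens tagged as words

tagger.enumerateTags(in: input.startIndex..<input.endIndex,
                     unit: .word,
                     scheme: .tokenType,
                     options: []) { tag, tokenRange in
    let part = String(input[tokenRange])
    parts.append(part)
    if tag == NLTag.word {
        words.append(part)
    }
    return true // keep enumerating
}

print(parts)
print(words)
```

Passing an empty options set is intended to keep whitespace and punctuation tokens in `parts`, so the sentence can be rebuilt by joining them.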