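If I have a sentence like the following,
"Hello everyone. This is a sentence."
how can I use Swift to get an array like this?
var words = ["Hello", "everyone", "This", "is", "a", "sentence"]
I also need a way to remember where each word was in the original string, as well as the positions of the full stops and commas, so that if I put this array of words back into a string it reads:
"Hello everyone. This is a sentence."
Thanks!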
Answer 0 (score: 3)
Splitting a string into substrings by word is built into Cocoa:
let s = "Hello everyone. This is a sentence."
var arr = [String]()
s.enumerateSubstringsInRange(s.startIndex..<s.endIndex, options: .ByWords) {
    ss, r, r2, stop in
    arr.append(ss)
}
// now arr is ["Hello", "everyone", "This", "is", "a", "sentence"]
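(The rest of your specification makes no sense to me, so I have left it out. You can see how to work out where each word came from, though, because the two ranges r and r2 tell you that.)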
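For readers on current Swift, here is a minimal sketch of the "remember where each word came from" idea. It does not use the Cocoa enumeration call shown above; it assumes a word is simply a run of letters or digits, and wordRanges(in:) is a hypothetical helper name, not a library function.

```swift
// Hypothetical helper: returns every word together with its range in `text`.
// Assumption: a "word" is a maximal run of letters or digits.
func wordRanges(in text: String) -> [(word: String, range: Range<String.Index>)] {
    var result: [(word: String, range: Range<String.Index>)] = []
    var begin: String.Index? = nil
    for i in text.indices {
        let isWordChar = text[i].isLetter || text[i].isNumber
        if isWordChar {
            // Remember where the current word started.
            if begin == nil { begin = i }
        } else if let b = begin {
            // A non-word character closes the current word.
            result.append((word: String(text[b..<i]), range: b..<i))
            begin = nil
        }
    }
    // Flush a word that runs to the end of the string.
    if let b = begin {
        result.append((word: String(text[b..<text.endIndex]), range: b..<text.endIndex))
    }
    return result
}

let s = "Hello everyone. This is a sentence."
let found = wordRanges(in: s)
let words = found.map { $0.word }

// The recorded range lets us edit one word in place, leaving punctuation alone.
var edited = s
edited.replaceSubrange(found[1].range, with: "world")
```

Because the ranges index into the original string, the punctuation and spaces never have to be stored separately. Note that after one replacement the remaining recorded ranges may no longer be valid for the edited copy, so several edits are best applied from the last word backwards.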
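Answer 1 (score: 1)
With this approach you parse the sentence in a structured way, so you can modify words and still rebuild the complete sentence at any time.
The way to do it:
Parse the sentence, in order, into a list of tokens containing the words, punctuation marks, and spaces.
Wrap each token in a "Token" class that makes access more convenient; specifically, one that lets you quickly query whether the token is a word.
Then, whenever you need a list containing only the words, filter with isWord and modify them.
The modifications are reflected in the tokens array, so when you want to build the complete sentence again you only have to join the tokens array.
Implementation: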
import Foundation

let input = "Hello everyone. This is a sentence."

class Token {
    var text: String
    // a token counts as a word unless it is a single space or a full stop
    var isWord: Bool {
        return !(text == " " || text == ".")
    }
    init(text: String) {
        self.text = text
    }
}

let options: NSLinguisticTaggerOptions = .OmitOther
let schemes = [NSLinguisticTagSchemeLexicalClass]
let tagger = NSLinguisticTagger(tagSchemes: schemes, options: Int(options.rawValue))
let range = NSMakeRange(0, (input as NSString).length)
tagger.string = input

var parts: [String] = [] // all the parts, including spaces and punctuation marks, so that we can reconstruct the sentence at any time
tagger.enumerateTagsInRange(
    range,
    scheme: NSLinguisticTagSchemeLexicalClass,
    options: options) {
        (tag, tokenRange, _, _) in
        let token = (input as NSString).substringWithRange(tokenRange)
        parts.append(token)
}
println(parts) //"[Hello, , everyone, ., , This, , is, , a, , sentence, .]"

let tokens = parts.map{Token(text: $0)} // wrap the parts in a data structure to handle the data more conveniently
let words = tokens.filter{$0.isWord} // only the word tokens, if you need them separately
words[1].text = "world" // manipulate a word - this is reflected in the stored sentence
let text = join("", tokens.map{$0.text}) // "Hello world. This is a sentence."
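The code above is written for an early Swift release (println, join). As a rough sketch of the same token idea in current Swift, the version below uses only the standard library. It replaces the linguistic tagger with a hand-rolled scan that treats runs of letters or digits as words, which is an assumption made for illustration; tokenize is a hypothetical helper name.

```swift
// Reference type, so editing a token found through a filtered array
// also changes the token stored in the full array.
final class Token {
    var text: String
    let isWord: Bool
    init(text: String, isWord: Bool) {
        self.text = text
        self.isWord = isWord
    }
}

// Hypothetical helper: splits `input` into word tokens and
// single-character non-word tokens (spaces, punctuation), in order.
func tokenize(_ input: String) -> [Token] {
    var tokens: [Token] = []
    var current = ""
    for ch in input {
        if ch.isLetter || ch.isNumber {
            current.append(ch)
        } else {
            if !current.isEmpty {
                tokens.append(Token(text: current, isWord: true))
                current = ""
            }
            tokens.append(Token(text: String(ch), isWord: false))
        }
    }
    // Flush a word that ends the input.
    if !current.isEmpty {
        tokens.append(Token(text: current, isWord: true))
    }
    return tokens
}

let tokens = tokenize("Hello everyone. This is a sentence.")
let words = tokens.filter { $0.isWord }
words[1].text = "world"                        // edits the shared Token object
let rebuilt = tokens.map { $0.text }.joined()  // joins every token back together
```

One design difference from the answer: isWord is fixed when the token is created instead of being computed from text, so replacing a word with arbitrary text does not change how the token is classified.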
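Answer 2 (score: 0)
Building on @matt's idea, you can use enumerateSubstringsInRange to build the array and also add the punctuation marks to it as separate items. Then create a function that uses a character set to decide whether a string is punctuation when you come to use the array. See below: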
// true if `part` contains at least one character from `cSet`
func checkCharSet(part: String, cSet: NSCharacterSet) -> Bool {
    let check = part.rangeOfCharacterFromSet(cSet)
    return check != nil
}
func isPunctuation(part: String) -> Bool {
    let punctSet = NSCharacterSet.punctuationCharacterSet()
    return checkCharSet(part, punctSet)
}
// note: despite its name, this tests against the alphanumeric character set
func isHexadecimal(part: String) -> Bool {
    let hexadecimal = NSCharacterSet.alphanumericCharacterSet()
    return checkCharSet(part, hexadecimal)
}
Then, using these two functions, we can build our array accordingly.
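let s = "Hello everyone. This is a sentence."
var arr = [String]()
var part: String = ""
s.enumerateSubstringsInRange(s.startIndex..<s.endIndex, options: .ByComposedCharacterSequences) {
    (ss, r, r2, stop) -> () in
    if isHexadecimal(ss) {
        // still inside a word: keep accumulating characters
        part += ss
    } else {
        // a non-word character ends the current word, if there is one
        if !part.isEmpty {
            arr.append(part)
        }
        arr.append(ss)
        part = ""
    }
}
// flush a trailing word when the string does not end in punctuation
if !part.isEmpty {
    arr.append(part)
}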
Print and filter the results:
let wordsOnly = arr.filter({isHexadecimal($0)})
let punctOnly = arr.filter({isPunctuation($0)})
println("\(arr)")
println("\(wordsOnly)")
println("\(punctOnly)")
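For comparison, here is a hedged sketch of the same character-set approach in current Swift with Foundation's CharacterSet. The helper names containsCharacter and isAlphanumeric are chosen for this sketch (the second stands in for the function called isHexadecimal above), and a plain loop over the string's characters replaces the substring enumeration.

```swift
import Foundation

// True if `part` contains at least one character from `set`.
func containsCharacter(_ part: String, from set: CharacterSet) -> Bool {
    return part.rangeOfCharacter(from: set) != nil
}

func isPunctuation(_ part: String) -> Bool {
    return containsCharacter(part, from: .punctuationCharacters)
}

func isAlphanumeric(_ part: String) -> Bool {
    return containsCharacter(part, from: .alphanumerics)
}

let s = "Hello everyone. This is a sentence."
var arr: [String] = []
var part = ""
for ch in s {
    let piece = String(ch)
    if isAlphanumeric(piece) {
        part += piece                          // still inside a word
    } else {
        if !part.isEmpty { arr.append(part) }  // close the current word
        arr.append(piece)                      // keep the separator as its own item
        part = ""
    }
}
if !part.isEmpty { arr.append(part) }          // flush a trailing word

let wordsOnly = arr.filter(isAlphanumeric)
let punctOnly = arr.filter(isPunctuation)
let rebuilt = arr.joined()
```

Since every character of the input lands in exactly one array item, joining the array gives back the original string.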
Answer 3 (score: -2)
@Ixx, I don't think your Token class is very useful. Here are the modifications I would make:
import Cocoa

let input = "Hello everyone. This is a sentence."
let options: NSLinguisticTaggerOptions = .OmitOther
let schemes = [NSLinguisticTagSchemeTokenType]
let tagger = NSLinguisticTagger(tagSchemes: schemes, options: Int(options.rawValue))
let range = NSMakeRange(0, (input as NSString).length)
tagger.string = input

var parts: [String] = [] // here we put all the parts including spaces and punctuation signs, such that we can reconstruct sentence at any time
var words: [String] = []
tagger.enumerateTagsInRange(
    range,
    scheme: NSLinguisticTagSchemeTokenType,
    options: options) {
        (tag, tokenRange, _, _) in
        let part = (input as NSString).substringWithRange(tokenRange)
        parts.append(part)
        if tag == "Word" {
            words.append(part)
        }
}
println(parts)
println(words)
--output:--
[Hello, , everyone, ., , This, , is, , a, , sentence, .]
[Hello, everyone, This, is, a, sentence]
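Alternatively, for maximum flexibility, you can add a token type to the Token class and filter the tokens array on the tokenType property. For the scheme NSLinguisticTagSchemeTokenType, the token types are Word, Punctuation, Whitespace, or Other, which are the only token types the OP needs to know about.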
class Token {
    var text: String
    var tokenType: String
    init(text: String, tokenType: String) {
        self.text = text
        self.tokenType = tokenType
    }
}

let input = "Hello everyone. This is a sentence."
let options: NSLinguisticTaggerOptions = .OmitOther
let schemes = [NSLinguisticTagSchemeTokenType]
let tagger = NSLinguisticTagger(tagSchemes: schemes, options: Int(options.rawValue))
let range = NSMakeRange(0, (input as NSString).length)
tagger.string = input

var parts: [String] = [] // here we put all the parts including spaces and punctuation signs, such that we can reconstruct sentence at any time
//var words: [String] = []
var tokens: [Token] = []
tagger.enumerateTagsInRange(
    range,
    scheme: NSLinguisticTagSchemeTokenType,
    options: options) {
        (tag, tokenRange, _, _) in
        let part = (input as NSString).substringWithRange(tokenRange)
        parts.append(part)
        tokens.append(
            Token(text: part, tokenType: tag)
        )
}
println(parts) //"[Hello, , everyone, ., , This, , is, , a, , sentence, .]"
let words = tokens.filter({$0.tokenType == "Word"})
println(
    words.map({$0.text}) //[Hello, everyone, This, is, a, sentence]
)
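As a small illustration of filtering on a token type, the sketch below models the four types named above as a Swift enum rather than comparing strings. The enum, the Token struct, and the classify function are all hypothetical; classify inspects Character properties as a stand-in for the tags the linguistic tagger would supply, and the parts array is copied from the output shown earlier.

```swift
// Hypothetical enum mirroring the four token types named in the answer.
enum TokenType {
    case word, punctuation, whitespace, other
}

struct Token {
    var text: String
    let type: TokenType
}

// Stand-in for the tagger: classifies an already-split part by its characters.
func classify(_ part: String) -> TokenType {
    guard !part.isEmpty else { return .other }
    if part.allSatisfy({ $0.isWhitespace }) { return .whitespace }
    if part.allSatisfy({ $0.isPunctuation }) { return .punctuation }
    if part.allSatisfy({ $0.isLetter || $0.isNumber }) { return .word }
    return .other
}

let parts = ["Hello", " ", "everyone", ".", " ", "This", " ", "is", " ", "a", " ", "sentence", "."]
let tokens = parts.map { Token(text: $0, type: classify($0)) }

// Filtering on the enum is checked by the compiler, unlike a string such as "Word".
let words = tokens.filter { $0.type == .word }.map { $0.text }
let punctuationCount = tokens.filter { $0.type == .punctuation }.count
```

With an enum, a misspelled type name fails to compile instead of silently matching nothing.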