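I have a lot of text. For example:
I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do i split this.
Output: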
I want to split a paragraph into sentences.
But, there is a problem.
My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2.
How do i split this.
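This is the output I want. Can anyone guide me on how to do this in Swift?
Thanks.
Answer 0 (score: 6)
Use NSLinguisticTagger. It gives the right sentences for your input because it analyzes the text in actual linguistic terms.
Here is a rough draft (Swift 1.2; this will not compile in Swift 2.0):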
let s = "I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do i split this."
var r = [Range<String.Index>]()
let t = s.linguisticTagsInRange(
    indices(s), scheme: NSLinguisticTagSchemeLexicalClass,
    options: nil, tokenRanges: &r)
var result = [String]()
let ixs = Array(enumerate(t)).filter {
    $0.1 == "SentenceTerminator"
}.map {r[$0.0].startIndex}
var prev = s.startIndex
for ix in ixs {
    let r = prev...ix
    result.append(
        s[r].stringByTrimmingCharactersInSet(
            NSCharacterSet.whitespaceCharacterSet()))
    prev = advance(ix,1)
}
Here is a Swift 2.0 version (updated for Xcode 7 beta 6):
let s = "I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do i split this."
var r = [Range<String.Index>]()
let t = s.linguisticTagsInRange(
    s.characters.indices, scheme: NSLinguisticTagSchemeLexicalClass,
    tokenRanges: &r)
var result = [String]()
let ixs = t.enumerate().filter {
    $0.1 == "SentenceTerminator"
}.map {r[$0.0].startIndex}
var prev = s.startIndex
for ix in ixs {
    let r = prev...ix
    result.append(
        s[r].stringByTrimmingCharactersInSet(
            NSCharacterSet.whitespaceCharacterSet()))
    prev = ix.advancedBy(1)
}
And here it is updated for Swift 3:
let s = "I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do i split this."
var r = [Range<String.Index>]()
let t = s.linguisticTags(
    in: s.startIndex..<s.endIndex,
    scheme: NSLinguisticTagSchemeLexicalClass,
    tokenRanges: &r)
var result = [String]()
let ixs = t.enumerated().filter {
    $0.1 == "SentenceTerminator"
}.map {r[$0.0].lowerBound}
var prev = s.startIndex
for ix in ixs {
    let r = prev...ix
    result.append(
        s[r].trimmingCharacters(
            in: NSCharacterSet.whitespaces))
    prev = s.index(after: ix)
}
result is an array of four strings, one sentence per string:
["I want to split a paragraph into sentences.",
"But, there is a problem.",
"My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2.",
"How do i split this."]
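Answer 1 (score: 0)
Here is a rough version of what I believe you're looking for: I loop through the characters looking for the combination ". " (a period followed by a space).
As the loop runs, each character is appended to the currentSentence string. When the combination is found, currentSentence is appended to sentences.
Two special cases also have to be handled. The first is iteration 2 of the loop (index 1): period starts at 0, so period == index-1 would match there without a period having been seen. The second is the last sentence, because no space follows its final period.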
var paragraph = "I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do I split this."
var sentences = [String]()
var sentenceNumber = 0
var currentSentence = ""
let charArray = Array(paragraph)
var period = 0  // index of the most recent "."
for (index, char) in charArray.enumerated() {
    currentSentence += "\(char)"
    if char == "." {
        period = index
        // Last sentence: no space follows the final period.
        if period == charArray.count - 1 {
            sentences.append(currentSentence)
        }
    } else if char == " " && period == index - 1 && index != 1 {
        // A space directly after a period ends the current sentence.
        sentences.append(currentSentence)
        currentSentence = ""
        sentenceNumber += 1
    }
}
Answer 2 (score: 0)
Here is a plain answer in Swift 4:
import Foundation

func splitsentance(string: String) -> [String] {
    let s = string
    var r = [Range<String.Index>]()
    let t = s.linguisticTags(
        in: s.startIndex..<s.endIndex, scheme: NSLinguisticTagScheme.lexicalClass.rawValue,
        options: [], tokenRanges: &r)
    var result = [String]()
    // Start index of every token tagged as a sentence terminator.
    let ixs = t.enumerated().filter {
        $0.1 == "SentenceTerminator"
    }.map {r[$0.0].lowerBound}
    var prev = s.startIndex
    for ix in ixs {
        let r = prev...ix
        result.append(
            s[r].trimmingCharacters(in: CharacterSet.whitespacesAndNewlines))
        // Resume after the terminator so it is not repeated in the next sentence.
        prev = s.index(after: ix)
    }
    return result
}
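Answer 3 (score: 0)
Enumerating the linguistic tags feels like an efficient way to handle this task. We can eliminate the overhead of storing redundant st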
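The code for this answer did not survive, so the following is only a sketch of what enumerating tags without intermediate arrays could look like, not the author's version. It assumes the String method enumerateLinguisticTags(in:scheme:options:orthography:invoking:) from Foundation, whose closure receives the range of the sentence enclosing each token. The helper name splitIntoSentences is hypothetical, and this API may produce deprecation warnings on recent SDKs.

```swift
import Foundation

// Hypothetical helper, not from the original answer.
func splitIntoSentences(_ text: String) -> [String] {
    var result = [String]()
    var lastLowerBound: String.Index? = nil
    text.enumerateLinguisticTags(
        in: text.startIndex..<text.endIndex,
        scheme: NSLinguisticTagScheme.lexicalClass.rawValue,
        options: []
    ) { _, _, sentenceRange, _ in
        // Every token reports the range of its enclosing sentence;
        // record a sentence only the first time its range appears.
        guard sentenceRange.lowerBound != lastLowerBound else { return }
        lastLowerBound = sentenceRange.lowerBound
        let sentence = text[sentenceRange]
            .trimmingCharacters(in: .whitespacesAndNewlines)
        if !sentence.isEmpty {
            result.append(sentence)
        }
    }
    return result
}
```

Compared with collecting all tags and token ranges first, this keeps only the finished sentences and one index in memory.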
Answer 4 (score: 0)
NSLinguisticTagger is deprecated. Use NLTagger instead (iOS 12.0+, macOS 10.14+).
import NaturalLanguage
var str = "I want to split a paragraph into sentences. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2. How do i split this."
func splitSentenceFrom(text: String) -> [String] {
    var result: [String] = []
    let tagger = NLTagger(tagSchemes: [.lexicalClass])
    tagger.string = text
    tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .sentence, scheme: .lexicalClass) { (tag, tokenRange) -> Bool in
        result.append(String(text[tokenRange]))
        return true
    }
    return result
}
let sentences = splitSentenceFrom(text: str)
sentences.forEach {
    print($0)
}
Output:
I want to split a paragraph into sentences.
But, there is a problem.
My paragraph includes dates like Jan.13, 2014 , words like U.A.E and numbers like 2.2.
How do i split this.
Want to skip empty sentences and trim whitespace? Use this inside the closure in place of the result.append line:
let sentence = String(text[tokenRange]).trimmingCharacters(in: .whitespacesAndNewlines)
if !sentence.isEmpty {
    result.append(sentence)
}
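Since only sentence boundaries are needed here and the tags themselves go unused, NLTokenizer with the .sentence unit may be a simpler fit than a tagger. The following is a minimal sketch under that assumption; NLTokenizer is not mentioned in the answer above, and the helper name sentencesUsingTokenizer is made up for illustration.

```swift
import Foundation
import NaturalLanguage

// Hypothetical helper name, chosen for this sketch.
func sentencesUsingTokenizer(in text: String) -> [String] {
    var result: [String] = []
    let tokenizer = NLTokenizer(unit: .sentence)
    tokenizer.string = text
    tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { tokenRange, _ in
        // Trim surrounding whitespace and skip sentences that end up empty.
        let sentence = text[tokenRange]
            .trimmingCharacters(in: .whitespacesAndNewlines)
        if !sentence.isEmpty {
            result.append(sentence)
        }
        return true  // keep enumerating
    }
    return result
}

let paragraph = "I want to split a paragraph into sentences. But, there is a problem."
sentencesUsingTokenizer(in: paragraph).forEach { print($0) }
```

How abbreviations and dates such as U.A.E or Jan.13 are handled depends on the system's language model, so it is worth checking the result against your own text.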