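I have a function that should take a long string and split it into a list of strings, where each list element is one sentence of the article. My plan is to split on spaces and then group the resulting elements according to the tokens that end with a dot: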
def getSentences(article: String): List[String] = {
  val separatedBySpace = article
    .map((c: Char) => if (c == '\n') ' ' else c)
    .split(" ")
  // Indices of the tokens that end a sentence (index i, not a fixed 0)
  val splitAt: List[Int] = Range(0, separatedBySpace.size)
    .filter(i => endsWithDot(separatedBySpace(i))).toList
  ??? // TODO: group separatedBySpace into sentences using splitAt
}
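I have split the string on spaces and found every index at which the list should be grouped. But how do I now turn separatedBySpace into a list of sentences based on splitAt?
Example of how it should work: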
article = "I like donuts. I like cats."
result = List("I like donuts.", "I like cats.")
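PS: Yes, my current algorithm for splitting an article into sentences is flawed; I just want a quick, naive approach that gets the job done.
Answer 0 (score: 0)
I ended up solving this with recursion: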
def getSentenceTokens(article: String): List[List[String]] = {
  val separatedBySpace: List[String] = article
    .replace('\n', ' ')
    .replaceAll(" +", " ") // regex: collapse runs of spaces into one
    .split(" ")
    .toList
  // Start index of each sentence: index 0, plus every index that follows a token ending with a dot
  val splitAt: List[Int] = separatedBySpace.indices
    .filter(i => (i > 0 && endsWithDot(separatedBySpace(i - 1))) || i == 0)
    .toList
  groupBySentenceTokens(separatedBySpace, splitAt, List())
}
def groupBySentenceTokens(tokens: List[String], splitAt: List[Int], sentences: List[List[String]]): List[List[String]] = {
  if (splitAt.size <= 1) {
    if (splitAt.size == 1) {
      // Last start index: the final sentence runs to the end of the tokens
      sentences :+ tokens.slice(splitAt.head, tokens.size)
    } else {
      sentences
    }
  }
  // Otherwise slice from the current start index up to the next one and recurse
  else groupBySentenceTokens(tokens, splitAt.tail, sentences :+ tokens.slice(splitAt.head, splitAt.tail.head))
}
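The same grouping can probably be written without explicit recursion by pairing each start index with the one that follows it. The following is only a minimal sketch: `endsWithDot` is not shown anywhere above, so it is assumed here to be a plain `endsWith(".")` check, and the names `SentenceGrouping` and `groupByBoundaries` are made up for illustration.

```scala
object SentenceGrouping {
  // Assumption: the original helper is not shown; this is a guess at its behavior.
  def endsWithDot(token: String): Boolean = token.endsWith(".")

  // Hypothetical helper: slice `tokens` at each start index listed in `splitAt`.
  def groupByBoundaries(tokens: List[String], splitAt: List[Int]): List[List[String]] = {
    // Pair each start index with the next one; the last slice runs to the end.
    val ends = splitAt.drop(1) :+ tokens.size
    splitAt.zip(ends).map { case (from, until) => tokens.slice(from, until) }
  }

  def getSentences(article: String): List[String] = {
    // Split on any whitespace run, so newlines and repeated spaces need no special handling.
    val tokens = article.split("\\s+").filter(_.nonEmpty).toList
    // A sentence starts at index 0 and after every token that ends with a dot.
    val splitAt = tokens.indices
      .filter(i => i == 0 || endsWithDot(tokens(i - 1)))
      .toList
    groupByBoundaries(tokens, splitAt).map(_.mkString(" "))
  }
}
```

With the example from the question, `SentenceGrouping.getSentences("I like donuts. I like cats.")` should give `List("I like donuts.", "I like cats.")`. Like the recursive version, it treats every token ending in a dot as a sentence end, so abbreviations such as "e.g." would still be split incorrectly.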
Answer 1 (score: 0)
val s: String = """I like donuts. I like cats
This is amazing"""
s.split("\\.|\n").map(_.trim).toList
//result: List[String] = List("I like donuts", "I like cats", "This is amazing")
To keep the dots in the sentences:
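// temp collects the words of the current sentence; result collects finished sentences
val (a, b, _) = s.replace("\n", " ").split(" ")
  .foldLeft((List.empty[String], List.empty[String], "")){
    case ((temp, result, finalStr), word) =>
      if (word.endsWith(".")) {
        (List.empty[String], result ++ List(s"$finalStr${(temp ++ List(word)).mkString(" ")}"), "")
      } else {
        (temp ++ List(word), result, finalStr)
      }
  }
// Append whatever is left after the last dot
val result = b ++ List(a.mkString(" ").trim)
// For the s defined above, the newline has become a space and "cats" has no dot, so:
//result = List("I like donuts.", "I like cats This is amazing")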
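A shorter way to keep the dots might be to split with a regex look-behind, so that the split happens on the whitespace after a dot rather than on the dot itself. This is a sketch under the same naive assumption that every dot ends a sentence; the names `SplitKeepingDots` and `splitSentences` are hypothetical.

```scala
object SplitKeepingDots {
  // Split on whitespace that directly follows a dot; the look-behind (?<=\.)
  // matches without consuming the dot, so it stays attached to its sentence.
  def splitSentences(text: String): List[String] =
    text.trim
      .split("(?<=\\.)\\s+")
      .map(_.replaceAll("\\s+", " ")) // turn remaining newlines into single spaces
      .filter(_.nonEmpty)             // an empty input yields no sentences
      .toList
}
```

For example, `SplitKeepingDots.splitSentences("I like donuts. I like cats.")` should return `List("I like donuts.", "I like cats.")`. Unlike the fold above, it does not append an empty string when the text already ends with a dot.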