Question

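I have a function that should take a long string and split it into a list of strings, where each list element is one sentence of the article. I plan to do this by splitting on spaces and then grouping the resulting tokens according to which of them end with a dot: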

  def getSentences(article: String): List[String] = {
    val separatedBySpace = article
      .map((c: Char) => if (c == '\n') ' ' else c)
      .split(" ")

    val splitAt: List[Int] = Range(0, separatedBySpace.size)
      .filter(i => endsWithDot(separatedBySpace(i))).toList

    // TODO
  }

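I have split the string on spaces and found every index at which the list should be cut. But how do I now turn separatedBySpace into a list of sentences based on splitAt?

Example of how it should work: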

article = "I like donuts. I like cats."
result = List("I like donuts.", "I like cats.")

PS: Yes, I know my algorithm for splitting the article into sentences is flawed; I just want a quick, naive approach that gets the job done.

Answer 1

I ended up solving this with recursion:

  def getSentenceTokens(article: String): List[List[String]] = {
    val separatedBySpace: List[String] = article
      .replace('\n', ' ')
      .replaceAll(" +", " ") // regex
      .split(" ")
      .toList

    val splitAt: List[Int] = separatedBySpace.indices
      .filter(i => ( i > 0 && endsWithDot(separatedBySpace(i - 1)) ) || i == 0)
      .toList

    groupBySentenceTokens(separatedBySpace, splitAt, List())
  }

  def groupBySentenceTokens(tokens: List[String], splitAt: List[Int], sentences: List[List[String]]): List[List[String]] = {
    if (splitAt.size <= 1) {
      if (splitAt.size == 1) {
        sentences :+ tokens.slice(splitAt.head, tokens.size)
      } else {
        sentences
      }
    }
    else groupBySentenceTokens(tokens, splitAt.tail, sentences :+ tokens.slice(splitAt.head, splitAt.tail.head))
  }
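
For comparison, the same index-based grouping can be written without explicit recursion by pairing each start index with the next one. This is only a minimal sketch: endsWithDot is never shown in the thread, so the one-line definition below is an assumption, and splitAtIndices and the SentenceGrouping object are hypothetical names introduced for illustration. It also assumes splitAt is sorted and starts at 0.

```scala
object SentenceGrouping {
  // Assumed helper: the posts above call endsWithDot but never define it.
  def endsWithDot(token: String): Boolean = token.endsWith(".")

  // Cut `tokens` at every index in `splitAt` (assumed sorted, starting at 0).
  def splitAtIndices[A](tokens: List[A], splitAt: List[Int]): List[List[A]] = {
    // Pair each start index with the following one; the last group runs to the end.
    val ends = splitAt.drop(1) :+ tokens.size
    splitAt.zip(ends).map { case (from, until) => tokens.slice(from, until) }
  }

  def getSentences(article: String): List[String] = {
    // Splitting on whitespace runs handles newlines and repeated spaces in one step.
    val tokens = article.trim.split("\\s+").toList.filter(_.nonEmpty)
    // A sentence starts at index 0 and right after every token ending in a dot.
    val splitAt = tokens.indices
      .filter(i => i == 0 || endsWithDot(tokens(i - 1)))
      .toList
    splitAtIndices(tokens, splitAt).map(_.mkString(" "))
  }
}
```

With the example from the question, SentenceGrouping.getSentences("I like donuts. I like cats.") gives List("I like donuts.", "I like cats."). Joining each group with mkString returns List[String] as the question asked, rather than the List[List[String]] of tokens.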

Answer 2

val s: String = """I like donuts. I like cats
                   This is amazing"""

s.split("\\.|\n").map(_.trim).toList
//result: List[String] = List("I like donuts", "I like cats", "This is amazing")

To include the dot in each sentence:

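// Split on any whitespace run so that the newline and the indentation of the
// second line do not produce empty tokens.
val (pending, done) = s.split("\\s+")
  .foldLeft((List.empty[String], List.empty[String])) {
    case ((temp, result), word) =>
      if (word.endsWith(".")) {
        // A word ending in a dot closes the current sentence.
        (List.empty[String], result :+ (temp :+ word).mkString(" "))
      } else {
        (temp :+ word, result)
      }
  }

// Append the words left over after the last dot, if there are any.
val result = if (pending.isEmpty) done else done :+ pending.mkString(" ")
// Only a word ending in "." closes a sentence, so the newline after "cats" does not:
//result = List("I like donuts.", "I like cats This is amazing")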
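
If all you need is to keep the dot, a regex lookbehind may be shorter than the fold. This is a sketch under the same naive assumption as the rest of the thread (a sentence ends at a dot followed by whitespace); the SentenceSplit object name is hypothetical.

```scala
object SentenceSplit {
  // Split on whitespace that directly follows a dot. The lookbehind (?<=\.)
  // is zero-width, so the dot stays attached to the sentence it ends.
  def getSentences(article: String): List[String] =
    article.trim
      .split("(?<=\\.)\\s+")
      .map(_.replaceAll("\\s+", " ")) // collapse newlines and repeated spaces
      .filter(_.nonEmpty)
      .toList
}
```

Like the fold, this treats a newline as ordinary whitespace, so a line without a trailing dot is merged with the text that follows it. It will also split after abbreviations such as "e.g.", which is the known flaw of the naive approach.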

How to split a Scala list into sublists based on a list of indices