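Split a string and get the start index of each segment

Time: 2018-06-13 16:55:29

Tags: scala indexing split

I am trying to split a String and get the start index of every "word".

For example, given a string like this: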

"Rabbit jumped over a fence and this Rabbit loves carrots"

How can I split it so that I get the index of each word?

0,7,14,19,21,27,31,36,43,49

4 answers:

Answer 0 (score: 5)

You can do it like this:

val str="Rabbit jumped over a fence and this Rabbit loves carrots"
val indexArr=str.split(" ").scanLeft(0)((prev,next)=>prev+next.length+1).dropRight(1)

Sample output:

indexArr: Array[Int] = Array(0, 7, 14, 19, 21, 27, 31, 36, 43, 49)
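
As a quick sanity check (a sketch, not part of the original answer), each index can be mapped back to the word that starts there. The object name `SplitIndexCheck` and the helper `wordAt` are made up for illustration, and the approach assumes exactly one space between words.

```scala
object SplitIndexCheck {
  val str = "Rabbit jumped over a fence and this Rabbit loves carrots"

  // Same computation as above: running sum of (word length + 1 for the space),
  // dropping the final entry, which points past the end of the string.
  val indexArr: Array[Int] =
    str.split(" ").scanLeft(0)((prev, next) => prev + next.length + 1).dropRight(1)

  // Hypothetical helper: read the word that starts at index i.
  def wordAt(i: Int): String = str.drop(i).takeWhile(_ != ' ')

  // Should give back the same words that split(" ") produced.
  val recovered: Array[String] = indexArr.map(wordAt)
}
```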

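Answer 1 (score: 3)

This solution works even when the delimiter does not have a constant width (it is not limited to delimiters of length 1).

  1. Instead of a single delimiter FOO, use a combination of lookahead and lookbehind: (?<=FOO)|(?=FOO)
  2. Scan over the tokens and delimiters, accumulating their lengths to obtain the start indices
  3. Throw away every second entry (the delimiters)
  4. In code: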

    val txt = "Rabbit jumped over a fence and this Rabbit loves carrots"
    val pieces = txt.split("(?= )|(?<= )")
    val startIndices = pieces.scanLeft(0){ (acc, w) => acc + w.size }
    val tokensWithStartIndices = (pieces zip startIndices).grouped(2).map(_.head)
    
    tokensWithStartIndices foreach println
    

    Result:

    (Rabbit,0)
    (jumped,7)
    (over,14)
    (a,19)
    (fence,21)
    (and,27)
    (this,31)
    (Rabbit,36)
    (loves,43)
    (carrots,49)
    

    Here are some intermediate outputs, so you can better see what happens at each step:

    scala> val txt = "Rabbit jumped over a fence and this Rabbit loves carrots"
    txt: String = Rabbit jumped over a fence and this Rabbit loves carrots
    
    scala> val pieces = txt.split("(?= )|(?<= )")
    pieces: Array[String] = Array(Rabbit, " ", jumped, " ", over, " ", a, " ", fence, " ", and, " ", this, " ", Rabbit, " ", loves, " ", carrots)
    
    scala> val startIndices = pieces.scanLeft(0){ (acc, w) => acc + w.size }
    startIndices: Array[Int] = Array(0, 6, 7, 13, 14, 18, 19, 20, 21, 26, 27, 30, 31, 35, 36, 42, 43, 48, 49, 56)
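
The same three steps can be tried with a delimiter that is wider than one character. The following is a minimal sketch under assumed inputs: the two-character delimiter ", ", the sample text, and the object name `VariableWidthDelimiter` are chosen for illustration only. It also assumes the text does not start with a delimiter, since the "keep every other piece" step relies on tokens and delimiters alternating from the first piece.

```scala
object VariableWidthDelimiter {
  // Assumed input for illustration: the delimiter is ", " (two characters).
  val txt = "Rabbit, jumped, over"

  // Split before and after each delimiter, so delimiters are kept as pieces.
  val pieces: Array[String] = txt.split("(?=, )|(?<=, )")

  // Running sum of piece lengths gives the start index of every piece.
  val startIndices: Array[Int] = pieces.scanLeft(0)((acc, w) => acc + w.length)

  // Pieces alternate token, delimiter, token, ...; keep only the tokens.
  val tokensWithStartIndices: List[(String, Int)] =
    pieces.zip(startIndices).grouped(2).map(_.head).toList
}
```

Here `tokensWithStartIndices` comes out as (Rabbit,0), (jumped,8), (over,16).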
    

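Answer 2 (score: 2)

This should be accurate even if the line starts with whitespace, or if some words are separated by multiple spaces or tabs. It walks through the String and records every transition from a whitespace character (space, tab, newline, etc.) to a non-whitespace character.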

val txt = "Rabbit jumped over a fence and this Rabbit loves carrots"

txt.zipWithIndex.foldLeft((Seq.empty[Int],true)){case ((s,b),(c,i)) =>
    if (c.isWhitespace) (s,true)
    else if (b) (s :+ i, false)
    else (s,false)
}._1
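
To see the claim about leading whitespace and mixed separators in action, the fold can be wrapped in a function and fed a messier input. This is only a sketch: the names `WordStarts` and `wordStarts` and the test string are assumptions made for illustration.

```scala
object WordStarts {
  // The fold above wrapped in a function; the name `wordStarts` is illustrative.
  def wordStarts(txt: String): Seq[Int] =
    txt.zipWithIndex.foldLeft((Seq.empty[Int], true)) {
      case ((starts, prevWasSpace), (c, i)) =>
        if (c.isWhitespace) (starts, true)           // inside a gap between words
        else if (prevWasSpace) (starts :+ i, false)  // whitespace -> word: record the start
        else (starts, false)                         // still inside the same word
    }._1

  // Two leading spaces, a tab, then three spaces between words.
  val messy: Seq[Int] = wordStarts("  Rabbit\tjumped   over")
}
```

For this input `messy` contains 2, 9 and 18, the positions of "Rabbit", "jumped" and "over".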

Answer 3 (score: 1)

Here is another alternative, combining zipWithIndex and collect:

0 :: str.zipWithIndex.collect { case (' ', i) => i + 1 }.toList

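Prepending the index of the first word is not very elegant, and this only allows delimiters of length 1; but it is reasonably minimal and readable.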
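
To make the length-1 limitation concrete, here is a small sketch that runs the same expression on input with a doubled space. The object name `SingleSpaceLimit`, the function name `starts`, and the sample strings are made up for illustration.

```scala
object SingleSpaceLimit {
  // Same expression as above, wrapped in a function (the name is illustrative).
  def starts(str: String): List[Int] =
    0 :: str.zipWithIndex.collect { case (' ', i) => i + 1 }.toList

  // Single spaces: every reported index is a real word start.
  val ok: List[Int] = starts("a b c")

  // Two spaces between the words: index 2 points at the second space,
  // not at a word, because every space is treated as a full delimiter.
  val off: List[Int] = starts("a  b")
}
```

With single spaces the result is List(0, 2, 4); with the doubled space it is List(0, 2, 3), whereas the actual word starts are 0 and 3.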