Question

The contents of the file are at http://pastebin.com/nAe9q9Kt (I could not include consecutive blank lines in the question).

Here is a screenshot of the file in Sublime Text.

[screenshot]

SPACED INPUT EXAMPLE START

a

b


c

SPACED INPUT EXAMPLE END

Notice that most lines begin with 0 (zero), except for the words ENGINEERS and DOESNT, and that lines are separated by a single blank line and sometimes by two blank lines.

Basically, what I want is:

List(
  List("0MOST PEOPLE", "0BELIEVE", "0THAT"),
  List("0IF IT", "0AINT BROKE", "0DONT FIX IT"),
  List("0BELIEVE", "0THAT", "0IF", "0IT AINT BROKE"),
  List("0IT"),
  List("0HAVE", "0ENOUGH", "0FEATURES YET.")
)

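I wrote a tail-recursive function, and it does work :) but it takes too long (several minutes) on a large file (more than 10K lines).

I thought about using a regex approach, or running a Unix command such as sed or awk from my Scala code to produce a temporary file. My guess is that either would run faster than my current approach.

Can someone help me with the regex?

Here is my tail-recursive Scala code: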

  @scala.annotation.tailrec
  def inner(remainingLines: List[String], previousLineIsBlank: Boolean, frames: List[List[String]], frame: List[String]): List[List[String]] = {
    remainingLines match {
      case Nil => frame :: frames

      case line :: Nil if !previousLineIsBlank =>
        inner(
          remainingLines = Nil,
          previousLineIsBlank = false,
          frames = frame :: frames,
          frame = line :: frame)

      case line :: tail => {
        line match {
          case "" if previousLineIsBlank => // Current line is blank, previous line is blank
            inner(
              remainingLines = tail,
              previousLineIsBlank = true,
              frames = frame :: frames,
              frame = List.empty[String])
          case "" if !previousLineIsBlank => // Current line is blank, previous line is not blank
            inner(
              remainingLines = tail,
              previousLineIsBlank = true,
              frames = frames,
              frame = frame)
          case line if !line.startsWith("0") && previousLineIsBlank => // Current line is not blank and does not start with 0 (ENGINEER, DOESN'T), previous line is blank
            inner(
              remainingLines = tail,
              previousLineIsBlank = false,
              frames = frames,
              frame = frame)
          case line if previousLineIsBlank => // Current line is not blank and starts with 0, previous line is blank
            inner(
              remainingLines = tail,
              previousLineIsBlank = false,
              frames = frames,
              frame = line :: frame)
          case line if !previousLineIsBlank => // Current line is not blank, previous line is not blank
            inner(
              remainingLines = tail,
              previousLineIsBlank = false,
              frames = frames,
              frame = line :: frame)
          case line => sys.error("Unmatched case = " + line)
        }
      }
    }
  }

Answer 1

Here is an awk approach. You will need to find a way to integrate it into your Scala code:

awk '
BEGIN { print "List(" }
/^0/ { 
    printf "  %s", "List("
    for(i = 1; i <= NF; i++) {
        printf "%s%s" ,q $i q,(i==NF?"":", ")
    } 
    print "),"
}
END { print ")" }' RS= FS='\n' q='"'  file

Output with your sample data (from pastebin):

List(
  List("0MOST PEOPLE", "0BELIEVE", "0THAT"),
  List("0IF IT", "0AINT BROKE,", "0DONT FIX IT."),
  List("0BELIEVE", "0THAT", "0IF", "0IT AINT BROKE,"),
  List("0IT"),
  List("0HAVE", "0ENOUGH", "0FEATURES YET."),
)
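For comparison, the same paragraph-mode idea (RS= makes awk treat each blank-line-separated block as one record) can be sketched directly in Scala, which avoids shelling out. This is only an illustration and not part of the original answer: the names ParagraphSplit and frames are made up, and the sketch assumes the whole file fits in memory as a single string with \n line endings.

```scala
object ParagraphSplit {
  // Split the text into blocks on runs of blank lines, then keep only the
  // blocks whose first line starts with '0' (the role of awk's /^0/ pattern).
  def frames(text: String): List[List[String]] =
    text
      .split("\n{2,}")                                   // 2+ newlines = at least one blank line
      .toList
      .map(_.split("\n").toList.filter(_.nonEmpty))      // one entry per non-empty line
      .filter(_.headOption.exists(_.startsWith("0")))    // drops ENGINEERS / DOESNT blocks
}
```

Calling ParagraphSplit.frames on the text of the file (for example, read with scala.io.Source and mkString) would return the nested list as a Scala value rather than as printed text.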

Answer 2

Using awk:

awk 'BEGIN{print "List(" }
{ s=/^[0-9]/?1:0;i=s?i:i+1}
  s{a[i]=a[i]==""?$0:a[i] OFS $0}
END{ for (j=1;j<=i;j++)
        if (a[j]!="")
          { gsub(/\|/,"\",\"",a[j])
            printf "  list(\"%s\")\n", a[j]
          }
     print ")"
    }' OFS="|" file

List(
  list("0MOST PEOPLE","0BELIEVE","0THAT")
  list("0IF IT","0AINT BROKE,","0DONT FIX IT.")
  list("0BELIEVE","0THAT","0IF","0IT AINT BROKE,")
  list("0IT")
  list("0HAVE","0ENOUGH","0FEATURES YET.")
)

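Explanation

s=/^[0-9]/?1:0;i=s?i:i+1 sets two flags (s and i) that are used to detect the start of a new record.
s{a[i]=a[i]==""?$0:a[i] OFS $0} saves each record (records are delimited by lines that do not start with a digit) into the array a.
The rest, in the END block, prints the result in the desired format.
OFS="|" assumes the input file contains no | character; if it does, change it to another character such as @ or #.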
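The flag logic described above can also be written as a single pass over the lines in Scala. The sketch below is an illustration, not part of the original answer: the names DigitGroups and group are hypothetical, and it builds the lists directly instead of joining lines with a separator, so the caveat about | does not apply.

```scala
object DigitGroups {
  // One pass over the lines: a line starting with a digit extends the current
  // group; any other line (blank or a stray word) closes it.
  def group(lines: Iterator[String]): List[List[String]] = {
    val (done, current) =
      lines.foldLeft((List.empty[List[String]], List.empty[String])) {
        case ((groups, cur), line) =>
          if (line.headOption.exists(c => c >= '0' && c <= '9')) (groups, line :: cur) // same record
          else if (cur.nonEmpty) (cur.reverse :: groups, Nil)                          // record ends
          else (groups, cur)                                                           // repeated separator
      }
    // Flush the last group and restore input order (both lists were built by prepending).
    (if (current.nonEmpty) current.reverse :: done else done).reverse
  }
}
```

Because it takes an Iterator[String], it could be fed from scala.io.Source.fromFile(...).getLines() without loading the whole file first.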

Answer 3

val source = """0MOST PEOPLE
0BELIEVE
0THAT


0IF IT
0AINT BROKE,
0DONT FIX IT.


ENGINEERS

0BELIEVE
0THAT
0IF
0IT AINT BROKE,


0IT

DOESNT

0HAVE
0ENOUGH
0FEATURES YET."""

val output = source.split("\n\n").toList   // split on empty lines
  .map(_.split("\n").toList                // split each block on newlines
    .filter(_.startsWith("0")))            // get rid of entries not starting with '0'
  .filter(_.nonEmpty)                      // get rid of possible empty blocks

//output formatted for readability
scala> output: List[List[String]] = List(List(0MOST PEOPLE, 0BELIEVE, 0THAT), 
                                         List(0IF IT, 0AINT BROKE,, 0DONT FIX IT.),
                                         List(0BELIEVE, 0THAT, 0IF, 0IT AINT BROKE,), 
                                         List(0IT), 
                                         List(0HAVE, 0ENOUGH, 0FEATURES YET.))

Update: if you are reading the lines from a file, a plain old imperative approach may work well, especially if the source file is large:

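import scala.collection.mutable.ListBuffer
import scala.io.Source

val lb = ListBuffer[List[String]]()   // completed blocks
val ml = ListBuffer[String]()         // lines of the block being built
for (ll <- Source.fromFile("<yourfile>").getLines()) {
    if (ll.isEmpty) {                 // a blank line closes the current block
        if (ml.nonEmpty) lb += ml.toList
        ml.clear()
    } else if (ll(0) == '0') ml += ll // keep only lines starting with '0'
}
if (ml.nonEmpty) lb += ml.toList      // flush the last block when the file does not end with a blank line
val output = lb.toList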

Answer 4

I'm not very familiar with Scala, but I think this is the regex you're looking for:

([A-Z]+[A-Z ]*)

See it in action: http://regex101.com/r/gY8lX6

Edit: in that case, all you need to do is add a zero at the start of the capture group:

(0[A-Z]+[A-Z ]*)
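In Scala, a pattern like this can be applied with scala.util.matching.Regex. The following is a minimal sketch with hypothetical names (ZeroEntries, entries), not something taken from the answer. It makes two limitations visible: the matches come back as one flat list, so grouping by blank lines still has to be done separately, and because the character class contains only capital letters and spaces, trailing punctuation such as the comma in 0AINT BROKE, is not part of the match.

```scala
import scala.util.matching.Regex

object ZeroEntries {
  // The pattern from the answer: a zero, at least one capital letter,
  // then any mix of capital letters and spaces (never a newline).
  val Entry: Regex = """(0[A-Z]+[A-Z ]*)""".r

  // All matches in the text, in order of appearance.
  def entries(text: String): List[String] = Entry.findAllIn(text).toList
}
```

For instance, ZeroEntries.entries("0IF IT\n0AINT BROKE,\n\nENGINEERS\n\n0IT") yields List("0IF IT", "0AINT BROKE", "0IT").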

Create a List[List[String]] from a text file separated by blank lines (using Regex)
