Create a List[List[String]] from a text file separated by blank lines (using Regex)

Asked: 2014-03-13 22:01:58

Tags: regex scala sed awk

The contents of the file are at http://pastebin.com/nAe9q9Kt (because I cannot have multiple blank lines in the question).

Here is a screenshot of the file in Sublime Text (image not available).

SPACED INPUT EXAMPLE START

a

b


c

SPACED INPUT EXAMPLE END

You can see that most lines begin with 0 (zero), except the words ENGINEERS and DOESNT. The lines are separated sometimes by a single blank line and sometimes by two blank lines.

Basically, what I want is:

List(
  List("0MOST PEOPLE", "0BELIEVE", "0THAT"),
  List("0IF IT", "0AINT BROKE", "0DONT FIX IT"),
  List("0BELIEVE", "0THAT", "0IF", "0IT AINT BROKE"),
  List("0IT"),
  List("0HAVE", "0ENOUGH", "0FEATURES YET.")
)

I tried writing tail-recursive code, and in the end it worked fine :) but it takes too long (several minutes) to run on a huge file (more than 10K lines).

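I have thought about using a Regex approach, or running Unix commands such as sed or awk from the Scala code to generate a temporary file. My guess is that either would run faster than my current approach.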

Can someone help me with the regular expression?

Here is my tail-recursive Scala code:

  @scala.annotation.tailrec
  def inner(remainingLines: List[String], previousLineIsBlank: Boolean, frames: List[List[String]], frame: List[String]): List[List[String]] = {
    remainingLines match {
      case Nil => frame :: frames

      case line :: Nil if !previousLineIsBlank =>
        inner(
          remainingLines = Nil,
          previousLineIsBlank = false,
          frames = frame :: frames,
          frame = line :: frame)

      case line :: tail => {
        line match {
          case "" if previousLineIsBlank => // Current line is blank, previous line is blank
            inner(
              remainingLines = tail,
              previousLineIsBlank = true,
              frames = frame :: frames,
              frame = List.empty[String])
          case "" if !previousLineIsBlank => // Current line is blank, previous line is not blank
            inner(
              remainingLines = tail,
              previousLineIsBlank = true,
              frames = frames,
              frame = frame)
          case line if !line.startsWith("0") && previousLineIsBlank => // Current line is not blank and does not start with 0 (ENGINEERS, DOESNT), previous line is blank
            inner(
              remainingLines = tail,
              previousLineIsBlank = false,
              frames = frames,
              frame = frame)
          case line if previousLineIsBlank => // Current line is not blank and starts with 0, previous line is blank
            inner(
              remainingLines = tail,
              previousLineIsBlank = false,
              frames = frames,
              frame = line :: frame)
          case line if !previousLineIsBlank => // Current line is not blank, previous line is not blank
            inner(
              remainingLines = tail,
              previousLineIsBlank = false,
              frames = frames,
              frame = line :: frame)
          case line => sys.error("Unmatched case = " + line)
        }
      }
    }
  }

4 Answers:

Answer 0 (score: 1)

Here is a way to do it with awk. You will probably need to find a way to incorporate it into your Scala code:

awk '
BEGIN { print "List(" }
/^0/ { 
    printf "  %s", "List("
    for(i = 1; i <= NF; i++) {
        printf "%s%s" ,q $i q,(i==NF?"":", ")
    } 
    print "),"
}
END { print ")" }' RS= FS='\n' q='"'  file

Output using your sample data (from pastebin):

List(
  List("0MOST PEOPLE", "0BELIEVE", "0THAT"),
  List("0IF IT", "0AINT BROKE,", "0DONT FIX IT."),
  List("0BELIEVE", "0THAT", "0IF", "0IT AINT BROKE,"),
  List("0IT"),
  List("0HAVE", "0ENOUGH", "0FEATURES YET."),
)
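
For comparison, awk's paragraph mode (RS=) can be approximated directly in Scala without shelling out. The following is only a minimal sketch: the object name BlankLineBlocks and the method name parse are made up for illustration, and it assumes that blank lines are truly empty (no spaces or tabs) and that the whole file fits in memory as one string.

```scala
object BlankLineBlocks {
  // Split the text on runs of one or more blank lines, then keep only the
  // lines that start with '0'. Blocks that end up empty (for example the
  // ENGINEERS block) are dropped.
  def parse(text: String): List[List[String]] =
    text
      .split("""(?:\r?\n){2,}""")   // two or more consecutive line breaks = at least one blank line
      .toList
      .map(_.split("""\r?\n""").toList.filter(_.startsWith("0")))
      .filter(_.nonEmpty)
}
```

Calling BlankLineBlocks.parse("0A\n0B\n\n\nENGINEERS\n\n0C") gives List(List("0A", "0B"), List("0C")). For a file, the text could be obtained with scala.io.Source.fromFile(...).mkString.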

Answer 1 (score: 1)

Using awk

awk 'BEGIN{print "List(" }
{ s=/^[0-9]/?1:0;i=s?i:i+1}
  s{a[i]=a[i]==""?$0:a[i] OFS $0}
END{ for (j=1;j<=i;j++)
        if (a[j]!="")
          { gsub(/\|/,"\",\"",a[j])
            printf "  list(\"%s\")\n", a[j]
          }
     print ")"
    }' OFS="|" file

List(
  list("0MOST PEOPLE","0BELIEVE","0THAT")
  list("0IF IT","0AINT BROKE,","0DONT FIX IT.")
  list("0BELIEVE","0THAT","0IF","0IT AINT BROKE,")
  list("0IT")
  list("0HAVE","0ENOUGH","0FEATURES YET.")
)

Explanation

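  • s=/^[0-9]/?1:0;i=s?i:i+1: the flags (s and i) are used to detect a new record.
  • s{a[i]=a[i]==""?$0:a[i] OFS $0}: saves each record (delimited by lines that do not start with a number) into the array a.
  • The rest, in END, prints the result in the expected format.
  • OFS="|" assumes the input file contains no | character. If it does, change it to another character such as @ or #.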
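
The same record-detection idea can be sketched in Scala with a fold over the lines. This is an illustrative translation, not the answer's own code: the names DigitRuns and group are hypothetical. A line starting with a digit extends the current record, and any other line (blank, ENGINEERS, DOESNT) closes it.

```scala
object DigitRuns {
  // True when the line starts with an ASCII digit, like /^[0-9]/ in awk.
  private def startsWithDigit(line: String): Boolean =
    line.headOption.exists(c => c >= '0' && c <= '9')

  def group(lines: Iterator[String]): List[List[String]] = {
    val start = (Vector.empty[List[String]], Vector.empty[String])
    val (finished, last) = lines.foldLeft(start) {
      case ((acc, cur), line) if startsWithDigit(line) =>
        (acc, cur :+ line)                                   // extend the current record
      case ((acc, cur), _) =>
        // any other line closes the current record, if there is one
        (if (cur.nonEmpty) acc :+ cur.toList else acc, Vector.empty[String])
    }
    // flush the record still open at end of input
    (if (last.nonEmpty) finished :+ last.toList else finished).toList
  }
}
```

Because it takes an Iterator[String], it can be fed from scala.io.Source.fromFile(...).getLines() without loading the whole file at once.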

Answer 2 (score: 1)

val source = """0MOST PEOPLE
0BELIEVE
0THAT


0IF IT
0AINT BROKE,
0DONT FIX IT.


ENGINEERS

0BELIEVE
0THAT
0IF
0IT AINT BROKE,


0IT

DOESNT

0HAVE
0ENOUGH
0FEATURES YET."""

val output = (for (s <- source.split("\n\n").toList) yield {   // split on empty lines
            s.split("\n").toList                      // split on new lines
            .filter(_.startsWith("0"))}               // get rid of entries not starting with '0'
    ).filter(_.nonEmpty)                              // get rid of possible empty blocks

//output formatted for readability
scala> output: List[List[String]] = List(List(0MOST PEOPLE, 0BELIEVE, 0THAT), 
                                         List(0IF IT, 0AINT BROKE,, 0DONT FIX IT.),
                                         List(0BELIEVE, 0THAT, 0IF, 0IT AINT BROKE,), 
                                         List(0IT), 
                                         List(0HAVE, 0ENOUGH, 0FEATURES YET.))

Update: If you are reading the lines from a file, the old imperative approach will probably work well, especially if the source file is large:

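import scala.collection.mutable.ListBuffer
import scala.io.Source

val lb = ListBuffer[List[String]]()
val ml = ListBuffer[String]()
for (ll <- Source.fromFile("<yourfile>").getLines()) {   // iterate over lines, not characters
    if (ll.isEmpty) {
        if (ml.nonEmpty) lb += ml.toList                 // a blank line closes the current block
        ml.clear()
    } else if (ll(0) == '0') ml += ll                    // keep only lines starting with '0'
}
if (ml.nonEmpty) lb += ml.toList                         // flush the last block if the file has no trailing blank line
val output = lb.toList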

Answer 3 (score: 0)

I'm not very familiar with Scala, but I think this is the regular expression you are looking for:

([A-Z]+[A-Z ]*)

See it in action: http://regex101.com/r/gY8lX6

Edit: In that case, all you need to do is add a zero at the start of the capture group:

(0[A-Z]+[A-Z ]*)
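
One possible way to apply this pattern from Scala is sketched below. The pattern by itself only finds the individual entries, so the sketch first splits the text into blocks on blank lines, which is an added assumption and not part of this answer. The names ZeroRegex and extract are hypothetical.

```scala
object ZeroRegex {
  // The pattern from this answer, compiled with Scala's .r helper.
  private val ZeroLine = """(0[A-Z]+[A-Z ]*)""".r

  def extract(text: String): List[List[String]] =
    text
      .split("""(?:\r?\n){2,}""")                      // blocks separated by one or more blank lines
      .toList
      .map(block => ZeroLine.findAllIn(block).toList)  // every match inside the block
      .filter(_.nonEmpty)                              // drop blocks with no match (ENGINEERS, DOESNT)
}
```

Note that the character classes exclude punctuation, so "0AINT BROKE," is returned as "0AINT BROKE". That happens to agree with the desired output in the question, but it differs from the awk answers, which keep the trailing comma and period. The pattern is also unanchored, so it would match a 0 followed by capital letters in the middle of a line.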