The contents of the file are at: http://pastebin.com/nAe9q9Kt (because I can't have multiple blank lines in a question)
Here is a screenshot from my Sublime Text.
SPACED INPUT EXAMPLE START
a
b
c
SPACED INPUT EXAMPLE END
You can see that most lines begin with 0 (zero), except the words ENGINEERS and DOESNT,
and that lines are separated by a single blank line and sometimes by double blank lines.
Basically what I want is:
List(
List("0MOST PEOPLE", "0BELIEVE", "0THAT"),
List("0IF IT", "0AINT BROKE", "0DONT FIX IT"),
List("0BELIEVE", "0THAT", "0IF", "0IT AINT BROKE"),
List("0IT"),
List("0HAVE", "0ENOUGH", "0FEATURES YET.")
)
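I tried writing tail-recursive code, and it does end up working :) but on a huge file (more than 10K lines) it takes too long to run (several minutes).
I thought about using a regex approach, or running Unix commands such as sed or awk from Scala code to generate a temporary file. My guess is that it would run faster than my current approach.
Can someone help me with the regex?
Here is my tail-recursive Scala code: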
@scala.annotation.tailrec
def inner(remainingLines: List[String], previousLineIsBlank: Boolean, frames: List[List[String]], frame: List[String]): List[List[String]] = {
  remainingLines match {
    case Nil => frame :: frames
    case line :: Nil if !previousLineIsBlank =>
      inner(
        remainingLines = Nil,
        previousLineIsBlank = false,
        frames = frame :: frames,
        frame = line :: frame)
    case line :: tail => {
      line match {
        case "" if previousLineIsBlank => // Current line is blank, previous line is blank
          inner(
            remainingLines = tail,
            previousLineIsBlank = true,
            frames = frame :: frames,
            frame = List.empty[String])
        case "" if !previousLineIsBlank => // Current line is blank, previous line is not blank
          inner(
            remainingLines = tail,
            previousLineIsBlank = true,
            frames = frames,
            frame = frame)
        case line if !line.startsWith("0") && previousLineIsBlank => // Current line is not blank and does not start with 0 (ENGINEERS, DOESNT), previous line is blank
          inner(
            remainingLines = tail,
            previousLineIsBlank = false,
            frames = frames,
            frame = frame)
        case line if previousLineIsBlank => // Current line is not blank and starts with 0, previous line is blank
          inner(
            remainingLines = tail,
            previousLineIsBlank = false,
            frames = frames,
            frame = line :: frame)
        case line if !previousLineIsBlank => // Current line is not blank, previous line is not blank
          inner(
            remainingLines = tail,
            previousLineIsBlank = false,
            frames = frames,
            frame = line :: frame)
        case line => sys.error("Unmatched case = " + line)
      }
    }
  }
}
Answer 0: (score: 1)
Here is a way to do it with awk. You may need to find a way to incorporate it into your scala code:
awk '
BEGIN { print "List(" }
/^0/ {
printf " %s", "List("
for(i = 1; i <= NF; i++) {
printf "%s%s" ,q $i q,(i==NF?"":", ")
}
print "),"
}
END { print ")" }' RS= FS='\n' q='"' file
List(
List("0MOST PEOPLE", "0BELIEVE", "0THAT"),
List("0IF IT", "0AINT BROKE,", "0DONT FIX IT."),
List("0BELIEVE", "0THAT", "0IF", "0IT AINT BROKE,"),
List("0IT"),
List("0HAVE", "0ENOUGH", "0FEATURES YET."),
)
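If shelling out to awk is not convenient, the same paragraph-mode idea can be approximated in plain Scala. The following is only a minimal sketch and not part of the answer above: the names ParagraphFrames and frames are made up for illustration, and it assumes Unix line endings and an input small enough to hold in memory as one string.

```scala
object ParagraphFrames {
  // Rough Scala counterpart of the awk script:
  //   RS=      -> records are blocks separated by one or more blank lines
  //   /^0/     -> keep only records that start with 0
  //   FS='\n'  -> every line of a record is one field
  def frames(text: String): List[List[String]] =
    text
      .dropWhile(_ == '\n')        // awk ignores leading newlines in paragraph mode
      .split("\n{2,}")             // split on runs of blank lines
      .toList
      .filter(_.startsWith("0"))   // drops records such as ENGINEERS
      .map(_.split("\n").toList)   // one list entry per line
}
```

Like the awk version, this tests only the first character of each block, so a line without a leading 0 that sits inside a block starting with 0 would be kept. Calling ParagraphFrames.frames on the file contents returns a List[List[String]] directly, so no temporary file is needed.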
Answer 1: (score: 1)
Using awk
awk 'BEGIN{print "List(" }
{ s=/^[0-9]/?1:0;i=s?i:i+1}
s{a[i]=a[i]==""?$0:a[i] OFS $0}
END{ for (j=1;j<=i;j++)
if (a[j]!="")
{ gsub(/\|/,"\",\"",a[j])
printf " list(\"%s\")\n", a[j]
}
print ")"
}' OFS="|" file
List(
list("0MOST PEOPLE","0BELIEVE","0THAT")
list("0IF IT","0AINT BROKE,","0DONT FIX IT.")
list("0BELIEVE","0THAT","0IF","0IT AINT BROKE,")
list("0IT")
list("0HAVE","0ENOUGH","0FEATURES YET.")
)
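For comparison, the grouping logic of this awk script (a line that does not start with a digit closes the current group, a line that does start with a digit is appended to it) could be sketched in Scala roughly as follows. This is an illustration under assumptions, not the answer's own code: DigitFrames and frames are hypothetical names, and the input is assumed to arrive as an iterator of lines.

```scala
object DigitFrames {
  // True when the line starts with an ASCII digit, like /^[0-9]/ in awk.
  private def startsWithDigit(line: String): Boolean =
    line.headOption.exists(c => c >= '0' && c <= '9')

  def frames(lines: Iterator[String]): List[List[String]] = {
    // Fold state: (finished groups, group being built), both in reverse order.
    val (finished, last) =
      lines.foldLeft((List.empty[List[String]], List.empty[String])) {
        case ((acc, cur), line) =>
          if (startsWithDigit(line)) (acc, line :: cur)     // extend the current group
          else if (cur.nonEmpty) (cur.reverse :: acc, Nil)  // separator closes the group
          else (acc, cur)                                   // repeated separators are ignored
      }
    // Flush the group still open at end of input, then restore input order.
    (if (last.nonEmpty) last.reverse :: finished else finished).reverse
  }
}
```

Because the groups are kept as lists rather than joined strings, this variant does not need a separator character such as |. It can be fed from a file with scala.io.Source.fromFile(path).getLines(), where path is whatever file you are reading.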
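s=/^[0-9]/?1:0;i=s?i:i+1
sets the flags (s and i) that are used to detect a new record.
s{a[i]=a[i]==""?$0:a[i] OFS $0}
saves each record (records are separated by lines that do not start with a number) into the array a.
The rest, in END, prints the result in the expected format.
OFS="|"
assumes there is no | character in the input file; if there is, change it to another character such as @ or #.
Answer 2: (score: 1)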
val source = """0MOST PEOPLE
0BELIEVE
0THAT
0IF IT
0AINT BROKE,
0DONT FIX IT.
ENGINEERS
0BELIEVE
0THAT
0IF
0IT AINT BROKE,
0IT
DOESNT
0HAVE
0ENOUGH
0FEATURES YET."""
val output = (for (s <- source.split("\n\n").toList) yield { // split on empty lines
s.split("\n").toList // split on new lines
.filter(_.headOption.getOrElse("")=='0')} // get rid of entries not starting with '0'
).filter(!_.isEmpty) // get rid of possible empty blocks
//output formatted for readability
scala> output: List[List[String]] = List(List(0MOST PEOPLE, 0BELIEVE, 0THAT),
List(0IF IT, 0AINT BROKE,, 0DONT FIX IT.),
List(0BELIEVE, 0THAT, 0IF, 0IT AINT BROKE,),
List(0IT),
List(0HAVE, 0ENOUGH, 0FEATURES YET.))
Update: If you are reading the lines from a file, the good old imperative approach may work well, especially if the source file is large:
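import scala.collection.mutable.ListBuffer
import scala.io.Source
val lb = ListBuffer[List[String]]() // completed blocks
val ml = ListBuffer[String]() // lines of the block being built
// "<yourfile>" is a placeholder for the path to your input file
for (ll <- Source.fromFile("<yourfile>").getLines()) {
  if (ll.isEmpty) {
    if (ml.nonEmpty) lb += ml.toList
    ml.clear()
  } else if (ll(0) == '0') ml += ll
}
if (ml.nonEmpty) lb += ml.toList // flush the last block if the file does not end with a blank line
val output = lb.toList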
Answer 3: (score: 0)
I'm not very familiar with Scala, but I think this is the regex you are looking for:
([A-Z]+[A-Z ]*)
See it in action: http://regex101.com/r/gY8lX6
Edit: In that case, all you need to do is add a zero at the start of the capture group:
(0[A-Z]+[A-Z ]*)
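Since the regex on its own only finds the individual entries, here is a minimal sketch of how it might be applied from Scala to get the nested lists. It is an illustration only: RegexFrames and frames are hypothetical names, and splitting the input into blocks on blank lines is an assumption added here, not something the regex does.

```scala
object RegexFrames {
  // The pattern from this answer: a 0, at least one upper-case letter,
  // then any run of upper-case letters and spaces.
  private val Entry = "(0[A-Z]+[A-Z ]*)".r

  def frames(text: String): List[List[String]] =
    text
      .split("\n{2,}")                              // assumed: blocks are separated by blank lines
      .toList
      .map(block => Entry.findAllIn(block).toList)  // all entries found in the block
      .filter(_.nonEmpty)                           // drop blocks without any match
}
```

Note that the character class contains only upper-case letters and spaces, so trailing punctuation is not captured: a line such as 0AINT BROKE, yields 0AINT BROKE. Lines such as ENGINEERS produce no match because they lack the leading 0.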