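I have been trying to filter a bunch of text files. It works when I process a single file, but when I try to load a whole directory it stops working as expected. So I guess I am missing something here. Here is the code: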
def nameRegexp: String = { "\\s{14,25}([A-Z]+\\s\\#\\d|[A-Z]+(\\.)?(\\s)?[A-Z]+).*" }
def isName(line: String): Boolean = {
  // Just to debug what's going on.
  if (line.trim.split("\\s").length == 1) {
    val x = nameRegexp.r.unapplySeq(line)
    println(s" ${x} => ${line}")
  }
  nameRegexp.r.unapplySeq(line).isDefined && line.trim.length > 0
}
sc.wholeTextFiles("data")
  .map(x => (x._2.split("\n")))
  .foreach(x => x.foreach(j => isName(j)))
This prints:
None => KYLE
None =>
None => STAN
None =>
None =>
None => CARTMAN
versus
scala> isName(" CARTMAN")
Some(List(CARTMAN, null, null)) => CARTMAN
res11: Boolean = true
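So when I call isName(String) manually, or process a single file, it returns true whenever the regular expression matches the input. When multiple files are processed, however, it inexplicably returns false.
Why does Spark behave this way?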
As requested in the comments:
sc.wholeTextFiles("data").map(x => (x._2.split("\n"))).foreach(x => x.foreach(j => println(s"-- $j")))
prints the contents of the files line by line:
-- KYLE
-- We can eat it at Cartman’s house and
-- see more naughty pictures of his mom!
--
-- CARTMAN
-- Knock it off, you guys!! She said she
-- was young and she needed the money!!
--
-- STAN
-- (Off-screen)
-- Cartman! The pictures were taken like
-- last month!!
Answer 0: (score: 0)
You can use textFile instead of wholeTextFiles.
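The answer's own snippet does not appear in this copy of the thread. The following is only a sketch of what the suggestion could look like, not the answerer's code. It assumes a local SparkSession (Spark 2.0 or later), the "data" directory from the question, and a hypothetical object name FilterNames. isName is the question's function without the debug printing.

```scala
import org.apache.spark.sql.SparkSession

object FilterNames {
  // Same pattern as in the question, compiled once.
  val nameRegexp = "\\s{14,25}([A-Z]+\\s\\#\\d|[A-Z]+(\\.)?(\\s)?[A-Z]+).*".r

  // True when the whole line matches the pattern and is not blank.
  def isName(line: String): Boolean =
    nameRegexp.unapplySeq(line).isDefined && line.trim.nonEmpty

  def main(args: Array[String]): Unit = {
    // Assumption: local mode; in spark-shell, `sc` already exists.
    val spark = SparkSession.builder()
      .appName("FilterNames")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // textFile yields one record per line across every file in the directory,
    // so there is no manual split("\n") step.
    val names = sc.textFile("data").filter(isName).map(_.trim).collect()
    names.foreach(println)

    spark.stop()
  }
}
```

In spark-shell the core of this reduces to sc.textFile("data").filter(isName), reusing the predefined sc.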
Output reported in the answer:
Some(List(CARTMAN, null, null)) => CARTMAN
Some(List(CARTMAN, null, null)) => CARTMAN
Some(List(CARTMAN, null, null)) => CARTMAN
Some(List(CARTMAN, null, null)) => CARTMAN
res33: Array[Boolean] = Array(true, true, true, true)
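The answer does not say why wholeTextFiles behaves differently. One possible explanation, which nothing in the thread confirms, is Windows line endings. wholeTextFiles returns each file's raw content, so split("\n") leaves a trailing "\r" on every line. In Java regular expressions "." does not match "\r" by default, and unapplySeq requires the whole string to match, so the final ".*" cannot consume that character. textFile, as far as I know, reads through Hadoop's line reader, which treats "\r\n" as a line terminator and drops it. The sketch below reproduces the effect without Spark. The sample content, the 20-space indent, and the object name are made up for illustration.

```scala
object CarriageReturnDemo {
  // Same pattern as in the question.
  val nameRegexp = "\\s{14,25}([A-Z]+\\s\\#\\d|[A-Z]+(\\.)?(\\s)?[A-Z]+).*".r

  def isName(line: String): Boolean =
    nameRegexp.unapplySeq(line).isDefined && line.trim.nonEmpty

  def main(args: Array[String]): Unit = {
    // Hypothetical file content with Windows (\r\n) line endings.
    val indent = " " * 20
    val content = s"${indent}KYLE\r\nWe can eat it\r\n${indent}CARTMAN\r\n"

    // Splitting on "\n" alone keeps the trailing '\r' on each line.
    val naive = content.split("\n")
    // Splitting on an optional '\r' followed by '\n' removes it.
    val robust = content.split("\\r?\\n")

    println(naive.map(isName).toList)  // List(false, false, false)
    println(robust.map(isName).toList) // List(true, false, true)
  }
}
```

If this is the cause, replacing split("\n") with split("\\r?\\n") in the wholeTextFiles version should behave like textFile. Printing line.length next to each line in the debug output is a quick way to check for a hidden trailing character.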