Filtering lines of a bunch of text files using Spark

Time: 2017-09-20 09:47:36

Tags: scala apache-spark

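I have been trying to filter the lines of a bunch of text files. It works when I process a single file, but when I try to load a directory, it stops working as expected, so I guess I am missing something here. Here is the code: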

  def nameRegexp: String = { "\\s{14,25}([A-Z]+\\s\\#\\d|[A-Z]+(\\.)?(\\s)?[A-Z]+).*" }

  def isName(line: String) : Boolean = {
    // Just to debug what's going on.
    if(line.trim.split("\\s").length == 1) {
      val x = nameRegexp.r.unapplySeq(line)
      println(s" ${x} => ${line}")
    }
    nameRegexp.r.unapplySeq(line).isDefined && line.trim.length > 0
  }

  sc.wholeTextFiles("data")
    .map(x => (x._2.split("\n")))
    .foreach(x => x.foreach(j => isName(j)))

This prints:

 None =>                     KYLE
 None => 
 None =>                     STAN
 None => 
 None => 
 None =>                     CARTMAN

Versus:

scala> isName("                     CARTMAN")
 Some(List(CARTMAN, null, null)) =>                      CARTMAN
res11: Boolean = true

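So when I call isName(String) manually, or process a single file, it returns true whenever the regular expression matches the input. When processing multiple files, however, it unexpectedly returns false.

Why does Spark behave that way?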

Clarification

As requested in the comments:

sc.wholeTextFiles("data").map(x => (x._2.split("\n"))).foreach(x => x.foreach(j => println(s"-- $j")))

prints the contents of the files line by line:

--                     KYLE
-- 
                We can eat it at Cartman’s house and
--              see more naughty pictures of his mom!
-- 
--                        CARTMAN
--              Knock it off, you guys!! She said she
--              was young and she needed the money!!
-- 
--                        STAN
--                  (Off-screen)
--              Cartman! The pictures were taken like
--              last month!!

1 Answer:

Answer 0: (score: 0)

You can use textFile instead of wholeTextFiles, which gives the following output:

 Some(List(CARTMAN, null, null)) =>                      CARTMAN
 Some(List(CARTMAN, null, null)) =>                      CARTMAN
 Some(List(CARTMAN, null, null)) =>                      CARTMAN
 Some(List(CARTMAN, null, null)) =>                      CARTMAN
res33: Array[Boolean] = Array(true, true, true, true)
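
The thread does not say why textFile behaves differently. One plausible cause, which is an assumption and not confirmed by the poster, is Windows-style (CRLF) line endings. wholeTextFiles returns each file's raw content, so split("\n") leaves a trailing '\r' on every line. The regex ends in `.*`, and `.` does not match '\r' by default, so the full-match check in unapplySeq fails. textFile strips line terminators, including CRLF, before the lines reach isName. The sketch below reproduces this in plain Scala without Spark; the object name, the splitLines helper, and the sample content are made up for illustration.

```scala
object LineEndingDemo {
  // Same pattern as in the question, compiled once instead of on every call.
  private val NameRegexp =
    "\\s{14,25}([A-Z]+\\s\\#\\d|[A-Z]+(\\.)?(\\s)?[A-Z]+).*".r

  // unapplySeq only succeeds when the whole line matches the pattern.
  def isName(line: String): Boolean =
    line.trim.nonEmpty && NameRegexp.unapplySeq(line).isDefined

  // Hypothetical helper: split file content on LF or CRLF.
  def splitLines(content: String): Array[String] = content.split("\\r?\\n")

  def main(args: Array[String]): Unit = {
    val indent = " " * 21
    // Made-up sample standing in for a file saved with CRLF line endings.
    val content = s"${indent}KYLE\r\n\r\n${indent}CARTMAN\r\n"

    // Splitting on "\n" alone leaves a trailing '\r' that '.' cannot match.
    println(content.split("\n").map(isName).mkString(", "))
    // Splitting on an optional '\r' followed by '\n' removes it.
    println(splitLines(content).map(isName).mkString(", "))
  }
}
```

Running main prints `false, false, false` for the first split and `true, false, true` for the second. If the files in the question do use CRLF endings, changing split("\n") to split("\\r?\\n") in the wholeTextFiles version should have the same effect as switching to textFile.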

