Question

I have the following source file:

id,name,year,rating,duration
1,The Nightmare Before Christmas,1993,3.9,4568
2,The Mummy,1932,3.5,4388
3,Orphans of the Storm,1921,3.2,9062
4,The Object of Beauty,1991,2.8,6150
5,Night Tide,1963,2.8,5126
6,One Magic Christmas,1985,3.8,5333

I am trying to filter all rows where year = 2012, and the following works:

c.map(_.split(",")).filter(x=>x(2).toInt==2012)

But how can I achieve the same thing using the placeholder syntax (_)?

I can already use the _ syntax in the map function, e.g. rdd.map(_.split(",")).

Please advise.

Answer 1

This is what you are looking for:

c.map(_.split(",")).filter(_(2).toInt==2012)
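For illustration, here is a minimal sketch of why the two forms are equivalent: `_(2).toInt == 2012` expands to `x => x(2).toInt == 2012`. It assumes a plain Scala `Seq` in place of the RDD `c` so that it runs without Spark, and it adds one made-up row for 2012, because none of the sample rows in the question has that year.

```scala
object PlaceholderFilter {
  // Rows in the same id,name,year,rating,duration layout as the question.
  val lines: Seq[String] = Seq(
    "1,The Nightmare Before Christmas,1993,3.9,4568",
    "2,The Mummy,1932,3.5,4388",
    "7,Hypothetical Movie,2012,3.0,6000" // made-up row so the filter has a match
  )

  // Explicit lambda, as written in the question.
  val explicit: Seq[Seq[String]] =
    lines.map(_.split(",")).filter(x => x(2).toInt == 2012).map(_.toSeq)

  // Placeholder form: the underscore stands for the single lambda parameter.
  val placeholder: Seq[Seq[String]] =
    lines.map(_.split(",")).filter(_(2).toInt == 2012).map(_.toSeq)

  def main(args: Array[String]): Unit =
    placeholder.foreach(println)
}
```

Note that each `_` introduces a separate parameter, so the placeholder form only works when the argument is used exactly once. A predicate such as `x => x(2).toInt >= 2000 && x(2).toInt <= 2012` still needs a named parameter.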

However, I would suggest using Spark-CSV to read a CSV file like this one:

val df1 = spark.read.option("inferSchema", true)
              .option("header",true)
              .option("delimiter", ",")
              .csv("data1.csv")

Then you can filter easily (outside spark-shell, the $ syntax needs import spark.implicits._):

df1.filter($"year" === "2012")

Hope this helps.

Answer 2

You can use the placeholder simply by doing the following:

c.map(_.split(",")).filter(_(2).toInt==2012).map(_.toSeq).foreach(println)

But if you know that your data has a fixed number of fields, I would suggest using a case class:

case class row(id: String,
             name: String,
             year: String,
             rating: String,
             duration: String)

You can use it as follows:

    c.map(_.split(",", -1)).map(array => row(array(0),array(1),array(2),array(3),array(4))).filter(x => x.year.toInt == 2012).foreach(println)
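Since the original question is about the placeholder syntax, it may be worth noting that the filter on the case class can use it too, because the parameter appears only once. The sketch below assumes a plain Scala `Seq` instead of the RDD `c`, and renames the case class to `Row` (a hypothetical name, chosen only to follow Scala naming conventions).

```scala
object CaseClassFilter {
  // Same five String fields as the answer's `row`; `Row` is a hypothetical rename.
  final case class Row(id: String, name: String, year: String, rating: String, duration: String)

  val lines: Seq[String] = Seq(
    "1,The Nightmare Before Christmas,1993,3.9,4568",
    "6,One Magic Christmas,1985,3.8,5333"
  )

  // The mapping needs a named parameter because the array is used five times.
  val rows: Seq[Row] =
    lines.map(_.split(",", -1)).map(a => Row(a(0), a(1), a(2), a(3), a(4)))

  // `_.year.toInt == 1985` expands to `x => x.year.toInt == 1985`.
  val from1985: Seq[Row] = rows.filter(_.year.toInt == 1985)

  def main(args: Array[String]): Unit =
    from1985.foreach(println)
}
```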

Or, to be safe, you can wrap each field in Option:

c.map(_.split(",", -1)).map(array => {
  row(Option(array(0)) getOrElse "",
    Option(array(1)) getOrElse "",
    Option(array(2)) getOrElse "",
    Option(array(3)) getOrElse "",
    Option(array(4)) getOrElse "")
  })
  .filter(x => x.year.toInt == 2012)
  .foreach(println)
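One caveat: `Option(array(4))` only guards against a null element. If a line has fewer than five fields, `array(4)` throws before `Option` is ever applied, and `toInt` still throws on a non-numeric year. A possible alternative, sketched here on a plain Scala `Seq` with hypothetical names (`Row`, `toRow`, `isFromYear`) and two made-up malformed rows, is to use `lift` for the field access and `Try` for the conversion:

```scala
import scala.util.Try

object SafeRows {
  final case class Row(id: String, name: String, year: String, rating: String, duration: String)

  // `lift` returns None for an index past the end, so a missing field becomes "".
  def toRow(line: String): Row = {
    val fields = line.split(",", -1)
    def field(i: Int): String = fields.lift(i).getOrElse("")
    Row(field(0), field(1), field(2), field(3), field(4))
  }

  // Try(...).toOption is None for "" or any non-numeric year, so such rows never match.
  def isFromYear(row: Row, year: Int): Boolean =
    Try(row.year.toInt).toOption.contains(year)

  val lines: Seq[String] = Seq(
    "1,The Nightmare Before Christmas,1993,3.9,4568",
    "2,The Mummy",                         // made-up truncated row
    "3,Orphans of the Storm,n/a,3.2,9062"  // made-up row with a non-numeric year
  )

  // `isFromYear(_, 1993)` expands to `x => isFromYear(x, 1993)`.
  val from1993: Seq[Row] = lines.map(toRow).filter(isFromYear(_, 1993))
}
```

Whether malformed rows should be padded with empty strings, as here, or dropped entirely depends on the data. This sketch follows the answer's choice of defaulting to "".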
