I am trying to create Spark Scala code that can read any file, regardless of its number of columns. I can generate the Scala/Spark code dynamically, then compile and execute it. Do I really need SBT? What is the best way to achieve this?
When I run the Scala code with a shell script, or with scalac code.scala, it says:
hadoop@namenode1:/usr/local/scala/examples$ ./survey.sh
/usr/local/scala/examples/./survey.sh:6: error: not found: value spark
val survey = spark.read.format("com.databricks.spark.csv").option("header","true").option("nullValue","NA").option("timestampFormat","yyyy-MM-dd'T'HH:mm:ss").option("mode","failfast").option("inferchema","true").load("/tmp/survey.csv")
^
/usr/local/scala/examples/./survey.sh:19: error: not found: type paste
:paste
^
/usr/local/scala/examples/./survey.sh:37: error: not found: value udf
val parseGenderUDF = udf( parseGender _ )
^
three errors found
I want something like this: dynamically generate file.scala using a shell script, then compile it with
scalac file.scala
and then execute it with
scala file.scala
But is this possible? How can it be done?
hadoop@namenode1:/usr/local/spark/examples/src/main/scala/org/apache/spark/examples$ cat Survey.scala
import org.apache.spark.sql.{SparkSession}

object Survey {
  def main(args: Array[String]) {
    val spark = SparkSession.builder
      .master("local")
      .appName("Survey")
      .getOrCreate()
    val survey = spark.read.format("com.databricks.spark.csv")
      .option("header", "true")
      .option("nullValue", "NA")
      .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
      .option("mode", "failfast")
      .option("inferchema", "true")
      .load("/tmp/survey.csv")
    survey.show()
  }
}
I get errors when compiling it:
hadoop@namenode1:/usr/local/spark/examples/src/main/scala/org/apache/spark/examples$ scalac Survey.scala
Survey.scala:1: error: object apache is not a member of package org
import org.apache.spark.sql.{SparkSession}
^
Survey.scala:5: error: not found: value SparkSession
val spark= SparkSession.builder
^
two errors found
hadoop@namenode1:/usr/local/spark/examples/src/main/scala/org/apache/spark/examples$
Answer 0 (score: 0)
To submit a Spark job, you must either use the spark-submit command or execute the Scala script inside spark-shell. Apache Livy also provides a REST API for submitting Spark jobs.
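A minimal sketch of that workflow, assuming SPARK_HOME points at the Spark installation, that the JVM expands the quoted classpath wildcard, and that survey.jar is a placeholder output name: plain scalac cannot see the Spark classes (hence the "object apache is not a member of package org" error), so compile against Spark's jars and let spark-submit supply the runtime classpath.

```shell
# Compile the object against the Spark jars; plain `scalac Survey.scala`
# fails because org.apache.spark is not on the compiler classpath.
scalac -classpath "$SPARK_HOME/jars/*" -d survey.jar Survey.scala

# Run it through spark-submit, which provides the Spark runtime.
spark-submit --class Survey --master local survey.jar
```

This avoids SBT entirely, at the cost of managing the classpath yourself; SBT (or spark-shell with :load) mainly saves you that bookkeeping.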
Answer 1 (score: 0)
You need to create a SparkSession first. Example:
import org.apache.spark.sql.{SparkSession}

val spark = SparkSession.builder
  .master("local")
  .appName("MYAPP")
  .getOrCreate()

val survey = spark.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("nullValue", "NA")
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
  .option("mode", "failfast")
  .option("inferSchema", "true")  // note: "inferSchema", not "inferchema"
  .load("/tmp/survey.csv")

// For udf you need:
import org.apache.spark.sql.functions._
val parseGenderUDF = udf( parseGender _ )
I hope this helps.
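The parseGender function itself is never shown in the question, so here is a hypothetical stand-in, just to make the udf line above compile: a plain Scala function that normalizes free-text gender values from the survey CSV. The name and the normalization rules are assumptions, not the asker's actual code.

```scala
// Hypothetical helper (not in the original question): normalize a
// free-text gender column to one of "Male", "Female", or "Unknown".
object GenderParser {
  def parseGender(g: String): String = {
    val s = Option(g).getOrElse("").trim.toLowerCase
    if (s.startsWith("m")) "Male"
    else if (s.startsWith("f")) "Female"
    else "Unknown"
  }
}
```

Because it is an ordinary function, it can be unit-tested without a SparkSession and then wrapped with udf(GenderParser.parseGender _) when registering it against a DataFrame.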