CSV sample (EMPLOYEE.CSV):
emp_name,emp_badge,door_number,date_time,usage_type
Jean-Paul Ranu,24441foobar,5,22:36:27,ENTRANCE
Raoul Raoul,7555foobar,5,01:08:49,ENTRANCE
Henri Papier,66686foobar,4,03:13:16,ENTRANCE
Gilles Fernandez,36664foobar,3,20:55:11,ENTRANCE
Jean Bono,27775foobar,4,18:45:42,EXIT
Laure Eal,53450foobar,1,13:42:12,ENTRANCE
Spark-Scala code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

object MonObjet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local")
      .appName("monTruc")
      .getOrCreate()

    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("myApp")
      .set("spark.driver.allowMultipleContexts", "true")
    val sc = new SparkContext(conf)

    val df = spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("sep", ",")
      .load("C:/Users/Houssemus/Desktop/emp_data.csv")

    df.createOrReplaceTempView("employee")
    val req = spark.sql("SELECT COUNT(emp_name) FROM employee").show()
    // df.show()
  }
}
I imported a CSV file created with Python so that I could preprocess it in Spark using Scala. After the import I can see the data, but the query returns zero.
Answer 0 (score: 4)
In recent versions of Spark there is no need to define a SparkContext the way sc is defined here; the SparkSession already manages one internally. By building a SparkSession and then creating a second SparkContext, you end up with a misconfigured setup. Removing the definition of val sc (and the SparkConf used to build it) fixes this.
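With that fix applied, a minimal sketch of the cleaned-up program might look like this (the file path is the one from the question; adjust it to your environment):

```scala
import org.apache.spark.sql.SparkSession

object MonObjet {
  def main(args: Array[String]): Unit = {
    // A single SparkSession is enough; it wraps the SparkContext internally.
    val spark = SparkSession.builder
      .master("local")
      .appName("monTruc")
      .getOrCreate()

    val df = spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .option("sep", ",")
      .load("C:/Users/Houssemus/Desktop/emp_data.csv")

    df.createOrReplaceTempView("employee")
    spark.sql("SELECT COUNT(emp_name) FROM employee").show()

    spark.stop()
  }
}
```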
As for the second question, try the following query.
val req = spark.sql("""
SELECT door_number,
Count(door_number) AS count
FROM employee
WHERE usage_type = 'ENTRANCE'
GROUP BY door_number
ORDER BY count DESC
""").show()
It will give a result like this:
+-----------+-----+
|door_number|count|
+-----------+-----+
| 5| 2|
| 3| 1|
| 1| 1|
| 4| 1|
+-----------+-----+