I am trying to create a function that fetches data from a relational database and inserts it into a Hive table. Since I am using Spark 1.6, I need to register a temporary table first, because writing the DataFrame directly as a Hive table is not compatible with Hive:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

spark_conf = SparkConf()
sc = SparkContext(conf=spark_conf)
sqlContext = HiveContext(sc)
query = "(select * from employees where emp_no < 10008) as emp_alias"
df = sqlContext.read.format("jdbc").option("url", url) \
    .option("dbtable", query) \
    .option("user", user) \
    .option("password", pswd).load()
df.registerTempTable('tempEmp')
sqlContext.sql('insert into table employment_db.hive_employees select * from tempEmp')
The employees table in the RDB contains a few thousand records. After my program runs, I can see that two Parquet files were created. Yet when I try to select from the Hive table after the job completes, records are missing.
I have several ideas about what might be causing the problem:

- Is it caused by lazy evaluation of registerTempTable? Does Spark assume I will never use those records? I am familiar with lazy evaluation in generators, but I cannot imagine how lazy evaluation would work inside the registerTempTable function.
- Is the temporary table kept in a tmp folder, and could records be lost because it runs out of space? Should I be calling the dropTempTable function?
- Would createOrReplaceTempView be safer (although registerTempTable is only deprecated in Spark 2)?
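For comparison, the generator-style laziness mentioned above can be sketched in plain Python (this is only an illustration of deferred evaluation, not Spark code):

```python
def read_records():
    # The body of a generator does not run until values are pulled from it.
    print("reading")          # side effect marks when evaluation happens
    yield from [1, 2, 3]

gen = read_records()          # nothing is printed yet: evaluation is deferred
records = list(gen)           # "reading" is printed now, and values are produced
print(records)                # [1, 2, 3]
```

A DataFrame's transformations are deferred in a loosely similar way, but registering a temp table does not discard data; the records exist once an action materializes them.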
Answer 0 (score: 0)
You could look at df.saveAsTable("db.tempEmp")
1. Create a new file employee.txt with the following content.
[root@quickstart spark]# vi employee.txt
Name, Age
Vinayak, 35
Nilesh, 37
Raju, 30
Karthik, 28
Shreshta,1
Siddhish, 2
2. Execute the following commands in spark-shell:
val employee = sc.textFile("file:///home/cloudera/workspace/spark/employee.txt")
val employeefirst = employee.first  // the header line: "Name, Age"
val employeeMap = employee.
  filter(e => e != employeefirst).  // drop the header row
  map(e => {
    val splitted = e.split(",")
    val name = splitted(0).trim
    // fall back to age 0 if the field is missing or not a number
    val age = scala.util.Try(splitted(1).trim.toInt).getOrElse(0)
    (name, age)
  })
val employeeDF = employeeMap.toDF("Name", "age")
employeeDF.show()
+--------+---+
| Name|age|
+--------+---+
| Vinayak| 35|
| Nilesh| 37|
| Raju| 30|
| Karthik| 28|
|Shreshta| 1|
|Siddhish| 2|
+--------+---+
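The parsing in step 2 (skip the header, split on commas, default to age 0 when the value is not a number) can be sketched without Spark in plain Python; the sample lines below are taken from employee.txt:

```python
# Mirror of the spark-shell parsing logic: drop the header, split each
# "name, age" line, and default age to 0 when it cannot be parsed.
lines = [
    "Name, Age",
    "Vinayak, 35",
    "Shreshta,1",
]

header = lines[0]
employees = []
for line in lines:
    if line == header:
        continue  # skip the header row
    name, raw_age = [part.strip() for part in line.split(",")]
    try:
        age = int(raw_age)
    except ValueError:
        age = 0   # same fallback as scala.util.Try(...).getOrElse(0)
    employees.append((name, age))

print(employees)  # [('Vinayak', 35), ('Shreshta', 1)]
```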
3. Create a new database (optional if you already have one).
hive> create database employeetest;
OK
Time taken: 0.325 seconds
hive> use employeetest;
OK
Time taken: 0.153 seconds
4. Create the table in the employeetest database.
scala> employeeDF.saveAsTable("employeetest.Employee")
hive> show tables;
OK
employee
Time taken: 0.171 seconds, Fetched: 1 row(s)
hive> select * from employee;
OK
Vinayak 35
Nilesh 37
Raju 30
Karthik 28
Shreshta 1
Siddhish 2
Time taken: 0.462 seconds, Fetched: 6 row(s)
Alternatively, you can create the table from spark-shell with the following approach:
scala> employeeDF.registerTempTable("employeetohive")
scala> employeeDF.sqlContext.sql("select * from employeetohive").show
+--------+---+
| Name|age|
+--------+---+
| Vinayak| 35|
| Nilesh| 37|
| Raju| 30|
| Karthik| 28|
|Shreshta| 1|
|Siddhish| 2|
+--------+---+
scala> employeeDF.sqlContext.sql("create table employeetest.employeefromdf select * from employeetohive").show
hive> show tables;
OK
employee
employeefromdf
Time taken: 0.101 seconds, Fetched: 2 row(s)
hive> select * from employeefromdf;
OK
Vinayak 35
Nilesh 37
Raju 30
Karthik 28
Shreshta 1
Siddhish 2
Time taken: 0.246 seconds, Fetched: 6 row(s)