Related to "save to JDBC": I'm trying to import a text file and save it into Hive via JDBC so that a reporting tool can import it.
We are running spark-1.5.1-bin-hadoop2.6 (master + 1 slave), the JDBC thrift server, and the beeline client. They all appear to connect and talk to each other. As I understand it, Hive is included in this Spark release via the datanucleus jars. I have configured a directory to hold the Hive files, but there is no conf/hive-config.xml.
Simple input CSV file:
Administrator,FiveHundredAddresses1,92121
Ann,FiveHundredAddresses2,92109
Bobby,FiveHundredAddresses3,92101
Charles,FiveHundredAddresses4,92111
The users table was created beforehand in the beeline client using:
CREATE TABLE users(first_name STRING, last_name STRING, zip_code STRING);
show tables; // it's there
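As an aside, if comma-separated text files are ever loaded into this table directly through Hive (rather than written by Spark), the table's row format has to match, since Hive's default text delimiter is ^A rather than a comma. A hypothetical variant of the DDL above, shown only as a sketch:

```sql
-- Sketch: same columns as the question's table, but declared as
-- comma-delimited text so raw CSV files parse correctly in Hive.
CREATE TABLE users (
  first_name STRING,
  last_name  STRING,
  zip_code   STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```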
Then, in a scala REPL session on the master:
val connectionUrl = "jdbc:hive2://x.y.z.t:10000/users?user=blah&password="
val userCsvFile = sc.textFile("/home/blah/Downloads/Users4.csv")
case class User(first_name:String, last_name:String, work_zip:String)
val users = userCsvFile.map(_.split(",")).map(l => User(l(0), l(1), l(2)))
val usersDf = sqlContext.createDataFrame(users)
usersDf.count() // 4
usersDf.schema // res92: org.apache.spark.sql.types.StructType = StructType(StructField(first_name,StringType,true), StructField(last_name,StringType,true), StructField(work_zip,StringType,true))
usersDf.insertIntoJDBC(connectionUrl,"users",true)
OR
usersDf.createJDBCTable(connectionUrl, "users", true) // w/o beeline creation
OR
val properties = new java.util.Properties
properties.setProperty("user", "blah")
properties.setProperty("password", "blah")
val connectionUrl = "jdbc:hive2://172.16.3.10:10000"
contactsDf.write.jdbc(connectionUrl,"contacts", properties)
each of which throws:
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
java.sql.SQLException: org.apache.spark.sql.AnalysisException: cannot recognize input near 'TEXT' ',' 'last_name' in column type; line 1 pos
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:296)
at org.apache.hive.jdbc.HiveStatement.executeUpdate(HiveStatement.java:406)
at org.apache.hive.jdbc.HivePreparedStatement.executeUpdate(HivePreparedStatement.java:119)
at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:275)
at org.apache.spark.sql.DataFrame.insertIntoJDBC(DataFrame.scala:1629)
Any ideas where I'm going wrong? Is this version really capable of writing to JDBC from a DataFrame?
Thanks for any help!
Jon
Answer 0 (score: 1)
After much searching, this now works; you can do it in the REPL:
import org.apache.spark.sql.SaveMode
contactsDf.saveAsTable("contacts", SaveMode.Overwrite)
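Note that the two-argument DataFrame.saveAsTable(name, mode) is deprecated in Spark 1.5 (hence the deprecation warning seen in the question). A sketch of the equivalent non-deprecated DataFrameWriter form, assuming the spark-shell's sqlContext is a HiveContext (which it is in a Spark build with Hive support) and contactsDf is the DataFrame from the question:

```scala
// Sketch: same effect as contactsDf.saveAsTable("contacts", SaveMode.Overwrite),
// expressed through the non-deprecated DataFrameWriter API.
import org.apache.spark.sql.SaveMode

contactsDf.write
  .mode(SaveMode.Overwrite)   // replace the table if it already exists
  .saveAsTable("contacts")    // writes to the Hive metastore, not over JDBC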
I also configured $SPARK_INSTALL_LOC/conf/hive-site.xml as follows:
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby:;databaseName=metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.EmbeddedDriver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
<property>
<name>hive.metastore.warehouse.dir</name>
<value>/user/hive-warehouse</value>
<description>Where to store metastore data</description>
</property>
</configuration>
One more key point: because of Derby's thread-lock limitations when Derby is used as the Hive backing database, you cannot (at least with the way I configured it) run the Thrift JDBC server and the REPL at the same time. However, if the metastore were reconfigured to use Postgres or MySQL etc., simultaneous access might be possible.
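A sketch of what that reconfiguration might look like, replacing the Derby properties in hive-site.xml above with a MySQL-backed metastore. The property keys are standard Hive metastore settings; the host, database name, and credentials are placeholders, and the MySQL JDBC driver jar would also need to be on the classpath:

```xml
<!-- Sketch: hive-site.xml fragment for a MySQL-backed metastore.
     Host, database, user, and password below are placeholders. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://mysql-host:3306/metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hiveuser</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hivepassword</value>
</property>
```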