java.lang.RuntimeException: com.datastax.bdp.fs.model.NoSuchFileException: File not found: /tmp/hive/

Date: 2018-05-21 08:47:08

Tags: scala apache-spark dataframe datastax-enterprise spark-cassandra-connector

I have the following code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SaveMode, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

def main(args: Array[String]) {

  val conf = new SparkConf()
    .setAppName("Fleet")
    .set("spark.executor.memory", "1g")
    .set("spark.driver.memory", "2g")
    .set("spark.submit.deployMode", "cluster")
    .set("spark.executor.instances", "4")
    .set("spark.executor.cores", "3")
    .set("spark.cores.max", "12")
    .set("spark.driver.cores", "4")
    .set("spark.ui.port", "4040")
    .set("spark.streaming.backpressure.enabled", "true")
    .set("spark.streaming.kafka.maxRatePerPartition", "30")

  val spark = SparkSession
    .builder
    .appName("Fleet")
    .config("spark.cassandra.connection.host", "192.168.0.40")
    .config("spark.cassandra.connection.port", "9042")
    .config("spark.submit.deployMode", "cluster")
    .master("local[*]")
    .getOrCreate()

  val sc = SparkContext.getOrCreate(conf)
  val ssc = new StreamingContext(sc, Seconds(10))
  val sqlContext = new SQLContext(sc)
  import sqlContext.implicits._  // required for rdd.toDF() below

  // one receiver thread for the "historyfleet" topic, consumed via the ZooKeeper-based receiver API
  val topics = Map("historyfleet" -> 1)
  val kafkaStream = KafkaUtils.createStream(ssc, "192.168.0.40:2181", "fleetgroup", topics)

  kafkaStream.foreachRDD { rdd =>
    val dfs = rdd.toDF()
    dfs.show()  // show() already prints the DataFrame, no println needed
    dfs.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "test", "keyspace" -> "test_db"))
      .mode(SaveMode.Append)
      .save()
  }
  ssc.start()
  ssc.awaitTermination()
}

I can run this program from Eclipse on my local machine, but when I try to submit it to the cluster with spark-submit, it fails with the following error:

ERROR 2018-05-21 13:00:27,009 org.apache.spark.deploy.DseSparkSubmitBootstrapper: Failed to start or submit Spark application
java.lang.RuntimeException: com.datastax.bdp.fs.model.NoSuchFileException: File not found: /tmp/hive/
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522) ~[hive-exec-1.2.1.spark2.jar:1.2.1.spark2]
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:189) ~[spark-hive_2.11-2.0.2.16.jar:2.0.2.16]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[na:1.8.0_161]
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[na:1.8.0_161]
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[na:1.8.0_161]
at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[na:1.8.0_161]
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:258) ~[spark-hive_2.11-2.0.2.16.jar:2.0.2.16]
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:359) ~[spark-hive_2.11-2.0.2.16.jar:2.0.2.16]

My goal is to read records from the Kafka stream and push the data to Cassandra. Thanks.

1 Answer:

Answer 0 (score: 0):

You need to increase the replication factor (RF) of the dsefs keyspace to a value greater than 1. In addition, the dsefs keyspace, like any other keyspace, works best with NetworkTopologyStrategy. Here is a command that switches the strategy and sets RF = 3:

ALTER KEYSPACE dsefs WITH replication = {'class':'NetworkTopologyStrategy', '<YOUR DC HERE>': '3'}
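The placeholder <YOUR DC HERE> is your data center name, which you can look up with nodetool status. Afterwards you can confirm the change from cqlsh (a quick check, assuming cqlsh is available on the node):

nodetool status
cqlsh -e "DESCRIBE KEYSPACE dsefs;"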

After altering the keyspace, you need to run nodetool repair on all nodes:

nodetool repair dsefs
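Repair runs per node, so on a multi-node cluster the same command has to be executed on every host. A minimal sketch, assuming passwordless SSH and hypothetical host names node1 through node3:

for host in node1 node2 node3; do
  ssh "$host" nodetool repair dsefs
done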

Beyond that, you can remove /tmp/hive from DSEFS and recreate it with:
dse fs
mkdir -p -m 733 /tmp/hive
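A complete DSEFS shell session that removes the stale directory first and then recreates it might look like this (a sketch, assuming the rm -r and exit commands of the DSEFS shell are available in your DSE version and that the existing /tmp/hive can simply be deleted):

dse fs
rm -r /tmp/hive
mkdir -p -m 733 /tmp/hive
exit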