Connection error when writing to Cassandra with pyspark

Asked: 2015-09-15 06:48:20

Tags: cassandra pyspark

I am trying to write data to Cassandra from the Pyspark shell with the following command:

dataframe_name.write.format("org.apache.spark.sql.cassandra").options(table="table_name", keyspace="keyspace_name").save(mode="append")
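
For context, a minimal sketch of the full setup (the connector package version, contact point, and sample rows here are illustrative placeholders, not exact values from this setup):

# Launch the shell with the connector on the classpath and a Cassandra
# contact point, e.g. (version and host are placeholders):
#   bin/pyspark --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0 \
#               --conf spark.cassandra.connection.host=127.0.0.1

# Inside the pyspark shell, sqlContext already exists. Build a small
# DataFrame (hypothetical sample rows) whose columns match the table
# schema, then append it to the Cassandra table.
rows = [("id1", 42), ("id2", 43)]
df = sqlContext.createDataFrame(rows, ["id", "value"])

df.write.format("org.apache.spark.sql.cassandra") \
    .options(table="table_name", keyspace="keyspace_name") \
    .save(mode="append")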

I tried the same operation from a python shell on the pyspark machine, and it works fine. From the Pyspark shell, however, I get the following error:

15/09/15 06:37:18 ERROR DAGScheduler: Failed to update accumulators for ResultTask(2, 198)
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:579)
at java.net.Socket.connect(Socket.java:528)
at java.net.Socket.<init>(Socket.java:425)
at java.net.Socket.<init>(Socket.java:208)
at org.apache.spark.api.python.PythonAccumulatorParam.openSocket(PythonRDD.scala:813)
at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:828)
at org.apache.spark.api.python.PythonAccumulatorParam.addInPlace(PythonRDD.scala:798)
at org.apache.spark.Accumulable.$plus$plus$eq(Accumulators.scala:80)
at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:342)
at org.apache.spark.Accumulators$$anonfun$add$2.apply(Accumulators.scala:337)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.Accumulators$.add(Accumulators.scala:337)
at org.apache.spark.scheduler.DAGScheduler.updateAccumulators(DAGScheduler.scala:945)
at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1014)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1454)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

1 Answer:

Answer 0 (score: 0):

This looks like a network problem inside Spark. Without the exact versions of Spark and the Spark Cassandra Connector it is hard to diagnose. My guess is that the driver is set up incorrectly and cannot communicate with the executors. Are you sure the executors can reach your driver application, and vice versa?
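
If the driver's address is the issue, one quick check is to pin it explicitly when the context is created. A minimal sketch, assuming a standalone script and placeholder host/port values (replace them with an address and port on the driver machine that the executors can actually reach):

from pyspark import SparkConf, SparkContext

# Placeholder values: the driver must advertise an address that is
# routable from the executors, and the port must be open between them.
conf = (SparkConf()
        .setAppName("cassandra-write-test")
        .set("spark.driver.host", "10.0.0.5")
        .set("spark.driver.port", "51000"))

sc = SparkContext(conf=conf)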

You can always test by setting --master local to see whether the problem still exists when the network is taken out of the equation.
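
For example, a launch along these lines runs everything in a single local process (the connector package version and contact point are assumptions to adapt to your setup):

bin/pyspark --master "local[*]" \
    --packages com.datastax.spark:spark-cassandra-connector_2.10:1.4.0 \
    --conf spark.cassandra.connection.host=127.0.0.1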