Everything works fine when I run my Spark script:
from pyspark import SparkConf, SparkContext
# `sc` is assumed to be the SparkContext that the pyspark shell provides at startup.
es_read_conf = {"es.nodes": "elasticsearch", "es.port": "9200", "es.resource": "secse/monologue"}
es_write_conf = {"es.nodes": "elasticsearch", "es.port": "9200", "es.resource": "secse/monologue"}
es_rdd = sc.newAPIHadoopRDD(
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_read_conf)
doc = es_rdd.map(lambda a: a[1])
That is, until I try to take a single object out of the documents:
doc.take(1)
15/09/24 15:30:36 INFO SparkContext: Starting job: runJob at PythonRDD.scala:361
15/09/24 15:30:36 INFO DAGScheduler: Got job 3 (runJob at PythonRDD.scala:361) with 1 output partitions
15/09/24 15:30:36 INFO DAGScheduler: Final stage: ResultStage 3(runJob at PythonRDD.scala:361)
15/09/24 15:30:36 INFO DAGScheduler: Parents of final stage: List()
15/09/24 15:30:36 INFO DAGScheduler: Missing parents: List()
15/09/24 15:30:36 INFO DAGScheduler: Submitting ResultStage 3 (PythonRDD[9] at RDD at PythonRDD.scala:43), which has no missing parents
15/09/24 15:30:36 INFO MemoryStore: ensureFreeSpace(5496) called with curMem=866187, maxMem=556038881
15/09/24 15:30:36 INFO MemoryStore: Block broadcast_9 stored as values in memory (estimated size 5.4 KB, free 529.4 MB)
15/09/24 15:30:36 INFO MemoryStore: ensureFreeSpace(3326) called with curMem=871683, maxMem=556038881
15/09/24 15:30:36 INFO MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 3.2 KB, free 529.4 MB)
15/09/24 15:30:36 INFO BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:54195 (size: 3.2 KB, free: 530.2 MB)
15/09/24 15:30:36 INFO SparkContext: Created broadcast 9 from broadcast at DAGScheduler.scala:850
15/09/24 15:30:36 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 3 (PythonRDD[9] at RDD at PythonRDD.scala:43)
15/09/24 15:30:36 INFO TaskSchedulerImpl: Adding task set 3.0 with 1 tasks
15/09/24 15:30:36 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 3, localhost, ANY, 23112 bytes)
15/09/24 15:30:36 INFO Executor: Running task 0.0 in stage 3.0 (TID 3)
15/09/24 15:30:36 INFO NewHadoopRDD: Input split: ShardInputSplit [node=[OQfqJqLGQje3obkkKRFAag/Hargen the Measurer|172.17.0.1:9200],shard=0]
15/09/24 15:30:36 WARN EsInputFormat: Cannot determine task id...
15/09/24 15:30:37 INFO PythonRDD: Times: total = 483, boot = 285, init = 197, finish = 1
15/09/24 15:30:37 INFO Executor: Finished task 0.0 in stage 3.0 (TID 3). 3561 bytes result sent to driver
15/09/24 15:30:37 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 518 ms on localhost (1/1)
15/09/24 15:30:37 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
15/09/24 15:30:37 INFO DAGScheduler: ResultStage 3 (runJob at PythonRDD.scala:361) finished in 0.521 s
15/09/24 15:30:37 INFO DAGScheduler: Job 3 finished: runJob at PythonRDD.scala:361, took 0.559442 s
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/lucas/spark/spark/python/pyspark/rdd.py", line 1299, in take
res = self.context.runJob(self, takeUpToNumLeft, p)
File "/home/lucas/spark/spark/python/pyspark/context.py", line 916, in runJob
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
File "/home/lucas/spark/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/home/lucas/spark/spark/python/pyspark/sql/utils.py", line 36, in deco
return f(*a, **kw)
File "/home/lucas/spark/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: java.net.BindException: Cannot assign requested address
at java.net.PlainSocketImpl.socketBind(Native Method)
at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:376)
at java.net.ServerSocket.bind(ServerSocket.java:376)
at java.net.ServerSocket.<init>(ServerSocket.java:237)
at org.apache.spark.api.python.PythonRDD$.serveIterator(PythonRDD.scala:605)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:363)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:745)
I don't know what I'm doing wrong.
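One thing that might help narrow this down: the job itself finishes, and the exception comes from a `ServerSocket` bind inside `PythonRDD.serveIterator`, so the problem may be with a local address rather than with Elasticsearch. The sketch below is only a diagnostic under that assumption. It checks whether a hostname resolves to an address this machine can actually bind. The helper name `can_bind` is made up for illustration, and the guess that Spark binds to `localhost` at that point is an assumption, not something the trace confirms.

```python
import socket


def can_bind(host="localhost"):
    """Hypothetical helper: try to bind an ephemeral TCP port on `host`.

    Returns a tuple (resolved_address, ok, error_message).
    """
    try:
        # Resolve the name the same way a plain IPv4 lookup would.
        addr = socket.gethostbyname(host)
    except socket.gaierror as exc:
        return None, False, str(exc)

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Port 0 asks the OS for any free port, so only the address matters here.
        sock.bind((addr, 0))
        return addr, True, None
    except socket.error as exc:
        # A bind failure here means the address is not usable on this machine.
        return addr, False, str(exc)
    finally:
        sock.close()


# Assumption: "localhost" is the name worth checking first.
print(can_bind("localhost"))
```

If the second element comes back `False`, the hosts file or the container's network setup would be the next place to look. If it comes back `True`, this check rules nothing else out.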