How do I convert a PySpark DF containing a Vector column to an RDD?

Date: 2017-12-01 16:10:37

Tags: python apache-spark pyspark apache-spark-sql spark-dataframe

I have a Spark DataFrame with a column of vectors (a "features" column, the output of VectorAssembler). I'm trying to convert the DF to an RDD so I can run a map operation over the vector elements, but I keep hitting an error whenever I attempt the conversion.

My df looks like this:

+---------+---------------+
|       id|       features|
+---------+---------------+
|103043842| (54,[0],[1.0])|
|103044195|(54,[42],[1.0])|
|103044272|(54,[24],[1.0])|
+---------+---------------+

The schema is:

root
 |-- id: long (nullable = true)
 |-- features: vector (nullable = true)
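
For context, here is a minimal, hypothetical sketch of how a DataFrame with this shape can be built with VectorAssembler (the column names and values are made up; my real vectors have 54 slots):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()

# Toy input columns; the real assembled vectors are 54 elements wide.
toy = spark.createDataFrame(
    [(103043842, 1.0, 0.0), (103044195, 0.0, 1.0)],
    ["id", "f1", "f2"],
)
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
meta_df = assembler.transform(toy).select("id", "features")
meta_df.printSchema()  # features comes out as vector (VectorUDT), matching the schema above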

I'm converting to an RDD with:

df.rdd
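
For reference, a minimal sketch of the kind of conversion and map I'm attempting (the lambda here is only illustrative; in my script the failure actually surfaces at t.take(3) inside transform_meta, as the traceback below shows):

# Illustrative only: turn the DF into an RDD of Rows, map over the
# sparse vector in each row, and pull a few results back to the driver.
t = df.rdd.map(lambda row: (row.id, row.features.toArray()))
print(t.take(3))  # the take() is what triggers the ImportError below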

The main error appears to be a numpy import error, but I do import numpy, since I use it elsewhere in the same script.

ImportError: ('No module named numpy', <function _parse_datatype_json_string at 0x7fb3d66dcf50>, 
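
As an illustrative check (not part of my script), numpy imports fine on the driver, while the full traceback below shows the ImportError being raised inside pyspark/worker.py on an executor, so the import also has to succeed in the executor Python processes (the spark and sc names here are assumed):

import numpy
print(numpy.__version__)                         # driver-side import works

sc = spark.sparkContext                          # assumes a SparkSession named spark
executor_versions = (sc.parallelize(range(2), numSlices=2)
                       .map(lambda _: __import__("numpy").__version__)
                       .collect())               # forces the import on the executors
print(executor_versions)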

The full stack trace is:

17/12/01 16:04:21 ERROR TaskSetManager: Task 0 in stage 24.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "/home/hadoop/spark-playground.py", line 187, in <module>
    transformed_meta = transform_meta(meta_df)
  File "/home/hadoop/spark-playground.py", line 102, in transform_meta
    print(t.take(3))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1343, in take
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 992, in runJob
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 24.0 failed 4 times, most recent failure: Lost task 0.3 in stage 24.0 (TID 1242, ip-172-31-27-17.us-east-2.compute.internal, executor 6): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1339, in takeUpToNumLeft
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/serializers.py", line 144, in load_stream
    yield self._read_with_length(stream)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
    return self.loads(obj)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/serializers.py", line 454, in loads
    return pickle.loads(obj)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/sql/types.py", line 894, in _parse_datatype_json_string
    return _parse_datatype_json_value(json.loads(json_string))
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/sql/types.py", line 911, in _parse_datatype_json_value
    return _all_complex_types[tpe].fromJson(json_value)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/sql/types.py", line 562, in fromJson
    return StructType([StructField.fromJson(f) for f in json["fields"]])
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/sql/types.py", line 428, in fromJson
    _parse_datatype_json_value(json["type"]),
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/sql/types.py", line 913, in _parse_datatype_json_value
    return UserDefinedType.fromJson(json_value)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/sql/types.py", line 705, in fromJson
    m = __import__(pyModule, globals(), locals(), [pyClass])
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/ml/__init__.py", line 22, in <module>
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/ml/base.py", line 21, in <module>
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/ml/param/__init__.py", line 26, in <module>
ImportError: ('No module named numpy', <function _parse_datatype_json_string at 0x7fb3d66dcf50>, (u'{"type":"struct","fields":[{"name":"id","type":"long","nullable":true,"metadata":{}},{"name":"features","type":{"type":"udt","class":"org.apache.spark.ml.linalg.VectorUDT","pyClass":"pyspark.ml.linalg.VectorUDT","sqlType":{"type":"struct","fields":[{"name":"type","type":"byte","nullable":false,"metadata":{}},{"name":"size","type":"integer","nullable":true,"metadata":{}},{"name":"indices","type":{"type":"array","elementType":"integer","containsNull":false},"nullable":true,"metadata":{}},{"name":"values","type":{"type":"array","elementType":"double","containsNull":false},"nullable":true,"metadata":{}}]}},"nullable":true,"metadata":{"ml_attr":{"attrs":{"binary":[{"idx":0,"name":"2802"},{"idx":1,"name":"521"},{"idx":2,"name":"2864"},{"idx":3,"name":"2770"},{"idx":4,"name":"3026"},{"idx":5,"name":"2744"},{"idx":6,"name":"3042"},{"idx":7,"name":"6769"},{"idx":8,"name":"3084"},{"idx":9,"name":"2742"},{"idx":10,"name":"7549"},{"idx":11,"name":"2986"},{"idx":12,"name":"3002"},{"idx":13,"name":"3082"},{"idx":14,"name":"2863"},{"idx":15,"name":"2562"},{"idx":16,"name":"137"},{"idx":17,"name":"2862"},{"idx":18,"name":"335486"},{"idx":19,"name":"2746"},{"idx":20,"name":"6771"},{"idx":21,"name":"6767"},{"idx":22,"name":"541"},{"idx":23,"name":"2987"},{"idx":24,"name":"2767"},{"idx":25,"name":"2745"},{"idx":26,"name":"6772"},{"idx":27,"name":"2985"},{"idx":28,"name":"26089"},{"idx":29,"name":"6349"},{"idx":30,"name":"964"},{"idx":31,"name":"136"},{"idx":32,"name":"2763"},{"idx":33,"name":"2747"},{"idx":34,"name":"2766"},{"idx":35,"name":"6351"},{"idx":36,"name":"2765"},{"idx":37,"name":"7554"},{"idx":38,"name":"3085"},{"idx":39,"name":"1382"},{"idx":40,"name":"3003"},{"idx":41,"name":"7553"},{"idx":42,"name":"2769"},{"idx":43,"name":"2762"},{"idx":44,"name":"7555"},{"idx":45,"name":"3023"},{"idx":46,"name":"6770"},{"idx":47,"name":"6350"},{"idx":48,"name":"2768"},{"idx":49,"name":"7551"},{"idx":50,"name":"2764"},{"idx":51,"name":"966"},{"idx":52,"name":"4822"},{"idx":53,"name":"__unknown"}]},"num_attrs":54}}}]}',))

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1690)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1678)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1677)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1677)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:855)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1905)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1860)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1849)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:671)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
    at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:446)
    at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/serializers.py", line 268, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1339, in takeUpToNumLeft
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/serializers.py", line 144, in load_stream
    yield self._read_with_length(stream)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
    return self.loads(obj)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/serializers.py", line 454, in loads
    return pickle.loads(obj)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/sql/types.py", line 894, in _parse_datatype_json_string
    return _parse_datatype_json_value(json.loads(json_string))
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/sql/types.py", line 911, in _parse_datatype_json_value
    return _all_complex_types[tpe].fromJson(json_value)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/sql/types.py", line 562, in fromJson
    return StructType([StructField.fromJson(f) for f in json["fields"]])
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/sql/types.py", line 428, in fromJson
    _parse_datatype_json_value(json["type"]),
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/sql/types.py", line 913, in _parse_datatype_json_value
    return UserDefinedType.fromJson(json_value)
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/sql/types.py", line 705, in fromJson
    m = __import__(pyModule, globals(), locals(), [pyClass])
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/ml/__init__.py", line 22, in <module>
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/ml/base.py", line 21, in <module>
  File "/mnt/yarn/usercache/hadoop/appcache/application_1512140843138_0020/container_1512140843138_0020_01_000007/pyspark.zip/pyspark/ml/param/__init__.py", line 26, in <module>
ImportError: ('No module named numpy', <function _parse_datatype_json_string at 0x7fb3d66dcf50>, (u'{"type":"struct","fields":[{"name":"id","type":"long","nullable":true,"metadata":{}},{"name":"features","type":{"type":"udt","class":"org.apache.spark.ml.linalg.VectorUDT","pyClass":"pyspark.ml.linalg.VectorUDT","sqlType":{"type":"struct","fields":[{"name":"type","type":"byte","nullable":false,"metadata":{}},{"name":"size","type":"integer","nullable":true,"metadata":{}},{"name":"indices","type":{"type":"array","elementType":"integer","containsNull":false},"nullable":true,"metadata":{}},{"name":"values","type":{"type":"array","elementType":"double","containsNull":false},"nullable":true,"metadata":{}}]}},"nullable":true,"metadata":{"ml_attr":{"attrs":{"binary":[{"idx":0,"name":"2802"},{"idx":1,"name":"521"},{"idx":2,"name":"2864"},{"idx":3,"name":"2770"},{"idx":4,"name":"3026"},{"idx":5,"name":"2744"},{"idx":6,"name":"3042"},{"idx":7,"name":"6769"},{"idx":8,"name":"3084"},{"idx":9,"name":"2742"},{"idx":10,"name":"7549"},{"idx":11,"name":"2986"},{"idx":12,"name":"3002"},{"idx":13,"name":"3082"},{"idx":14,"name":"2863"},{"idx":15,"name":"2562"},{"idx":16,"name":"137"},{"idx":17,"name":"2862"},{"idx":18,"name":"335486"},{"idx":19,"name":"2746"},{"idx":20,"name":"6771"},{"idx":21,"name":"6767"},{"idx":22,"name":"541"},{"idx":23,"name":"2987"},{"idx":24,"name":"2767"},{"idx":25,"name":"2745"},{"idx":26,"name":"6772"},{"idx":27,"name":"2985"},{"idx":28,"name":"26089"},{"idx":29,"name":"6349"},{"idx":30,"name":"964"},{"idx":31,"name":"136"},{"idx":32,"name":"2763"},{"idx":33,"name":"2747"},{"idx":34,"name":"2766"},{"idx":35,"name":"6351"},{"idx":36,"name":"2765"},{"idx":37,"name":"7554"},{"idx":38,"name":"3085"},{"idx":39,"name":"1382"},{"idx":40,"name":"3003"},{"idx":41,"name":"7553"},{"idx":42,"name":"2769"},{"idx":43,"name":"2762"},{"idx":44,"name":"7555"},{"idx":45,"name":"3023"},{"idx":46,"name":"6770"},{"idx":47,"name":"6350"},{"idx":48,"name":"2768"},{"idx":49,"name":"7551"},{"idx":50,"name":"2764"},{"idx":51,"name":"966"},{"idx":52,"name":"4822"},{"idx":53,"name":"__unknown"}]},"num_attrs":54}}}]}',))

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more

0 Answers
