Spark - How to convert a DataFrame or RDD to a Spark matrix or numpy array without using pandas

Time: 2017-01-12 15:13:25

Tags: numpy apache-spark pyspark spark-dataframe bigdata

I have 20 TB of data. I tried to convert my Spark DataFrame to a Spark matrix as follows (following a solution found here). My dataframe looks like this:

+-------+--------------+---------------------+
|goodsID|customer_group|customer_phone_number|
+-------+--------------+---------------------+
|    123|         XXXXX|             XXXXXXXX|
|    432|         YYYYY|             XXXXXXXX|
+-------+--------------+---------------------+

from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

mat = IndexedRowMatrix(mydataframe.map(lambda row: IndexedRow(*row)))
mat.numRows()
mat.numCols()

But it gives me the following error:

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/test/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 111, in main
    process()
  File "/home/test/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py", line 106, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/test/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/home/test/spark-1.6.0-bin-hadoop2.6/python/pyspark/rdd.py", line 1293, in takeUpToNumLeft
    yield next(iterator)
  File "<stdin>", line 1, in <lambda>
TypeError: __init__() takes exactly 3 arguments (4 given)

So my questions are:

  1. How can I achieve this in Spark?
  2. Also, how do I convert my dataframe to a numpy array?
  3. Is using pandas with Spark really that bad?

1 Answer:

Answer 0 (score: 0)

  • The types of your input data are probably wrong: vector values must be Double (Python float), while the columns shown here look like strings.

  • You are not using IndexedRow correctly. It takes two arguments, an index and a vector, which is why calling IndexedRow(*row) with three columns fails with __init__() takes exactly 3 arguments (4 given). Assuming the types were correct (a fuller sketch follows this list):

    from pyspark.mllib.linalg import Vectors

    mat = IndexedRowMatrix(mydataframe.map(
        lambda row: IndexedRow(row[0], Vectors.dense(row[1:]))))

  • Is pandas bad? For 20 TB of data? It is simply not the best choice, as there exist distributed Python libraries with similar API.
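
Putting the first two points together, here is a minimal end-to-end sketch. It assumes the column names from the question, that goodsID is meant to serve as the row index, and that the string columns can actually be cast to numbers; adjust it to the real schema.

    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

    # Cast every column to double first -- vector values must be floats.
    # (Assumes the string columns really hold numeric data.)
    casted = mydataframe.select(
        mydataframe["goodsID"].cast("double"),
        mydataframe["customer_group"].cast("double"),
        mydataframe["customer_phone_number"].cast("double"))

    # IndexedRow(index, vector): goodsID becomes the row index, the
    # remaining columns the dense vector.
    mat = IndexedRowMatrix(casted.map(
        lambda row: IndexedRow(row[0], Vectors.dense(row[1:]))))

    print(mat.numRows(), mat.numCols())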
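
As for question 2, turning a dataframe into a local numpy array means collecting it to the driver, which only works when the data fits in driver memory -- certainly not at 20 TB. For a small frame, a sketch (reusing the hypothetical casted frame above) could be:

    import numpy as np

    # collect() materializes the whole frame on the driver: fine for a
    # small sample, impossible for 20 TB of data.
    arr = np.array(casted.collect())
    print(arr.shape)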