Question

I have a DataFrame that comes from a SQL query:

df1 = sqlContext.sql("select * from table_test")

I need to convert this DataFrame to libsvm format so that it can be provided as input to

pyspark.ml.classification.LogisticRegression

I tried the following, but because I am using Spark 1.5.2, it fails with this error:

df1.write.format("libsvm").save("data/foo")
Failed to load class for data source: libsvm

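I then wanted to use MLUtils.loadLibSVMFile instead. I am behind a firewall and cannot pip install it directly, so I downloaded the files, scp-ed them over, and installed them manually. Everything seemed to work, but I still get the following error: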

import org.apache.spark.mllib.util.MLUtils
No module named org.apache.spark.mllib.util.MLUtils

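Question 1: Is my approach above a step in the right direction for converting a DataFrame to libsvm format? Question 2: If the answer to Question 1 is "yes", how do I get MLUtils working? If "no", what is the best way to convert a DataFrame to libsvm format?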

Answer 1

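I would do it like this (this is just an example with an arbitrary DataFrame; I don't know how your df1 is built, so the focus is on the data transformation).

This is how I convert a DataFrame to libsvm format: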

# ... your previous imports

from pyspark.mllib.util import MLUtils
from pyspark.mllib.regression import LabeledPoint

# A DATAFRAME
>>> df.show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
|  1|  3|  6|  
|  4|  5| 20|
|  7|  8|  8|
+---+---+---+

# FROM DATAFRAME TO RDD
>>> c = df.rdd # this converts your DataFrame into an RDD of Row objects
>>> print (c.take(3))
[Row(_1=1, _2=3, _3=6), Row(_1=4, _2=5, _3=20), Row(_1=7, _2=8, _3=8)]

# FROM RDD OF TUPLE TO A RDD OF LABELEDPOINT
>>> d = c.map(lambda line: LabeledPoint(line[0], list(line[1:]))) # arbitrary mapping: first column is the label, the rest are features
>>> print (d.take(3))
[LabeledPoint(1.0, [3.0,6.0]), LabeledPoint(4.0, [5.0,20.0]), LabeledPoint(7.0, [8.0,8.0])]

# SAVE AS LIBSVM
>>> MLUtils.saveAsLibSVMFile(d, "/your/Path/nameFolder/")

What you will see in the "/your/Path/nameFolder/part-0000*" files is:

1.0 1:3.0 2:6.0
4.0 1:5.0 2:20.0
7.0 1:8.0 2:8.0
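To make the file layout above concrete, here is a minimal pure-Python sketch of how one record maps to one libsvm line: the label comes first, followed by `index:value` pairs with 1-based indices. The helper name `to_libsvm_line` is hypothetical and only illustrates the format; it is not how MLUtils.saveAsLibSVMFile is implemented.

```python
def to_libsvm_line(label, features):
    """Format one (label, features) record as a libsvm text line.

    Hypothetical helper for illustration only; feature indices are 1-based.
    """
    parts = [str(float(label))]
    for index, value in enumerate(features, start=1):
        parts.append("{}:{}".format(index, float(value)))
    return " ".join(parts)


# The three rows of the example DataFrame: first column is the label.
rows = [(1, [3, 6]), (4, [5, 20]), (7, [8, 8])]
for label, features in rows:
    print(to_libsvm_line(label, features))
```

Running it prints the same three lines as the part files shown above.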

See the LabeledPoint documentation for details.
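Since the goal stated in the question is to train pyspark.ml.classification.LogisticRegression, the saved files still have to be read back. The following is a rough sketch for Spark 1.x under several assumptions: `sc` is an existing SparkContext, a SQLContext has already been created (as in the question) so that `toDF()` is available on RDDs, and the labels are suitable for classification (for example 0 and 1, unlike the arbitrary labels in the toy data). The function name `train_from_libsvm` and the `max_iter` default are hypothetical.

```python
def train_from_libsvm(sc, path, max_iter=10):
    """Load libsvm files written by MLUtils.saveAsLibSVMFile and fit a model.

    Sketch for Spark 1.x only; `sc` is assumed to be a live SparkContext and
    a SQLContext is assumed to exist so that RDD.toDF() works.
    """
    # Imported inside the function so this file can be loaded without Spark.
    from pyspark.mllib.util import MLUtils
    from pyspark.ml.classification import LogisticRegression

    # RDD of LabeledPoint, one per line of the libsvm files under `path`.
    points = MLUtils.loadLibSVMFile(sc, path)

    # DataFrame with "label" and "features" columns, the default column
    # names that LogisticRegression looks for.
    training = points.toDF()

    return LogisticRegression(maxIter=max_iter).fit(training)
```

If the LabeledPoint RDD `d` is already in memory, the round trip through a file may be unnecessary, because `d.toDF()` should produce the same two columns directly.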

Answer 2

I had to do it this way to make it work:

D.map(lambda line: LabeledPoint(line[0],[line[1],line[2]]))

to convert the DataFrame to libsvm format.
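Listing each feature column by hand, as in the mapping above, stops scaling once the table has more than a few columns. A possible generalization is sketched below, assuming every column is numeric; the helper names `split_label_features` and `to_labeled_points` and the `label_index` parameter are hypothetical and not part of PySpark.

```python
def split_label_features(row, label_index=0):
    """Split a row-like sequence into (label, features) as floats.

    Hypothetical helper; `label_index` must be a non-negative position and
    every value in the row is assumed to be numeric.
    """
    values = [float(v) for v in row]
    label = values[label_index]
    features = values[:label_index] + values[label_index + 1:]
    return label, features


def to_labeled_points(rdd, label_index=0):
    """Map an RDD of Rows (e.g. df.rdd) to an RDD of LabeledPoint."""
    # Imported inside the function so the pure helper works without Spark.
    from pyspark.mllib.regression import LabeledPoint

    def convert(row):
        label, features = split_label_features(row, label_index)
        return LabeledPoint(label, features)

    return rdd.map(convert)
```

With the example DataFrame, `to_labeled_points(df.rdd)` would then take the place of the hand-written lambda, and the result can be passed to MLUtils.saveAsLibSVMFile as before.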
