What are the extra values in the DenseVector output when casting to StringType?
The following should be reproducible.
import pyspark.sql
spark = pyspark.sql.SparkSession.builder.getOrCreate()
spark.version
# u'2.2.0.cloudera1'
from pyspark.ml.linalg import DenseVector
import pyspark.sql.functions as F
import pyspark.sql.types as T
testdf = spark.createDataFrame(
    [(DenseVector([2, 3]),),
     (DenseVector([4, 5]),),
     (DenseVector([6, 7]),)],
    ['DenseVectors'])
testdf \
    .withColumn('DenseVector as String',
                F.col('DenseVectors').cast(T.StringType())) \
    .show(truncate=False)
#+------------+----------------------------------------------------------+
#|DenseVectors|DenseVector as String |
#+------------+----------------------------------------------------------+
#|[2.0,3.0] |[6,1,0,0,2800000020,2,0,4000000000000000,4008000000000000]|
#|[4.0,5.0] |[6,1,0,0,2800000020,2,0,4010000000000000,4014000000000000]|
#|[6.0,7.0] |[6,1,0,0,2800000020,2,0,4018000000000000,401c000000000000]|
#+------------+----------------------------------------------------------+
Answer 0 (score: 3)
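These are not extra values. Vectors are implemented as a UserDefinedType
(org.apache.spark.mllib.linalg.VectorUDT
/ org.apache.spark.ml.linalg.VectorUDT
- with Spark 2 you should generally use the latter), and there is no useful cast implementation for them. It would be nice to have one, so you could open a JIRA ticket if none exists yet.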
What you see is just a reflection of the vector's internal structure, which is not human-readable.
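As a rough illustration of that point, the last two entries in each row of the cast output appear to be the raw 64-bit patterns of the vector's doubles, printed in hexadecimal. The sketch below decodes them with Python's `struct` module. It assumes Python 3 (for `bytes.fromhex`) and big-endian word order; the helper name `hex_word_to_double` is hypothetical, and the leading entries of each row are not interpreted here.

```python
import struct

def hex_word_to_double(word):
    """Interpret a hex string of up to 16 digits as a big-endian IEEE 754 double."""
    # The cast output drops leading zeros (e.g. "0"), so pad back to 16 hex digits.
    return struct.unpack('>d', bytes.fromhex(word.zfill(16)))[0]

# Last two entries of each row of the cast output shown in the question.
rows = [
    ('4000000000000000', '4008000000000000'),
    ('4010000000000000', '4014000000000000'),
    ('4018000000000000', '401c000000000000'),
]

decoded = [[hex_word_to_double(w) for w in row] for row in rows]
print(decoded)  # [[2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
```

The decoded values match the DenseVectors column, which is consistent with the cast exposing the stored representation rather than adding any data.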
To get readable output, use a udf:
@F.udf
def to_string(v):
    # str() on a DenseVector yields its readable form, e.g. "[2.0,3.0]"
    return str(v)
testdf.select(to_string("DenseVectors")).show()
# +-----------------------+
# |to_string(DenseVectors)|
# +-----------------------+
# | [2.0,3.0]|
# | [4.0,5.0]|
# | [6.0,7.0]|
# +-----------------------+