Question

(Using Apache Spark 1.6.0) Hi everyone,

I have a SparseVector (basically defined by two numpy arrays, values and indices). I want to get the highest values and their indices, which I do with:

r = df.map(lambda row: Row(**dict(row.asDict(), top=f(vec))))

where the function f returns [ [sorted_indices], [sorted_values] ] as follows:

def f(v):
    m, i = zip(*sorted(zip(v.values, v.indices), reverse=True))
    m = [ float(j) for j in m]
    i = [ int(j) for j in i]
    return [i, m]
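To make the return shape concrete, here is a minimal sketch that runs f without Spark. FakeVector is a hypothetical stand-in that only mimics the .indices and .values attributes of a SparseVector, and the sample numbers are made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for a SparseVector: it only exposes the two
# attributes that f reads, .indices and .values.
FakeVector = namedtuple("FakeVector", ["indices", "values"])

def f(v):
    # Sort (value, index) pairs by value, largest first, then split them again.
    m, i = zip(*sorted(zip(v.values, v.indices), reverse=True))
    m = [float(j) for j in m]
    i = [int(j) for j in i]
    return [i, m]

vec = FakeVector(indices=[0, 2, 5], values=[1.0, 3.0, 0.5])
result = f(vec)
print(result)
```

Note that the outer list mixes two element types: the first inner list holds ints and the second holds floats.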

At this point r is a pyspark.rdd.PipelinedRDD, and I can check that my values are fine using, e.g.

r.first().top[1]

The problem appears when I try to convert it to a DataFrame:

df2 = r.toDF()

Then my values are just None, i.e.

df2.first().top[1] # i.e. the highest values of the first Vector

shows None.

So it looks like the toDF() function really ruins my data. It would be very strange if Spark couldn't handle built-in float types.

Any ideas? Thx

Answer 1

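It doesn't work because the types don't match. If you check the schema, you'll see that the top column is represented as array<array<bigint>>, while the values should be array<float>. The element type is apparently inferred from the first inner list (the integer indices), so the floats in the second inner list don't fit and come back as None. Your function should return an object that can be converted to a struct column, struct<array<bigint>, array<float>>. An obvious choice is either a tuple or a Row: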

from pyspark.sql import Row

def f(v):
    m, i = zip(*sorted(zip(v.values, v.indices), reverse=True))
    m = [ float(j) for j in m]
    i = [ int(j) for j in i]
    return Row(indices=i, values=m)
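The tuple option mentioned above can be sketched the same way. This is an illustrative variant, not the code from the answer: f_tuple and FakeVector are hypothetical names, and FakeVector only mimics the .indices and .values attributes of a SparseVector so the sketch runs without Spark.

```python
from collections import namedtuple

# Hypothetical stand-in for a SparseVector (only .indices and .values).
FakeVector = namedtuple("FakeVector", ["indices", "values"])

def f_tuple(v):
    # Sort (value, index) pairs by value, largest first.
    pairs = sorted(zip(v.values, v.indices), reverse=True)
    indices = [int(i) for _, i in pairs]
    values = [float(x) for x, _ in pairs]
    # A tuple keeps the two lists as separate fields with their own types,
    # instead of two elements of one homogeneous array.
    return (indices, values)

top = f_tuple(FakeVector(indices=[0, 2, 5], values=[1.0, 3.0, 0.5]))
empty = f_tuple(FakeVector(indices=[], values=[]))
print(top)
print(empty)
```

Building the two lists from the sorted pairs, rather than unpacking zip(*...), also means a vector with no stored entries yields two empty lists instead of raising an error. With a plain tuple the struct fields have no explicit names, so Row(indices=..., values=...) is the better choice if you want to refer to them by name.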

Also, if the vector is already in a DataFrame, it is better to use a UDF here:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, DoubleType, IntegerType, StructField, StructType

schema = StructType([
    StructField("indices", ArrayType(IntegerType())), 
    StructField("values",  ArrayType(DoubleType()))
])

df.withColumn("top", udf(f, schema)(col("vec_column")))

rdd.toDF() changes float to None
