Question

I am trying to convert the following RDD (the output of rdd.take(1)) to a DataFrame:

[[(697, [7, 7, 0.368, 1, 1, 0, 0.879]), (403, [1, 1, 0.0, 0, 0, 0, 0.4]), (485, [3, 4, 0.062, 1, 1, 0, 0.372])]]

using

rdd.map(lambda p: get_m(slist)).toDF().show(1,False)

where get_m returns a sorted list:

sorted(m.items(), key=lambda x:x[1][6], reverse = True)[0:3]

The result is:

|[[697,WrappedArray(7, 7, null, 1, 1, 0, null)],[403,WrappedArray(1, 1, null, 0, 0, 0, null)],[485,WrappedArray(3, 4, null, 1, 1, 0, null)]]

All of my float values are automatically inferred as null.

So I tried supplying my own schema, as follows:

field1 = [
StructField("x1", IntegerType(), True),
StructField("x2", IntegerType(), True),
StructField("x3", FloatType(), True),
StructField("x4", IntegerType(), True),
StructField("x5", IntegerType(), True),
StructField("x6", IntegerType(), True),
StructField("x7", FloatType(), True)
]

field2 = StructType([
StructField("id", IntegerType(), True),
StructField("result", StructType(field1), True)
])

schema = StructType([
StructField("match_1", StructType(field2), True),
StructField("match_2", StructType(field2), True),
StructField("match_3", StructType(field2), True)
])

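But this doesn't work: I get an error saying that StructField is not iterable. I also tried unpacking with Row(**), and that doesn't work either. It looks like the way I am writing the struct type is wrong. Ideally, I want a DataFrame like this: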

 id | results

697 | [1,1,0,1,5,4.3]

403 | [1,1,0.6,1,2,4.5]

485 | [1,1,0,1,0,9.3]

Answer 1

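Spark arrays must contain values of a single type, and a list that mixes ints and floats doesn't satisfy this requirement. Since schema inference checks only the first element, it assumes the column is array<integer> and treats all other values as invalid.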

You should cast all values to floats before converting to a DataFrame:

rdd = sc.parallelize([(697, [1,1,0,1,5,4.3])])
rdd.map(lambda x: (x[0], [float(v) for v in x[1]])).toDF()
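
Applied to the data in the question, the same cast could be combined with flatMap so that each (id, values) pair becomes its own row, which matches the id | results layout asked for. This is only a sketch: to_float_rows, matches_to_df, and the column names are hypothetical, and it assumes each RDD element is a list of (id, values) pairs as returned by get_m.

```python
def to_float_rows(matches):
    """Cast every value to float so each array holds a single type."""
    return [(match_id, [float(v) for v in values]) for match_id, values in matches]


def matches_to_df(spark, rdd):
    """Build an (id, results) DataFrame; assumes `rdd` holds lists of (id, values)."""
    # Imported here so the helper above can be tried without Spark installed.
    from pyspark.sql.types import (ArrayType, DoubleType, IntegerType,
                                   StructField, StructType)

    schema = StructType([
        StructField("id", IntegerType(), True),
        StructField("results", ArrayType(DoubleType()), True),
    ])
    # flatMap emits one row per (id, values) pair instead of one row per list.
    return spark.createDataFrame(rdd.flatMap(to_float_rows), schema)


sample = [(697, [7, 7, 0.368, 1, 1, 0, 0.879]), (403, [1, 1, 0.0, 0, 0, 0, 0.4])]
print(to_float_rows(sample))
```

With an active SparkSession named spark (an assumption), matches_to_df(spark, rdd).show(3, False) should then print one row per match. The explicit schema is optional here; once the values are all floats, plain toDF() can infer the array type as well.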

If you want to keep mixed types, and all lists are the same size, you can use a tuple instead of a list:

rdd.map(lambda x: (x[0], tuple(x[1]))).toDF()
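
With the tuple route each position becomes a struct field, so every row should carry the same Python type in a given position; otherwise the same null problem can reappear for that field. A minimal sketch, assuming the seven-field layout from the question's field1 (ints except for x3 and x7); CASTS and to_typed_tuple are hypothetical names:

```python
# One cast per position, mirroring x1..x7 in the question's schema (an assumption).
CASTS = (int, int, float, int, int, int, float)


def to_typed_tuple(record):
    """Return (id, tuple) with every position coerced to its expected type."""
    match_id, values = record
    # zip stops at the shorter input, so lists are assumed to have seven items.
    return (match_id, tuple(cast(v) for cast, v in zip(CASTS, values)))


print(to_typed_tuple((403, [1, 1, 0, 0, 0, 0, 0.4])))
# With Spark available: rdd.map(to_typed_tuple).toDF()
```

The third value is an int on input and a float on output, which keeps that position consistent across rows.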

Float values are inferred as null during RDD to DataFrame conversion
