PySpark SQLContext.createDataFrame produces null when the declared and actual field types do not match

Asked: 2016-07-27 15:55:52

Tags: apache-spark pyspark apache-spark-sql

When converting an RDD to a DataFrame with a specified schema in PySpark (v1.6.2), a field whose value type does not match the type declared in the schema is converted to null:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, DoubleType

sc = SparkContext()
sqlContext = SQLContext(sc)

schema = StructType([
    StructField("foo", DoubleType(), nullable=False)
])

rdd = sc.parallelize([{"foo": 1}])
df = sqlContext.createDataFrame(rdd, schema=schema)

df.show()

+----+
| foo|
+----+
|null|
+----+

Is this a PySpark bug, or just very surprising but intentional behavior? I would expect either a TypeError to be raised, or the int to be converted to a float compatible with DoubleType.

1 answer:

Answer 0 (score: 3):

It is an intended behavior. In particular, see the comment in the corresponding part of the source:

// all other unexpected type should be null, or we will have runtime exception
// TODO(davies): we could improve this by try to cast the object to expected type
case (c, _) => null
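
Given that Spark 1.6 silently nulls out mismatched values rather than casting them, one workaround (a sketch of my own, not part of the original answer) is to coerce each record's values to the Python types the schema expects before calling createDataFrame. The helper below is plain Python; the field name "foo" matches the question's schema:

```python
def coerce_to_double(record, fields=("foo",)):
    """Return a copy of record with the named fields cast to float,
    so DoubleType columns receive floats instead of ints."""
    out = dict(record)
    for name in fields:
        if out.get(name) is not None:
            out[name] = float(out[name])
    return out

# With Spark, this would be applied to the RDD before building the
# DataFrame, e.g.:
#   df = sqlContext.createDataFrame(rdd.map(coerce_to_double), schema=schema)

print(coerce_to_double({"foo": 1}))  # {'foo': 1.0}
```

With the values pre-cast to float, the DoubleType column is populated instead of coming back as null.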