Creating a PySpark schema involving ArrayType

Posted: 2018-01-23 05:19:10

Tags: pyspark schema spark-dataframe rdd

I am trying to create a schema for my new DataFrame and have tried various combinations of brackets and keywords, but I can't figure out how to make it work. My current attempt:

from pyspark.sql.types import *

schema = StructType([
  StructField("User", IntegerType()),
  ArrayType(StructType([
    StructField("user", StringType()),
    StructField("product", StringType()),
    StructField("rating", DoubleType())]))
  ])

comes back with the error:

Traceback (most recent call last):
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/types.py", line 290, in __init__
    assert isinstance(elementType, DataType), "elementType should be DataType"
AssertionError: elementType should be DataType

I have googled around, but so far haven't found a good example of a schema with an array of objects.

1 answer:

Answer 0 (score: 5)

You need an additional StructField to wrap the ArrayType. This should work:

from pyspark.sql.types import *

schema = StructType([
    StructField("User", IntegerType()),
    StructField("My_array", ArrayType(
        StructType([
            StructField("user", StringType()),
            StructField("product", StringType()),
            StructField("rating", DoubleType())
        ])
    ))
])

For more details, see this link: http://nadbordrozd.github.io/blog/2016/05/22/one-weird-trick-that-will-fix-your-pyspark-schemas/
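
As a follow-up, here is a minimal sketch of how the corrected schema could be used to build and inspect a DataFrame. The SparkSession setup and the sample rows (the field name My_array, the users, products and ratings) are illustrative assumptions, not part of the original question:

from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, ArrayType,
                               IntegerType, StringType, DoubleType)

spark = SparkSession.builder.getOrCreate()

# Same schema as in the answer above: an integer column plus an
# array column whose elements are (user, product, rating) structs.
schema = StructType([
    StructField("User", IntegerType()),
    StructField("My_array", ArrayType(
        StructType([
            StructField("user", StringType()),
            StructField("product", StringType()),
            StructField("rating", DoubleType())
        ])
    ))
])

# Hypothetical sample rows; each struct element is passed as a tuple
# matching the (user, product, rating) fields of the inner StructType.
rows = [
    (1, [("alice", "book", 4.5), ("alice", "pen", 3.0)]),
    (2, [("bob", "lamp", 5.0)])
]

df = spark.createDataFrame(rows, schema)
df.printSchema()
df.show(truncate=False)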