Question

I have a list of dictionaries, say:

list_ = [
 {u'column1': u'test1', u'column2': u'None'},
 {u'added_column1': u'test2', u'column2': u'None'}]

The first row has two columns: column1 and column2.

The second row has two columns: added_column1 and column2.

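I want to create a Spark DataFrame from this data, and the DataFrame's columns should follow the list as its keys change.

Is there a long-term solution?

Currently I use: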

spark.createDataFrame(list_).show()

This works, but I get this warning:

UserWarning: inferring schema from dict is deprecated,please use pyspark.sql.Row instead
  warnings.warn("inferring schema from dict is deprecated,"

Answer 1

You can call toDF() on an RDD and pass sampleRatio, the fraction of rows that is sampled to infer the schema when converting to a DataFrame.

list_ = [
 {u'column1': u'test1', u'column2': u'None'},
 {u'added_column1': u'test2', u'column2': u'None'}]

sc.parallelize(list_).toDF(sampleRatio=0.9).show()
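If sampling feels fragile, another option is to skip inference and pass an explicit schema built from the keys that actually occur in the list. The sketch below is only one way to do that: `ddl_schema` is a hypothetical helper (not a PySpark function), it assumes every value is a string as in the example, and the Spark call is left as a comment because it needs a live SparkSession named `spark`.

```python
def ddl_schema(records, sql_type="string"):
    """Build a DDL schema string covering every key seen in `records`.

    Hypothetical helper for illustration; assumes all columns share one type.
    """
    # Union of keys across all dicts, sorted for a stable column order.
    columns = sorted({key for record in records for key in record})
    return ", ".join("{} {}".format(col, sql_type) for col in columns)


list_ = [
    {'column1': 'test1', 'column2': 'None'},
    {'added_column1': 'test2', 'column2': 'None'},
]

schema = ddl_schema(list_)

# Assumption: `spark` is an existing SparkSession. With an explicit schema
# no inference from dicts takes place, and keys missing from a dict are
# expected to come out as null:
#
#   spark.createDataFrame(list_, schema=schema).show()
```

Because the schema is recomputed from the list each time, new keys show up as new columns without touching the code.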

Creating the DataFrame from Row objects (built from the dicts) requires every row to have the same number of columns:

from pyspark.sql import Row
spark.createDataFrame(list(map(lambda x: Row(**x), list_))).show()

The code above fails with the error: Input row doesn't have expected number of values required by the schema. 3 fields are required while 2 values are provided.
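One way around that error is to pad every dict to the union of all keys before building the Rows, so each Row ends up with the same fields. This is a minimal sketch, not part of the original answer: `normalize_records` is a hypothetical helper name, and the Spark call is shown only as a comment because it assumes an existing SparkSession called `spark`.

```python
def normalize_records(records):
    """Pad every dict with None so that all records share the same keys."""
    # Union of keys across all dicts, sorted for a stable column order.
    columns = sorted({key for record in records for key in record})
    # dict.get returns None for keys a record does not have.
    return [{col: record.get(col) for col in columns} for record in records]


list_ = [
    {'column1': 'test1', 'column2': 'None'},
    {'added_column1': 'test2', 'column2': 'None'},
]

padded = normalize_records(list_)

# Assumption: `spark` is an existing SparkSession. Every padded record has
# the same three keys, so the Rows built from them have matching fields:
#
#   from pyspark.sql import Row
#   spark.createDataFrame([Row(**r) for r in padded]).show()
```

Here `padded[0]` gains `added_column1` set to `None`, and `padded[1]` gains `column1` set to `None`.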

Creating a Spark DataFrame from a list of dictionaries with different structures
