Question

I am using pyspark (1.6) and elasticsearch-hadoop (5.1.1). I load my data from Elasticsearch into an RDD with:

es_rdd = sc.newAPIHadoopRDD(                                               
    inputFormatClass="org.elasticsearch.hadoop.mr.EsInputFormat",          
    keyClass="org.apache.hadoop.io.NullWritable",                          
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",     
    conf=es_read_conf)

Here es_read_conf is just a dictionary describing my ES cluster, and sc is the SparkContext object. This works fine and I get the RDD object.

I would like to convert it to a DataFrame using:

df = es_rdd.toDF()

But I get the error:

ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling

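Passing sampleSize to the toDF method results in the same error. As I understand it, this happens because PySpark cannot determine the type of every field. I know that some fields in my Elasticsearch cluster are entirely null.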

What is the best way to convert this to a DataFrame?

Answer 1

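The best way is to tell Spark the types of the data you are converting. See the documentation for createDataFrame, in particular the fifth example (the one that uses a StructType).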
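As a rough sketch of what passing an explicit schema could look like: the field names and types below (name, age, comment) are hypothetical and must be replaced with the fields in your own index, and the helper names doc_to_tuple, build_schema and es_rdd_to_df are made up for illustration. It assumes each element of es_rdd is a (document id, dict) pair, which is how the values usually arrive in PySpark.

```python
# Hypothetical field names and types; replace them with the fields in your index.
FIELDS = [("name", "string"), ("age", "long"), ("comment", "string")]


def doc_to_tuple(doc, fields=FIELDS):
    """Return the document's values in schema order, using None for absent fields."""
    return tuple(doc.get(name) for name, _ in fields)


def build_schema(fields=FIELDS):
    """Build a StructType in which every field is nullable."""
    # Imported here so that doc_to_tuple can be tried without Spark installed.
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    spark_types = {"string": StringType(), "long": LongType()}
    return StructType(
        [StructField(name, spark_types[kind], True) for name, kind in fields]
    )


def es_rdd_to_df(sql_context, es_rdd, fields=FIELDS):
    """Convert an RDD of (doc_id, document dict) pairs into a DataFrame."""
    rows = es_rdd.map(lambda pair: doc_to_tuple(pair[1], fields))
    # With an explicit schema Spark does not sample rows to infer types,
    # so columns that contain only nulls are not a problem.
    return sql_context.createDataFrame(rows, build_schema(fields))
```

With the objects from the question, the call would be something like df = es_rdd_to_df(sqlContext, es_rdd). Because every field is declared up front, a column that is null in every document simply becomes a nullable column of the declared type.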

