Question

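In my project I need to attach a surrogate key to the processed data (after reading from CSV: validation -> data cleansing -> DataFrame enrichment) before storing it in HDFS. I am using zipWithIndex to do this. It works well and meets my requirements. However, it changes the data type in the resulting DataFrame from IntegerType to LongType. I can live with LongType, but I am curious why this happens.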

I wrote the function below (simplified for posting here) to attach a surrogate key to a given input DataFrame:

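def attachSGK(p_df, p_offset=1, p_colName="sgk"):
    # New column list: the surrogate key column first, then the original columns.
    lv_cols = [p_colName] + p_df.columns
    # zipWithIndex pairs each Row with a 0-based index: (Row, index).
    lv_zipped_rdd = p_df.rdd.zipWithIndex()
    # Prepend the shifted index to the row values; toDF gets names only, so types are inferred.
    lv_new_df = lv_zipped_rdd.map(lambda row: (row[1] + p_offset,) + tuple(row[0])).toDF(lv_cols)
    return lv_new_df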

>>oneDF.printSchema()

root  
  |-- trade_id: string (nullable = true)  
  |-- trade_version_id: integer (nullable = true)

>> oneDF.groupBy("trade_version_id").count().show(10,False)

+----------------+-----+
|trade_version_id|count|
+----------------+-----+
|1               |10   |
+----------------+-----+

>>resDF = attachSGK(oneDF,123,"trade_sgk")

>>resDF.printSchema()

root
 |-- trade_sgk: long (nullable = true)
 |-- trade_id: string (nullable = true)
 |-- trade_version_id: long (nullable = true)

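Before posting this I spent some time searching Google. One site said that when no schema is specified, Spark samples the given data and infers the schema/types of the resulting DataFrame from that sample. If that is the case, the values of trade_version_id are nowhere near large enough to warrant LongType.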
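
A possible explanation, offered tentatively: once the rows pass through p_df.rdd they are plain Python values, and a Python int carries no bit width, so inference from column names alone appears to map every int to LongType no matter how small the values are. A minimal sketch of one way to keep the original types is to pass a schema string built from the input DataFrame's dtypes instead of only the column names. The names schema_string_with_key and attach_sgk_keep_types are hypothetical, and the sketch assumes simple column names and column types that Spark's datatype-string parser accepts.

```python
def schema_string_with_key(dtypes, key_name="sgk"):
    """Build a datatype string such as 'sgk: bigint, trade_id: string'.

    `dtypes` is a list of (column_name, type_string) pairs, which is the
    shape returned by DataFrame.dtypes. The key column is declared as bigint
    because zipWithIndex produces long indices.
    """
    fields = [(key_name, "bigint")] + list(dtypes)
    return ", ".join(f"{name}: {dtype}" for name, dtype in fields)


def attach_sgk_keep_types(df, offset=1, key_name="sgk"):
    """Hypothetical variant of attachSGK that reuses the input column types."""
    # Describe the result explicitly so that no type inference is needed.
    schema = schema_string_with_key(df.dtypes, key_name)
    # Same zipWithIndex step as before: (Row, index) -> (index + offset, *row values).
    keyed_rdd = df.rdd.zipWithIndex().map(
        lambda pair: (pair[1] + offset,) + tuple(pair[0])
    )
    # Passing a schema string (rather than a list of names) fixes the column types.
    return keyed_rdd.toDF(schema)
```

With the oneDF shown above, attach_sgk_keep_types(oneDF, 123, "trade_sgk") would be expected to leave trade_version_id as integer, although this has not been run against the original data or Spark version.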

Hoping to get some pointers from the experts.

~Balaji

Data type changes after zipWithIndex

0 answers: