Question

I am receiving streaming data from Kafka. By default, dataframe.value is of type "string". For example, dataframe.value is:

1.0,2.0,4,'a'
1.1,2.1,3,'a1'

The schema of dataframe.value:

root
 |-- value: string (nullable = true)

Now I want to define a schema on this DataFrame. This is the schema I want the output to have:

root
 |-- c1: double (nullable = true) 
 |-- c2: double (nullable = true)
 |-- c3: integer (nullable = true)
 |-- c4: string (nullable = true)

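I defined the schema and then loaded the data from Kafka, but I get an error along the lines of "Kafka already defines a schema; a custom schema cannot be applied".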

Any help with this issue would be highly appreciated.

Answer 1

You can define the schema when converting to a DataFrame.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Sample rows standing in for the already-parsed Kafka values
kafkaRdd = sc.parallelize([(1.0, 2.0, 4, 'a'), (1.1, 2.1, 3, 'a1')])
col_names = ["c1", "c2", "c3", "c4"]
col_types = [DoubleType(), DoubleType(), IntegerType(), StringType()]
# toDF() accepts one schema argument, so pair each name with its type
schema = StructType([StructField(n, t, True) for n, t in zip(col_names, col_types)])
df = kafkaRdd.toDF(schema)
df.show()
df.printSchema()

This is the output:

+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
|1.0|2.0|  4|  a|
|1.1|2.1|  3| a1|
+---+---+---+---+

root
 |-- c1: double (nullable = true)
 |-- c2: double (nullable = true)
 |-- c3: integer (nullable = true)
 |-- c4: string (nullable = true)
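
The example above starts from tuples that are already typed, whereas the question starts from a single string column. The following is a minimal sketch of that missing step, not a definitive implementation. The names `parse_value` and `to_typed_columns` are hypothetical, and the sketch assumes every record has exactly four comma-separated fields, with the last one wrapped in single quotes as in the sample data.

```python
import csv


def parse_value(line):
    """Parse one record such as "1.0,2.0,4,'a'" into (float, float, int, str)."""
    # quotechar="'" makes the csv reader strip the quotes around the last field
    c1, c2, c3, c4 = next(csv.reader([line], quotechar="'"))
    return float(c1), float(c2), int(c3), c4


def to_typed_columns(df):
    """Hypothetical helper: split the string column `value` into c1..c4.

    Assumes `df` has a column named `value`. The import is local so that
    parse_value stays usable on a machine without Spark installed.
    """
    from pyspark.sql.functions import col, regexp_replace, split

    # Naive split: this breaks if the string field itself contains a comma
    parts = split(col("value").cast("string"), ",")
    return df.select(
        parts.getItem(0).cast("double").alias("c1"),
        parts.getItem(1).cast("double").alias("c2"),
        parts.getItem(2).cast("int").alias("c3"),
        # Drop the single quotes that wrap the string field
        regexp_replace(parts.getItem(3), "'", "").alias("c4"),
    )


print(parse_value("1.0,2.0,4,'a'"))
```

`parse_value` could be mapped over an RDD of raw strings before calling `toDF` with a schema, while `to_typed_columns` uses only column expressions, so it should also be applicable to a DataFrame read from Kafka without supplying a schema to the reader.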

Convert PySpark DataFrame value to a custom schema
