Question

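I'm new to Pyspark and have been struggling to accomplish something I thought would be simple. I'm trying to build an ETL process that converts a CSV file to a Parquet file. The CSV file has a few simple columns, but one column is a delimited array of integers that I want to expand/unpack into the Parquet file. The Parquet file is actually consumed by a .NET Core microservice, which uses a Parquet reader for downstream calculations. To keep this question simple, the column is structured as: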

"geomap" 5:3:7|4:2:1|8:2:78 -> this represents an array of 3 items: split on "|", then build tuples with the values (5,3,7), (4,2,1), (8,2,78)

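I've tried various processes and schemas, but I can't get it right. Via a UDF I'm creating either a list of lists or a list of tuples, but I can't get the schema right or unpack the data into the Parquet write operation. I get either nulls, errors, or other problems. Do I need to take a different approach? The relevant code is below. For simplicity I'm showing only the problem column, since the rest works. This is my first attempt at Pyspark, so apologies if I'm missing something obvious: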

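from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

def convert_geo(geo):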
   return [tuple(x.split(':')) for x in geo.split('|')]

compression_type = 'snappy'

schema = ArrayType(StructType([
    StructField("c1", IntegerType(), False),
    StructField("c2", IntegerType(), False),
    StructField("c3", IntegerType(), False)
]))

spark_convert_geo = udf(lambda z: convert_geo(z),schema)

source_path = '...path to csv'
destination_path = 'path for generated parquet file'

df = (
    spark.read
    .option('delimiter', ',')
    .option('header', 'true')
    .csv(source_path)
    .withColumn("geomap", spark_convert_geo(col('geomap')).alias("geomap"))
)
df.write.mode("overwrite").format('parquet').option('compression', compression_type).save(destination_path)

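Edit: Adding the printSchema() output as requested. I can't see what is wrong here either. I still can't seem to get the split string values to display or render correctly. This includes all the columns. I do see the c1, c2 and c3 struct names...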

root
 |-- lrsegid: integer (nullable = true)
 |-- loadsourceid: integer (nullable = true)
 |-- agencyid: integer (nullable = true)
 |-- acres: float (nullable = true)
 |-- sourcemap: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- geomap: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- c1: integer (nullable = false)
 |    |    |-- c2: integer (nullable = false)
 |    |    |-- c3: integer (nullable = false)

Answer 1

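The problem is that the convert_geo function returns a list of tuples with string elements rather than the integers specified in the schema. If you modify it as follows, it will work: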

def convert_geo(geo):
    return [tuple([int(y) for y in x.split(':')]) for x in geo.split('|')]
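
As a quick sanity check outside Spark, the conversion can be tried in plain Python. This is only a sketch: the null/empty handling is an assumption on my part and is not in the original code. A Python UDF receives None for a null cell, in which case geo.split would raise, so you may want to guard for it.

```python
from typing import List, Optional, Tuple


def convert_geo(geo: Optional[str]) -> Optional[List[Tuple[int, ...]]]:
    """Parse a string such as '5:3:7|4:2:1' into [(5, 3, 7), (4, 2, 1)]."""
    # Assumption: a null or empty cell should stay null instead of raising.
    if not geo:
        return None
    # Split items on '|', then split each item on ':' and cast every part to int
    # so the values match the IntegerType fields declared in the schema.
    return [tuple(int(part) for part in item.split(":")) for item in geo.split("|")]


print(convert_geo("5:3:7|4:2:1|8:2:78"))  # [(5, 3, 7), (4, 2, 1), (8, 2, 78)]
```

The same function can then be wrapped with udf(convert_geo, schema) exactly as in the question.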

New to Pyspark - importing a CSV and creating a Parquet file with array columns
