Question

I am reading a csv file with pyspark using a predefined schema.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType

schema = StructType([
    StructField("col1", IntegerType(), True),
    StructField("col2", StringType(), True),
    StructField("col3", FloatType(), True),
])

df = (
    spark.sqlContext.read
    .schema(schema)
    .option("header", True)
    .option("delimiter", ",")
    .csv(path)
)

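Now, in the csv file, col1 contains float values and col3 contains string values. I need to raise an exception and get the names of these columns (col1, col3), because they contain values whose data type differs from the one defined in the schema.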

How can I achieve this?

Answer 1

In pyspark versions > 2.2, you can use columnNameOfCorruptRecord with csv:

schema = StructType(
    [
        StructField("col1", IntegerType(), True),
        StructField("col2", StringType(), True),
        StructField("col3", FloatType(), True),
        StructField("corrupted", StringType(), True),
    ]
)

df = spark.sqlContext.read.csv(
    path,
    schema=schema,
    header=True,
    sep=",",
    mode="PERMISSIVE",
    columnNameOfCorruptRecord="corrupted",
)
# show() returns None, so call it separately to keep df usable afterwards
df.show()

+----+----+----+------------+
|col1|col2|col3|   corrupted|
+----+----+----+------------+
|null|null|null|0.10,123,abc|
+----+----+----+------------+

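Edit: The fields of a CSV record are not independent of one another, so in general you cannot say that one field is corrupted while the others are not. Only the record as a whole can be corrupted or not.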

For example, suppose we have a comma-delimited file with one row and two float columns, holding the euro values 0,10 and 1,00. The file looks like this:

col1,col2
0,10,1,00

Which field is corrupted?
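
The ambiguity can be reproduced without Spark. The sketch below uses Python's standard csv module to show that the data line splits into four fields under a two-column header, so there is no way to map the pieces back to col1 and col2. The helper name `field_count_mismatches` is hypothetical and only serves this illustration.

```python
import csv
import io


def field_count_mismatches(text, delimiter=","):
    """Return (line_number, field_count) for every record whose number of
    fields differs from the header's. Line numbers are 1-based and the
    header is line 1."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows:
        return []
    header, records = rows[0], rows[1:]
    return [
        (line_no, len(record))
        for line_no, record in enumerate(records, start=2)
        if len(record) != len(header)
    ]


sample = "col1,col2\n0,10,1,00\n"
print(field_count_mismatches(sample))  # [(2, 4)]
```

A record-level check like this can tell that line 2 is malformed, but not which column is at fault, which is the point of the example above.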
