Question

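My CSV file has some header information in its first 7 lines. The actual column names start on line 8, so how can I skip the first 7 lines in AWS Glue? Any ideas?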

Answer 1

You can read the file by supplying a custom schema, for example:

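from pyspark.sql.functions import col, monotonically_increasing_id
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField('Id', StringType(), True), StructField('Name', StringType(), True)])

# Now read your CSV; DROPMALFORMED discards rows that do not fit the schema
df = spark.read.option("mode", "DROPMALFORMED").csv(path, schema=schema)

# Now the DataFrame holds only rows that match your schema. If you still want to remove some top rows, you can use
# (note: the generated IDs are unique and increasing, but consecutive only within a partition)
df = df.withColumn("Index", monotonically_increasing_id()).filter(col("Index") > 7).drop("Index")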

As far as I know, there is no shortcut for this.
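If the file is small enough to pre-process outside Spark (for example in a Glue Python shell job), a plain-Python alternative is to drop the leading lines before parsing. The following is only a minimal sketch of that swapped-in technique, not the Spark approach above; the function name `read_csv_skipping_preamble` and the sample data are hypothetical, and it assumes no quoted field in the preamble spans several lines.

```python
import csv
import io
from itertools import islice


def read_csv_skipping_preamble(text, skip=7):
    """Parse CSV text, ignoring the first `skip` lines.

    The first line after the skipped ones is treated as the header row.
    """
    lines = io.StringIO(text)
    # islice lazily consumes and discards the first `skip` lines
    remaining = islice(lines, skip, None)
    return list(csv.DictReader(remaining))


# Hypothetical sample: 7 lines of preamble, then the column names on line 8
sample = "".join(f"info line {i}\n" for i in range(1, 8)) + "Id,Name\n1,Alice\n2,Bob\n"
rows = read_csv_skipping_preamble(sample)
```

Here line 8 supplies the column names, so no explicit schema is needed; the cleaned rows could then be written back out and read by Spark with the usual header option.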

How to skip the first N rows of a CSV in AWS Glue
