I have a CSV file with the following header and data:
Date,Transaction,Name,Memo,Amount
12/31/2018,DEBIT,Amazon stuff,24000978364666403396802,-62.48
I want to override the column names as follows:
transaction,credit_debit,description,memo,amount
This is how I manually specify the schema to use and then read the file:
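import org.apache.spark.sql.DataFrameReader;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Explicit schema: names and types for the five CSV columns, all nullable
StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("transaction_date", DataTypes.TimestampType, true),
    DataTypes.createStructField("credit_debit", DataTypes.StringType, true),
    DataTypes.createStructField("description", DataTypes.StringType, true),
    DataTypes.createStructField("memo", DataTypes.StringType, true),
    DataTypes.createStructField("amount", DataTypes.DoubleType, true)
});
String csvPath = "input/mytransactions.csv";
// spark is an existing SparkSession
DataFrameReader dataFrameReader = spark.read();
Dataset<Row> dataFrame =
    dataFrameReader
        .format("org.apache.spark.csv")
        .option("header", "true")
        .option("inferSchema", false)
        .schema(schema)
        .csv(csvPath);
dataFrame.show(20);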
However, when I read the file, the actual column values all come back as null.
+----------------+------------+-----------+----+------+
|transaction_date|credit_debit|description|memo|amount|
+----------------+------------+-----------+----+------+
| null| null| null|null| null|
| null| null| null|null| null|
| null| null| null|null| null|
Any idea what I'm doing wrong?
Answer 0 (score: 0)
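The problem is with the date column: the CSV reader is missing the dateFormat option, so it cannot parse values such as 12/31/2018. In Spark 2.4 and earlier, a row with a field that fails to parse comes back with every column null in the default PERMISSIVE mode, which is why the whole table looks empty. Note that dateFormat applies to DateType columns (TimestampType uses timestampFormat), so the schema here uses DateType. Code below.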
StructType schema = DataTypes.createStructType(new StructField[] {
    // DateType, so that the dateFormat option below applies to this column
    DataTypes.createStructField("transaction_date", DataTypes.DateType, true),
    DataTypes.createStructField("credit_debit", DataTypes.StringType, true),
    DataTypes.createStructField("description", DataTypes.StringType, true),
    DataTypes.createStructField("memo", DataTypes.StringType, true),
    DataTypes.createStructField("amount", DataTypes.DoubleType, true)
});
Dataset<Row> dataFrame =
    dataFrameReader
        .format("org.apache.spark.csv")
        .option("header", "true")
        // lowercase yyyy is the calendar year; uppercase YYYY means week-based year
        .option("dateFormat", "MM/dd/yyyy")
        .option("inferSchema", false)
        .schema(schema)
        .csv(csvPath);
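One thing to watch in the pattern is letter case: in java.time patterns, yyyy is the calendar year and YYYY is the week-based year. Here is a minimal sketch using only the JDK, so you can try a pattern against a sample value before handing it to Spark. The class name DatePatternDemo and its helper methods are made up for illustration, and how Spark itself treats a given pattern depends on the Spark version, so take this as a check of the pattern letters only.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Locale;

// Hypothetical helper class, not part of Spark: tries a date pattern on sample text.
public class DatePatternDemo {

    // Parses text with the given pattern; throws DateTimeParseException on failure.
    static LocalDate parseDate(String pattern, String text) {
        return LocalDate.parse(text, DateTimeFormatter.ofPattern(pattern, Locale.US));
    }

    // Returns true if the text can be turned into a LocalDate with the pattern.
    static boolean parses(String pattern, String text) {
        try {
            parseDate(pattern, text);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String sample = "12/31/2018"; // value from the CSV in the question

        System.out.println(parses("MM/dd/yyyy", sample)); // true
        // Week-based year plus month and day is not enough to build a date
        System.out.println(parses("MM/dd/YYYY", sample)); // false
        // An ISO-style pattern does not match the slash-separated value
        System.out.println(parses("yyyy-MM-dd", sample)); // false
    }
}
```

If the pattern you plan to pass as dateFormat fails here, it is worth fixing before debugging anything on the Spark side.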
Answer 1 (score: 0)
I just wanted to rename the columns. This did it:
Dataset<Row> dataFrame =
    dataFrameReader
        .format("org.apache.spark.csv")
        .option("header", "true")
        .option("inferSchema", true)
        .csv(csvPath);
// Rename Columns
dataFrame = dataFrame.toDF("transaction_date", "debit_credit", "description", "memo", "amount");