My next question is not new, but I want to understand it step by step.
In a Spark application I create a DataFrame. Let's call it df. Spark version: 2.4.0.
val df: DataFrame = Seq(
  ("Alex", "2018-01-01 00:00:00", "2018-02-01 00:00:00", "OUT"),
  ("Bob", "2018-02-01 00:00:00", "2018-02-05 00:00:00", "IN"),
  ("Mark", "2018-02-01 00:00:00", "2018-03-01 00:00:00", "IN"),
  ("Mark", "2018-05-01 00:00:00", "2018-08-01 00:00:00", "OUT"),
  ("Meggy", "2018-02-01 00:00:00", "2018-02-01 00:00:00", "OUT")
).toDF("NAME", "START_DATE", "END_DATE", "STATUS")

How can I create a .csv file from this DataFrame and put the CSV file into a specific folder on the server? For example, is the code below correct? I noticed that some people use coalesce or repartition for this task, but I don't know which one would be better.
When I try to use the next code, it raises an ERROR:

union.write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .save("/home/reports/")

org.apache.hadoop.security.AccessControlException: Permission denied: user=root, access=WRITE, inode="/home/reports/_temporary/0":hdfs:hdfs:drwxr-xr-x

I run the Spark application as the root user. The reports folder was created by the root user with the following command:

mkdir -m 777 reports

It seems that only the hdfs user can write files.
Answer 0 (score: 2)
I believe you are confused about how Spark works, so I suggest you first read the official documentation and/or some tutorials. Nevertheless, I hope this answers your question.

This code saves the DataFrame as a single CSV file on the local file system. It has been tested with Spark 2.4.0 and Scala 2.12.8 on an Ubuntu 18.04 laptop.
import org.apache.spark.sql.SparkSession

val spark =
  SparkSession
    .builder
    .master("local[*]")
    .appName("CSV Writer Test")
    .getOrCreate()
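// spark.implicits._ brings the implicit encoders and conversions that make Seq(...).toDF(...) available below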
import spark.implicits._
val df =
  Seq(
    ("Alex", "2018-01-01 00:00:00", "2018-02-01 00:00:00", "OUT"),
    ("Bob", "2018-02-01 00:00:00", "2018-02-05 00:00:00", "IN"),
    ("Mark", "2018-02-01 00:00:00", "2018-03-01 00:00:00", "IN"),
    ("Mark", "2018-05-01 00:00:00", "2018-08-01 00:00:00", "OUT"),
    ("Meggy", "2018-02-01 00:00:00", "2018-02-01 00:00:00", "OUT")
  ).toDF("NAME", "START_DATE", "END_DATE", "STATUS")
df.printSchema
// root
// |-- NAME: string (nullable = true)
// |-- START_DATE: string (nullable = true)
// |-- END_DATE: string (nullable = true)
// |-- STATUS: string (nullable = true)
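// coalesce(1) merges all existing partitions into one, so exactly one part-*.csv file is written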
df.coalesce(numPartitions = 1)
  .write
  .option(key = "header", value = "true")
  .option(key = "sep", value = ",")
  .option(key = "encoding", value = "UTF-8")
  .option(key = "compression", value = "none")
  .mode(saveMode = "OVERWRITE")
  .csv(path = "file:///home/balmungsan/dailyReport/") // Change the path. Note there are 3 /, the first two are for the file protocol, the third one is for the root folder.
spark.stop()
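About the coalesce versus repartition part of the question: both can force a single output file, but coalesce(1) merges the existing partitions without a shuffle (the final stage then runs as a single task), while repartition(1) performs a full shuffle yet lets the upstream stages keep their parallelism. A minimal sketch of the repartition variant, reusing the same path and options as above:

// repartition(1) shuffles all rows into one partition before writing;
// prefer it when the computation before the write is heavy and should stay parallel.
df.repartition(numPartitions = 1)
  .write
  .option(key = "header", value = "true")
  .mode(saveMode = "OVERWRITE")
  .csv(path = "file:///home/balmungsan/dailyReport/")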
Now, let's check the saved file.
balmungsan@BalmungSan:dailyReport $ pwd
/home/balmungsan/dailyReport
balmungsan@BalmungSan:dailyReport $ ls
part-00000-53a11fca-7112-497c-bee4-984d4ea8bbdd-c000.csv _SUCCESS
balmungsan@BalmungSan:dailyReport $ cat part-00000-53a11fca-7112-497c-bee4-984d4ea8bbdd-c000.csv
NAME,START_DATE,END_DATE,STATUS
Alex,2018-01-01 00:00:00,2018-02-01 00:00:00,OUT
Bob,2018-02-01 00:00:00,2018-02-05 00:00:00,IN
Mark,2018-02-01 00:00:00,2018-03-01 00:00:00,IN
Mark,2018-05-01 00:00:00,2018-08-01 00:00:00,OUT
Meggy,2018-02-01 00:00:00,2018-02-01 00:00:00,OUT
The _SUCCESS file exists to signal that the write finished successfully.
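If the _SUCCESS marker is not wanted next to the report, it can be suppressed through the underlying Hadoop output committer; this is a sketch assuming the standard Hadoop property, set before the write runs:

// Ask the Hadoop output committer not to create the _SUCCESS marker file.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")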
Note that you have to use the file:// protocol to save to the local file system instead of HDFS.
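Regarding the AccessControlException in the question: a bare path like /home/reports/ is resolved against the cluster's default file system, HDFS in that setup, so the local mkdir -m 777 reports has no effect there; the HDFS /home/reports tree is owned by hdfs:hdfs with drwxr-xr-x, which root cannot write to. If HDFS really is the intended target, writing to a fully qualified URI under a directory the submitting user owns should work. A sketch with a hypothetical namenode address and user directory:

// The explicit hdfs:// scheme removes any dependence on fs.defaultFS resolution.
// Assumes /user/root exists in HDFS and is owned by root; namenode:8020 is a placeholder address.
df.coalesce(numPartitions = 1)
  .write
  .option(key = "header", value = "true")
  .mode(saveMode = "OVERWRITE")
  .csv(path = "hdfs://namenode:8020/user/root/reports")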