I have the following code, which loads a large number of ".csv.gz" files and dumps them into another folder, adding the source file name as a column.
import java.io.File

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

object DailyMerger extends App {
  // Recursively collect every file under the given directory.
  def allFiles(path: File): List[File] = {
    val parts = path.listFiles.toList.partition(_.isDirectory)
    parts._2 ::: parts._1.flatMap(allFiles)
  }

  val sqlContext = SparkSession.builder().appName("DailyMerger").master("local").getOrCreate()

  // Absolute paths of every *.csv.gz under /Logs/.
  val files = allFiles(new File("/Logs/"))
    .map(_.getAbsolutePath())
    .filter(_.endsWith(".csv.gz"))

  // Read all CSVs, tag each row with its source file, and write the merged output.
  sqlContext
    .read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load(files: _*)
    .withColumn("SENSOR", input_file_name())
    .write
    .option("header", "true")
    .option("compression", "gzip")
    .csv("/tmp/out")
}
It works as expected on my test data. But in my real data, many of the files have ':' in their names.
When Hadoop tries to create the associated .crc file, this results in the following exception:
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: .ac:a3:1e:c6:5c:7c.csv.gz.crc
at org.apache.hadoop.fs.Path.initialize(Path.java:206)
at org.apache.hadoop.fs.Path.<init>(Path.java:172)
at org.apache.hadoop.fs.Path.<init>(Path.java:94)
at org.apache.hadoop.fs.ChecksumFileSystem.getChecksumFile(ChecksumFileSystem.java:90)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:145)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:346)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:766)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:85)
at org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.<init>(HadoopFileLinesReader.scala:46)
at org.apache.spark.sql.execution.datasources.text.TextFileFormat$$anonfun$buildReader$2.apply(TextFileFormat.scala:105)
at org.apache.spark.sql.execution.datasources.text.TextFileFormat$$anonfun$buildReader$2.apply(TextFileFormat.scala:104)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:136)
at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:120)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:124)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: .ac:a3:1e:c6:5c:7c.csv.gz.crc
at java.net.URI.checkPath(URI.java:1823)
at java.net.URI.<init>(URI.java:745)
at org.apache.hadoop.fs.Path.initialize(Path.java:203)
Renaming the input files is not an option, so what else is left?
Answer 0 (score: 3)
There is not much to add to what @tzach-zohar said. There is a long history of attempts to fix this, and it is a significant problem.
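For context, the failure can be reproduced without Spark at all. The sketch below is only an illustration of the root cause (it assumes hadoop-common on the classpath and reuses the .crc name from the stack trace above): Hadoop's Path treats everything before the first ':' as a URI scheme, so the remainder becomes a relative path inside an "absolute" URI.

import org.apache.hadoop.fs.Path

// ".ac" is parsed as a URI scheme, leaving "a3:1e:..." as a relative path,
// which java.net.URI rejects for a URI that already carries a scheme.
new Path(".ac:a3:1e:c6:5c:7c.csv.gz.crc")
// => java.lang.IllegalArgumentException: java.net.URISyntaxException:
//    Relative path in absolute URI: .ac:a3:1e:c6:5c:7c.csv.gz.crc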
The relevant JIRA tickets are:
Since all of those JIRAs are still open, I would say that renaming the files, or using something other than HDFS, are the only options.
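If you stay on a Hadoop-backed file system, one way to apply the "rename" option without touching the originals is to copy the inputs into a staging directory with the colons replaced before Spark reads them. A minimal sketch of that idea, reusing the allFiles helper from the question (the /tmp/staging path and the ':' -> '_' substitution are assumptions, not part of this answer):

import java.io.File
import java.nio.file.{Files, StandardCopyOption}

// Copy every *.csv.gz into a staging directory, replacing ':' in the file
// name so that Hadoop's Path can parse it.
val staging = new File("/tmp/staging")
staging.mkdirs()

allFiles(new File("/Logs/"))
  .filter(_.getName.endsWith(".csv.gz"))
  .foreach { f =>
    val sanitized = f.getName.replace(':', '_')
    Files.copy(f.toPath, new File(staging, sanitized).toPath,
      StandardCopyOption.REPLACE_EXISTING)
  }

Note that after the copy, input_file_name() will report the sanitized name, so the substitution should be reversible if the SENSOR column needs the exact original file name.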
Answer 1 (score: 0)
I ran into the same problem using Jupyter and PySpark on GCP. I found a workaround by saving the file paths into a list:
path = "gs://path/*.csv" # can include filenames with with ':'
paths_file = "log_path_list.txt"
!gsutil ls -r $path > $paths_file
path_list = open(paths_file, 'r').read().split("\n")[:-1]
t = spark.read.csv(path_list)