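I want to read many parquet files that live under different paths. First, I would like to build a list of strings containing all the paths that match: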
\local\id_*_*\data\version.*\*
For example:
\local\id_231_2232318\data\version.501\part1.parquet
...
\local\id_7_456\data\version.502\part1.parquet
\local\id_7_456\data\version.502\part2.parquet
How can I do this?
Maybe this helps, although it is a different implementation: Spark read multiple directories into multiple dataframes
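import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.DataFrame

// Forward slashes: "\l" is not a valid Scala escape, and Hadoop globs treat "\" as an escape character
val path = "/local/"
val hadoopConf = new Configuration()
val fs = FileSystem.get(hadoopConf)
// Each entry is already one of the id_*_* directories directly under /local
val paths: Array[String] = fs.listStatus(new Path(path))
  .filter(_.isDirectory)
  .map(_.getPath.toString)
// One DataFrame per id directory; `spark` is the active SparkSession (as in spark-shell)
val dfs: Array[DataFrame] = paths
  .map(p => spark.read.parquet(p + "/data/version.*/*"))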
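Answer 0 (score: 1)
Well, I think this is the quickest solution, and it worked for me. Note that the output is not a single parquet file: Spark writes a directory of part files.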
val src_path = "hdfs:///local/id_*_*/data/version.*/*"
val df = spark.read.parquet(src_path)
df.write.parquet("hdfs:///destination/path/")
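If the list of matching paths itself is needed, as the question asks, Hadoop's FileSystem.globStatus can expand the same pattern without reading any data. The following is only a minimal sketch: the helper name listMatchingFiles is hypothetical and not part of the original post, and it assumes the pattern uses forward slashes, since Hadoop globs treat a backslash as an escape character.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

// Hypothetical helper (not from the original post): expands a glob pattern
// into the list of matching file paths, returned as strings.
def listMatchingFiles(pattern: String,
                      conf: Configuration = new Configuration()): Seq[String] = {
  val globPath = new Path(pattern)
  // Pick the file system from the pattern's scheme (hdfs://, file://, ...),
  // falling back to the configured default when no scheme is given.
  val fs = globPath.getFileSystem(conf)
  // globStatus may return null (for example, for a non-glob path that does
  // not exist), so wrap the result defensively.
  val matches = Option(fs.globStatus(globPath)).getOrElse(Array.empty[FileStatus])
  // Keep plain files only and convert each Path to its string form.
  matches.filter(_.isFile).map(_.getPath.toString).toSeq
}
```

With a reachable cluster, listMatchingFiles("hdfs:///local/id_*_*/data/version.*/*") should return one string per part file, and the result can be handed back to Spark with spark.read.parquet(files: _*), because parquet accepts a variable number of paths.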