In Spark, is it possible to add a suffix to the path after partitioning by a column?
For example, I write data to the following path: /db_name/table_name/dateid=20171009/event_type=TEST/
`dataset.write().partitionBy("event_type").save("/db_name/table_name/dateid=20171009");`
Is it possible, using dynamic partitioning, to have it created as the following? /db_name/table_name/dateid=20171009/event_type=TEST/1507764830
Answer 0 (score: 1)
It turns out newTaskTempFile is the right place to do this. The former approach does not work with dynamic partitions.
@Override
public String newTaskTempFile(TaskAttemptContext taskContext, Option<String> dir, String ext) {
    // Append the timestamp as an extra directory level under the dynamic partition path
    Option<String> dirWithTimestamp = Option.apply(dir.get() + "/" + timestamp);
    return super.newTaskTempFile(taskContext, dirWithTimestamp, ext);
}
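
For completeness, here is a minimal sketch of an enclosing class this override could live in, assuming Spark 2.x: a subclass of SQLHadoopMapReduceCommitProtocol registered via spark.sql.sources.commitProtocolClass (the same hook used in Answer 2 below). The class name TimestampedCommitProtocol and the timestamp field are my assumptions, not the original poster's code:

import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol;
import scala.Option;

// Hypothetical enclosing class; register it with:
//   .config("spark.sql.sources.commitProtocolClass", "com.example.TimestampedCommitProtocol")
public class TimestampedCommitProtocol extends SQLHadoopMapReduceCommitProtocol {

    // Fixed once on the driver and serialized to the executors,
    // so every task writes under the same timestamp directory
    private final long timestamp = System.currentTimeMillis();

    public TimestampedCommitProtocol(String jobId, String path, boolean isAppend) {
        super(jobId, path, isAppend);
    }

    @Override
    public String newTaskTempFile(TaskAttemptContext taskContext, Option<String> dir, String ext) {
        // dir carries the dynamic partition sub-path, e.g. "event_type=TEST";
        // assumes a partitioned write, so dir is defined
        Option<String> dirWithTimestamp = Option.apply(dir.get() + "/" + timestamp);
        return super.newTaskTempFile(taskContext, dirWithTimestamp, ext);
    }
}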
Answer 1 (score: 0)
//sample json
{"event_type": "type_A", "dateid":"20171009", "data":"garbage" }
{"event_type": "type_B", "dateid":"20171008", "data":"garbage" }
{"event_type": "type_A", "dateid":"20171007", "data":"garbage" }
{"event_type": "type_B", "dateid":"20171006", "data":"garbage" }
// save as a partitioned table
spark.read
  .json("./data/sample.json")
  .write
  .partitionBy("dateid", "event_type")
  .saveAsTable("sample")
// result: the table directory is laid out as nested partitions, e.g.
// .../sample/dateid=20171006/event_type=type_B/part-...
// .../sample/dateid=20171007/event_type=type_A/part-...
// .../sample/dateid=20171008/event_type=type_B/part-...
// .../sample/dateid=20171009/event_type=type_A/part-...
Answer 2 (score: 0)
After reading the source code, customizing the FileOutputCommitter is the way to do it.
import static org.apache.spark.sql.functions.rand;

import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
    .builder()
    .master("local[2]")
    // plug in the customized committer and commit protocol (both shown below)
    .config("spark.sql.parquet.output.committer.class", "com.estudio.spark.ESParquetOutputCommitter")
    .config("spark.sql.sources.commitProtocolClass", "com.estudio.spark.ESSQLHadoopMapReduceCommitProtocol")
    .getOrCreate();

ESSQLHadoopMapReduceCommitProtocol.realAppendMode = false;

spark.range(10000)
    .withColumn("type", rand().multiply(6).cast("int"))
    .write()
    .mode(SaveMode.Append)
    .partitionBy("type")
    .format("parquet")
    .save("/tmp/spark/test1/");
Below is the customized ParquetOutputCommitter, which is where the output path gets customized. In this case, we suffix it with a timestamp. We have to make sure the path creation is synchronized. Here is the code:
import lombok.extern.slf4j.Slf4j;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.parquet.hadoop.ParquetOutputCommitter;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

@Slf4j
public class ESParquetOutputCommitter extends ParquetOutputCommitter {

    // Remembers the suffixed path per original output path, so every
    // committer of the same job resolves to the same timestamp directory
    private final static Map<String, Path> pathMap = new HashMap<>();

    public final static synchronized Path getNewPath(final Path path) {
        final String key = path.toString();
        log.debug("path.key: {}", key);
        if (pathMap.containsKey(key)) {
            return pathMap.get(key);
        }
        final Path newPath = new Path(path, Long.toString(System.currentTimeMillis()));
        pathMap.put(key, newPath);
        log.info("---> Path: {}, newPath: {}", path, newPath);
        return newPath;
    }

    public ESParquetOutputCommitter(Path outputPath, TaskAttemptContext context) throws IOException {
        super(getNewPath(outputPath), context);
        log.info("this: {}", this);
    }
}
We can also use the getNewPath method to find out the customized path. So far, this works for SaveMode.Overwrite.
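
As a sketch of that lookup (my illustration, not the original answer), something like the following could run on the driver after the write. Note that pathMap is a per-JVM static, and that getNewPath mints a fresh timestamped entry on a cache miss, so the argument must be exactly the path the committer was constructed with, in the same JVM:

import org.apache.hadoop.fs.Path;

// Hypothetical driver-side lookup of where the data actually landed
Path actualPath = ESParquetOutputCommitter.getNewPath(new Path("/tmp/spark/test1/"));
System.out.println("Data was written under: " + actualPath);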
SaveMode.Append is slightly different; check here. So, to handle Append mode, we need to override SQLHadoopMapReduceCommitProtocol so that it always returns the customized ParquetOutputCommitter. Here is the code:
import lombok.extern.slf4j.Slf4j;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol;
import org.apache.spark.sql.internal.SQLConf;

import java.lang.reflect.Constructor;

@Slf4j
public class ESSQLHadoopMapReduceCommitProtocol extends SQLHadoopMapReduceCommitProtocol {

    // When true, fall back to the stock committer (a real append, no suffix)
    public static boolean realAppendMode = false;

    private String jobId;
    private String path;
    private boolean isAppend;

    public ESSQLHadoopMapReduceCommitProtocol(String jobId, String path, boolean isAppend) {
        super(jobId, path, isAppend);
        this.jobId = jobId;
        this.path = path;
        this.isAppend = isAppend;
    }

    @Override
    public OutputCommitter setupCommitter(TaskAttemptContext context) {
        try {
            OutputCommitter committer = context.getOutputFormatClass().newInstance().getOutputCommitter(context);

            if (realAppendMode) {
                log.info("Using output committer class {}", committer.getClass().getCanonicalName());
                return committer;
            }

            final Configuration configuration = context.getConfiguration();
            final String key = SQLConf.OUTPUT_COMMITTER_CLASS().key();
            final Class<? extends OutputCommitter> clazz;
            clazz = configuration.getClass(key, null, OutputCommitter.class);

            // No user-defined committer configured: keep the default one
            if (clazz == null) {
                log.info("Using output committer class {}", committer.getClass().getCanonicalName());
                return committer;
            }

            log.info("Using user defined output committer class {}", clazz.getCanonicalName());
            if (FileOutputCommitter.class.isAssignableFrom(clazz)) {
                Constructor<? extends OutputCommitter> ctor = clazz.getDeclaredConstructor(Path.class, TaskAttemptContext.class);
                committer = ctor.newInstance(new Path(path), context);
            } else {
                Constructor<? extends OutputCommitter> ctor = clazz.getDeclaredConstructor();
                committer = ctor.newInstance();
            }
            return committer;
        } catch (Exception e) {
            e.printStackTrace();
            return super.setupCommitter(context);
        }
    }
}
A static flag, realAppendMode, is also added so that all of this can be switched off.
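
For instance (an illustrative sketch, not the original poster's code, assuming a Dataset<Row> named df and the imports from the earlier snippet):

// First run: each write lands under a fresh timestamp directory
ESSQLHadoopMapReduceCommitProtocol.realAppendMode = false;
df.write().mode(SaveMode.Append).partitionBy("type").format("parquet").save("/tmp/spark/test1/");

// Later run: behave like a plain append, no timestamp suffix.
// Note: a plain static flag is only visible to executors in local mode,
// where driver and executors share one JVM, as in the master("local[2]") example above.
ESSQLHadoopMapReduceCommitProtocol.realAppendMode = true;
df.write().mode(SaveMode.Append).partitionBy("type").format("parquet").save("/tmp/spark/test1/");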
Again, I'm not a Spark expert yet; please let me know if there are any issues with this solution.