I am using Google Dataflow to read multiple .gz files for processing. The final destination of the data is BigQuery. The BigQuery table has a dedicated column for each column of the CSV files inside the .gz files. The BQ table also has an additional column, file_name, which gives the name of the file the record belongs to. I am reading the files with TextIO.Read and applying a ParDo transform to them. Inside the DoFn, is there a way to identify the name of the file that the incoming string belongs to?
My code looks like this:
PCollection<String> logs = pipeline.apply(TextIO.Read.named("ReadLines")
    .from("gcs path").withCompressionType(TextIO.CompressionType.AUTO));
PCollection<TableRow> formattedResults = logs.apply(
    ParDo.named("Format").of(new DoFn<String, TableRow>() { ... }));
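For context, a minimal sketch of what the Format DoFn body might look like (the CSV column names here are hypothetical); note that nothing in the incoming String tells the DoFn which file the line came from:

new DoFn<String, TableRow>() {
    @Override
    public void processElement(ProcessContext c) throws Exception {
        String[] fields = c.element().split(","); // one CSV line from a .gz file
        TableRow row = new TableRow()
            .set("col_a", fields[0])  // hypothetical column names
            .set("col_b", fields[1]);
        // row.set("file_name", ???); // the source file name is not available here
        c.output(row);
    }
}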
Update 1:
I am now trying the following:
PCollection<String> fileNamesCollection = ...; // this is a collection of file names
GcsIOChannelFactory channelFactory = new GcsIOChannelFactory(options.as(GcsOptions.class));
PCollection<KV<String, String>> kv = fileNamesCollection.apply(ParDo.named("Format").of(
    new DoFn<String, KV<String, String>>() {
        private static final long serialVersionUID = 1L;

        @Override
        public void processElement(ProcessContext c) throws Exception {
            // channelFactory is captured from the enclosing scope here
            ReadableByteChannel readChannel = channelFactory.open(c.element());
            GZIPInputStream gzip = new GZIPInputStream(Channels.newInputStream(readChannel));
            BufferedReader br = new BufferedReader(new InputStreamReader(gzip));
            String line = null;
            while ((line = br.readLine()) != null) {
                c.output(KV.of(c.element(), line));
            }
        }
    }));
But when I run this, I get an error that channelFactory is not serializable. Is there any channel factory that implements the Serializable interface that I could use here?
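One possible workaround, sketched below as an assumption and not the approach I finally used (see Update 2), is to hold the factory in a transient field, which Java serialization skips, and rebuild it lazily on the worker:

new DoFn<String, KV<String, String>>() {
    private transient GcsIOChannelFactory channelFactory; // not serialized with the DoFn

    private GcsIOChannelFactory factory() {
        if (channelFactory == null) {
            // rebuilt on the worker from freshly created options
            channelFactory = new GcsIOChannelFactory(PipelineOptionsFactory.as(GcsOptions.class));
        }
        return channelFactory;
    }
    // processElement then calls factory().open(c.element()) as before
}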
Update 2: I was finally able to run the program and submit the job successfully. Thanks to jkff for the help. Below is my final code; I am pasting it here so that it may help others too.
ProcessLogFilesOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
    .as(ProcessLogFilesOptions.class); // ProcessLogFilesOptions is a custom class
DataflowWorkerLoggingOptions loggingOptions = options.as(DataflowWorkerLoggingOptions.class);
loggingOptions.setDefaultWorkerLogLevel(Level.WARN);
String jobName = "unique_job_name";
options.as(BlockingDataflowPipelineOptions.class).setJobName(jobName);
Pipeline pipeline = Pipeline.create(options);

List<String> filesToProcess = new ArrayList<String>();
for (String fileName : fileNameWithoutHrAndSuffix) { // fileNameWithoutHrAndSuffix has elements like Log_20160921, Log_20160922 etc.
    filesToProcess.addAll((new GcsIOChannelFactory(options.as(GcsOptions.class)))
        .match(LogDestinationStoragePath + fileName));
}
// at this point filesToProcess will have all the log file names, e.g. Log_2016092101.gz, Log_2016092102.gz, ..., Log_2016092201.gz, Log_2016092223.gz

PCollection<String> fileNamesCollection = pipeline.apply(Create.of(filesToProcess));
PCollection<KV<String, String>> kv = fileNamesCollection.apply(ParDo.named("Parsing_Files").of(
    new DoFn<String, KV<String, String>>() {
        private static final long serialVersionUID = 1L;

        @Override
        public void processElement(ProcessContext c) throws Exception {
            // _options has to be created here because Options and GcsIOChannelFactory are not serializable
            ProcessLogFilesOptions _options = PipelineOptionsFactory.as(ProcessLogFilesOptions.class);
            GcsIOChannelFactory channelFactory = new GcsIOChannelFactory(_options.as(GcsOptions.class));
            ReadableByteChannel readChannel = channelFactory.open(c.element());
            GZIPInputStream gzip = new GZIPInputStream(Channels.newInputStream(readChannel));
            BufferedReader br = new BufferedReader(new InputStreamReader(gzip));
            String line = null;
            while ((line = br.readLine()) != null) {
                c.output(KV.of(c.element(), line));
            }
            br.close();
            gzip.close();
            readChannel.close();
        }
    }));
// Performing reshuffling here as suggested, to prevent excessive fusion
PCollection<KV<String, String>> withFileName = kv.apply(Reshuffle.<String, String>of());
PCollection<TableRow> formattedResults = withFileName
    .apply(ParDo.named("Generating_TableRow").of(new DoFn<KV<String, String>, TableRow>() {
        private static final long serialVersionUID = 1L;

        @Override
        public void processElement(ProcessContext c) throws Exception {
            KV<String, String> kv = c.element();
            String logLine = kv.getValue();
            String logFileName = kv.getKey();
            // do further processing as you want here
        }
    }));
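As an illustration of the elided processing inside processElement above, a minimal sketch with hypothetical column names; the key point is that logFileName is now available to populate the file_name column:

String[] fields = logLine.split(",");   // one CSV line from the .gz file
TableRow row = new TableRow()
    .set("col_a", fields[0])            // hypothetical CSV columns
    .set("col_b", fields[1])
    .set("file_name", logFileName);     // the file this record came from
c.output(row);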
// Finally, write formattedResults to the BigQuery table
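A minimal sketch of that final write, assuming the Dataflow SDK 1.x BigQueryIO API; the table reference and schema fields here are hypothetical:

TableSchema schema = new TableSchema().setFields(Arrays.asList(
    new TableFieldSchema().setName("file_name").setType("STRING")
    // ... one TableFieldSchema per CSV column ...
));
formattedResults.apply(BigQueryIO.Write.named("WriteToBQ")
    .to("my-project:my_dataset.my_table") // hypothetical table reference
    .withSchema(schema)
    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
pipeline.run();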
Answer (score: 1)
Right now, the answer is no. If you need access to the file names then, unfortunately, your best bet in this case is to implement the filepattern expansion and file parsing yourself (as a ParDo). Here are some things to keep in mind:

- Make sure to insert a redistribute before the ParDo, to prevent excessive fusion.
- You can use GcsIOChannelFactory to expand the filepattern (see the examples in this question) and to open a ReadableByteChannel. Use Channels.newInputStream to create an InputStream, then wrap it into Java's standard GZIPInputStream and read it line by line; see this question for an example. Remember to close the streams.

Alternatively, you may consider writing your own file-based source. However, in this particular case (.gz files), I would recommend against it, because that API is primarily intended for files that can be read with random access from arbitrary offsets.