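I am parsing logs from Pub/Sub and want to write them to hourly files in a custom location, where the hour is taken from the log timestamp (a field in the Pub/Sub log message).
Each file should receive all the data for its hour, and should keep being appended to as more data for that hour arrives. For example: gs://bucket/applog/2017-09-27/application1/app-2017-09-27-11H.log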
pushFilePColl.apply(Window.into(new FileTextIOWindowFn()))
    .apply("FileTO to LOG TextIO", ParDo.of(new TextIOWriteDoFn()))
    .apply(TextIO.write()
        .to(pipelineOptions.getFileStorageBucket())
        .withWindowedWrites()
        .withFilenamePolicy(new FileStorageFileNamePolicy(logTypeEnum))
        .withNumShards(10));
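To make the target layout concrete, here is a minimal sketch of how the example path could be derived from a log timestamp. It uses only java.time and is independent of Beam; the class name HourlyLogPath, the method pathFor, and the sample timestamp are hypothetical names chosen for illustration, and UTC is assumed as the time zone.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: maps a log timestamp to the hourly file path described above.
final class HourlyLogPath {
    // Assumption: days and hours are computed in UTC.
    private static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);
    private static final DateTimeFormatter HOUR =
        DateTimeFormatter.ofPattern("HH").withZone(ZoneOffset.UTC);

    // Builds <root>/<day>/<application>/app-<day>-<hour>H.log
    static String pathFor(String root, String application, Instant logTime) {
        String day = DAY.format(logTime);
        String hour = HOUR.format(logTime);
        return root + "/" + day + "/" + application + "/app-" + day + "-" + hour + "H.log";
    }

    public static void main(String[] args) {
        Instant logTime = Instant.parse("2017-09-27T11:42:07.123Z");
        System.out.println(pathFor("gs://bucket/applog", "application1", logTime));
    }
}
```

Every log line whose timestamp falls within the same UTC hour maps to the same path, which is the grouping the hourly files are meant to express.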
Custom window:
public class FileTextIOWindowFn extends NonMergingWindowFn<Object, IntervalWindow> {

    private static final long serialVersionUID = 1L;

    private IntervalWindow assignWindow(AssignContext context) {
        FilePushTO filePushTO = (FilePushTO) context.element();
        String timestamp = filePushTO.getLogTime();
        DateTimeFormatter formatter = DateTimeFormat.forPattern(CommonConstants.DATE_FORMAT_YYYYMMDD_HHMMSS_SSS)
                .withZoneUTC();
        Instant start_point = Instant.parse(timestamp, formatter);
        Calendar cal = DateUtil.getCurrentDateInUTC();
        SimpleDateFormat DATE_FORMATER_PARTITION_NAME = DateUtil.getDateFormater();
        Instant end_point = Instant.parse(DATE_FORMATER_PARTITION_NAME.format(cal.getTime()), formatter);
        return new IntervalWindow(start_point, end_point);
    }

    @Override
    public Coder<IntervalWindow> windowCoder() {
        return IntervalWindow.getCoder();
    }

    @Override
    public Collection<IntervalWindow> assignWindows(AssignContext c) throws Exception {
        return Arrays.asList(assignWindow(c));
    }

    @Override
    public boolean isCompatible(WindowFn<?, ?> other) {
        return false;
    }

    @Override
    public WindowMappingFn<IntervalWindow> getDefaultWindowMappingFn() {
        throw new IllegalArgumentException(
                "Attempted to get side input window for GlobalWindow from non-global WindowFn");
    }
}
Filename policy:
public class FileStorageFileNamePolicy extends FileBasedSink.FilenamePolicy {

    private static final long serialVersionUID = 1L;
    private static Logger LOGGER = LoggerFactory.getLogger(FileStorageFileNamePolicy.class);
    private LogTypeEnum logTypeEnum;

    public FileStorageFileNamePolicy(LogTypeEnum logTypeEnum) {
        this.logTypeEnum = logTypeEnum;
    }

    @Override
    public ResourceId windowedFilename(ResourceId outputDirectory, WindowedContext context, String extension) {
        IntervalWindow window = (IntervalWindow) context.getWindow();
        String startDate = window.start().toString();
        String dateString = startDate.replace("T", CommonConstants.SPACE)
                .replaceAll(startDate.substring(startDate.indexOf("Z")), CommonConstants.EMPTY_STRING);
        String startDateHour = startDate;
        try {
            startDate = DateUtil.getDateForFileStore(dateString, null);
            startDateHour = DateUtil.getDTLocalTZHour(dateString, null);
        } catch (ParseException e) {
            LOGGER.error("Error converting date : {}", e);
        }
        String filename = new StringBuilder(window.start().toString()).append(CommonConstants.COLON)
                .append(startDateHour).append(CommonConstants.UNDER_SCORE).append(context.getShardNumber())
                .append(".txt").toString();
        String dirName = new StringBuilder(startDate).append(CommonConstants.FORWARD_SLASH)
                .append(logTypeEnum.getValue().toLowerCase()).append(CommonConstants.FORWARD_SLASH).toString();
        LOGGER.info("Directory : {} and File Name : {}", dirName, filename);
        return outputDirectory.resolve(dirName, ResolveOptions.StandardResolveOptions.RESOLVE_DIRECTORY)
                .resolve(filename, ResolveOptions.StandardResolveOptions.RESOLVE_FILE);
    }

    @Override
    public ResourceId unwindowedFilename(ResourceId outputDirectory, Context context, String extension) {
        throw new UnsupportedOperationException("Unsupported.");
    }
}
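I created the custom window using IntervalWindow so that I can get the proper timestamp inside the FilenamePolicy. I cannot use FixedWindows, because it always gives me the current timestamp.
Everything works fine here, except that the files are not appended to. They get overwritten.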
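For reference, the custom WindowFn above appears to assign each element a window that starts at its own log time and ends at the current time, so two log lines from the same hour generally end up in different windows. A minimal sketch of an hour-aligned interval, where every timestamp within an hour maps to the same [start, end) pair, could look like the following. It uses java.time rather than Beam's window classes, and the names HourWindow and of are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical value class: a half-open, hour-aligned interval [start, end).
final class HourWindow {
    final Instant start;
    final Instant end;

    private HourWindow(Instant start, Instant end) {
        this.start = start;
        this.end = end;
    }

    // Truncates the log time to the start of its UTC hour and adds one hour for the end.
    static HourWindow of(Instant logTime) {
        Instant start = logTime.truncatedTo(ChronoUnit.HOURS);
        return new HourWindow(start, start.plus(Duration.ofHours(1)));
    }

    public static void main(String[] args) {
        HourWindow a = HourWindow.of(Instant.parse("2017-09-27T11:05:00Z"));
        HourWindow b = HourWindow.of(Instant.parse("2017-09-27T11:59:59.999Z"));
        // Both timestamps fall in the same hour, so both intervals are identical.
        System.out.println(a.start + " .. " + a.end);
        System.out.println(a.start.equals(b.start) && a.end.equals(b.end));
    }
}
```

With boundaries like these, all elements of one hour share a single window, so a filename derived from the window start would be the same for the whole hour instead of changing per element.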
Answer 0 (score: 1)
You can do this with TextIO.write().to(...).withWindowedWrites(), which is available in Beam 2.1. See the TextIO javadoc.