我有一个包含99个文件的目录,我想读取这些文件,然后将其哈希成sha256校验和。我最终希望将它们输出到具有键值对的JSON文件中,例如(文件1,092180x0123)。目前,我在将ParDo函数传递给可读文件时遇到了麻烦,我一定很容易丢失一些东西。这是我第一次使用Apache Beam,因此任何帮助都将是惊人的。这是我到目前为止所拥有的
public class BeamPipeline {
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
p
.apply("Match Files", FileIO.match().filepattern("../testdata/input-*"))
.apply("Read Files", FileIO.readMatches())
.apply("Hash File",ParDo.of(new DoFn<FileIO.ReadableFile, KV<FileIO.ReadableFile, String>>() {
@ProcessElement
public void processElement(@Element FileIO.ReadableFile file, OutputReceiver<KV<FileIO.ReadableFile, String>> out) throws
NoSuchAlgorithmException, IOException {
// File -> Bytes
String strfile = file.toString();
byte[] byteFile = strfile.getBytes();
// SHA-256
MessageDigest md = MessageDigest.getInstance("SHA-256");
byte[] messageDigest = md.digest(byteFile);
BigInteger no = new BigInteger(1, messageDigest);
String hashtext = no.toString(16);
while(hashtext.length() < 32) {
hashtext = "0" + hashtext;
}
out.output(KV.of(file, hashtext));
}
}))
.apply(FileIO.write());
p.run();
}
}
答案 0 :(得分:1)
一个包含匹配文件名(来自MetadataResult
的KV对和整个文件的对应SHA-256(而不是逐行读取)的示例:
p
.apply("Match Filenames", FileIO.match().filepattern(options.getInput()))
.apply("Read Matches", FileIO.readMatches())
.apply(MapElements.via(new SimpleFunction <ReadableFile, KV<String,String>>() {
public KV<String,String> apply(ReadableFile f) {
String temp = null;
try{
temp = f.readFullyAsUTF8String();
}catch(IOException e){
}
String sha256hex = org.apache.commons.codec.digest.DigestUtils.sha256Hex(temp);
return KV.of(f.getMetadata().resourceId().toString(), sha256hex);
}
}
))
.apply("Print results", ParDo.of(new DoFn<KV<String, String>, Void>() {
@ProcessElement
public void processElement(ProcessContext c) {
Log.info(String.format("File: %s, SHA-256: %s ", c.element().getKey(), c.element().getValue()));
}
}
));
完整代码here。在我的情况下,输出为:
Apr 21, 2019 10:02:21 PM com.dataflow.samples.DataflowSHA256$2 processElement
INFO: File: /home/.../data/file1, SHA-256: e27cf439835d04081d6cd21f90ce7b784c9ed0336d1aa90c70c8bb476cd41157
Apr 21, 2019 10:02:21 PM com.dataflow.samples.DataflowSHA256$2 processElement
INFO: File: /home/.../data/file2, SHA-256: 72113bf9fc03be3d0117e6acee24e3d840fa96295474594ec8ecb7bbcb5ed024
我使用在线哈希tool进行了验证:
顺便说一句,我认为您不需要OutputReceiver
来获得单个输出(无侧面输出)。感谢以下有用的问题/解答:1,2,3。