Using TextIO.Read

Asked: 2016-12-26 07:13:15

Tags: google-cloud-dataflow

Hi there! I am new to Cloud Dataflow.

I use DataflowPipelineRunner to read a CSV file and output the result to BigQuery. The pipeline works when the CSV file is small (only 20 records, less than 1 MB), but an OOM error occurs when the file gets larger (over 10 million records, about 616.42 MB).

Here is the error message:


java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at com.google.cloud.dataflow.sdk.util.StreamUtils.getBytes(StreamUtils.java:63)
    at co.coder.MyCoder.decode(MyCoder.java:54)
    at co.coder.MyCoder.decode(MyCoder.java:1)
    at com.google.cloud.dataflow.sdk.io.TextIO$TextSource$TextBasedReader.decodeCurrentElement(TextIO.java:1065)
    at com.google.cloud.dataflow.sdk.io.TextIO$TextSource$TextBasedReader.readNextRecord(TextIO.java:1052)
    at com.google.cloud.dataflow.sdk.io.FileBasedSource$FileBasedReader.advanceImpl(FileBasedSource.java:536)
    at com.google.cloud.dataflow.sdk.io.OffsetBasedSource$OffsetBasedReader.advance(OffsetBasedSource.java:287)
    at com.google.cloud.dataflow.sdk.runners.worker.WorkerCustomSources$BoundedReaderIterator.advance(WorkerCustomSources.java:541)
    at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation$SynchronizedReaderIterator.advance(ReadOperation.java:425)
    at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:217)
    at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.start(ReadOperation.java:182)
    at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:69)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.executeWork(DataflowWorker.java:284)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.doWork(DataflowWorker.java:220)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorker.getAndPerformWork(DataflowWorker.java:170)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.doWork(DataflowWorkerHarness.java:192)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:172)
    at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerHarness$WorkerThread.call(DataflowWorkerHarness.java:159)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

From the error message, the error occurs at [MyCoder.java:54]. MyCoder is a subclass of CustomCoder that I implemented, which converts the CSV file from Shift_JIS to UTF-8:

53:@Override
54:public String decode(InputStream inStream, Context context) throws CoderException, IOException {
55:    if (context.isWholeStream) {
56:        byte[] bytes = StreamUtils.getBytes(inStream);
57:        return new String(bytes, Charset.forName("Shift_JIS"));
58:    } else {
59:        try {
60:            return readString(new DataInputStream(inStream));
61:        } catch (EOFException | UTFDataFormatException exn) {
62:            // These exceptions correspond to decoding problems, so change
63:            // what kind of exception they're branded as.
64:            throw new CoderException(exn);
65:        }
66:    }
67:}

Also, this is how I run the DataflowPipelineRunner:

DataflowPipelineOptions options = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
options.setRunner(DataflowPipelineRunner.class);
options.setProject(projectId);
options.setStagingLocation(stagingFolderPathInGCS);
options.setWorkerMachineType("n1-highmem-4");
options.setMaxNumWorkers(5);
Pipeline p = Pipeline.create(options);
// read csv from gcs
PCollection<String> lines = p.apply(TextIO.Read.named("csv input")
        .from("gs://" + bucketName + "/original/" + fileName).withCoder(MyCoder.of()));
lines.apply(TextIO.Write.named("csv output").to("gs://" + bucketName + "/encoded/" + fileName)
        .withCoder(StringUtf8Coder.of()).withoutSharding().withHeader("test Header"));
p.run();

Since Dataflow is a scalable cloud service aimed at big data, I am a bit confused by this OOM error. Could someone explain to me why the [OutOfMemoryError] happened and how to solve it?

Thanks a lot!

1 answer:

Answer 0 (score: 1)

I do not quite understand why, but I solved the problem below:


but an OOM error occurred when the file got larger (over 10 million records, about 616.42 MB).

That was because I made the test data simply from the smaller file (only 20 records, less than 1 MB), so the 10 million records contained only 20 distinct keys. I therefore switched to test data with many distinct keys (without so much repeated data).

In addition, I followed Kenn Knowles's advice and let Dataflow manage its work and instances automatically by removing the following code:

withoutSharding()
options.setWorkerMachineType("n1-highmem-4");
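For reference, a sketch of the adjusted pipeline with those two calls removed; this final version is inferred from the removals described above rather than quoted from the original post:

DataflowPipelineOptions options = PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
options.setRunner(DataflowPipelineRunner.class);
options.setProject(projectId);
options.setStagingLocation(stagingFolderPathInGCS);
// No setWorkerMachineType(...): the service picks the machine type itself.
options.setMaxNumWorkers(5);
Pipeline p = Pipeline.create(options);

// read csv from gcs
PCollection<String> lines = p.apply(TextIO.Read.named("csv input")
        .from("gs://" + bucketName + "/original/" + fileName).withCoder(MyCoder.of()));
// No withoutSharding(): letting the service choose the number of output shards
// allows it to parallelize the write and rebalance work dynamically.
lines.apply(TextIO.Write.named("csv output").to("gs://" + bucketName + "/encoded/" + fileName)
        .withCoder(StringUtf8Coder.of()).withHeader("test Header"));
p.run();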

Finally the Dataflow job ran well (the machine type was chosen automatically as n1-standard-1)!

For more about Dataflow's dynamic work rebalancing, see: https://cloud.google.com/dataflow/service/dataflow-service-desc#Autotuning