Question

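I am trying to write an XML file whose source is a text file stored in GCS. The code runs fine, but instead of a single XML file it generates multiple XML files. (The number of XML files seems to follow the total number of records in the source text file.) I observed this when using the 'DataflowRunner'.

When I run the same code locally, two files are generated. The first contains all the records with the proper elements, and the second contains only the opening and closing root elements.

Any idea why this unexpected behavior occurs? Please find the code snippet I am using below: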

PCollection<String> input_records = p.apply(TextIO.read().from("gs://balajee_test/xml_source.txt"));

PCollection<XMLFormatter> input_object = input_records.apply(ParDo.of(new DoFn<String, XMLFormatter>() {
    @ProcessElement
    public void processElement(ProcessContext c) {
        // Each input line is a comma-separated record; the first five fields are used.
        String[] elements = c.element().split(",");
        c.output(new XMLFormatter(elements[0], elements[1], elements[2], elements[3], elements[4]));
        System.out.println("Values to be written have been provided to constructor");
    }
})).setCoder(AvroCoder.of(XMLFormatter.class));

input_object.apply(XmlIO.<XMLFormatter>write()
        .withRecordClass(XMLFormatter.class)
        .withRootElement("library")
        .to("gs://balajee_test/book_output"));

Please let me know how to produce a single XML file (book_output.xml) at the output.

Answer 1

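The to() method of XmlIO.write() is documented as follows:

/**
 * Writes to files with the given path prefix.
 *
 * <p>Output files will have the name {@literal {filenamePrefix}-0000i-of-0000n.xml} where n is
 * the number of output bundles.
 */

That is, it is expected that it may produce multiple files: for example, if the runner chooses to process your data by parallelizing it into 3 tasks ("bundles"), you will get 3 files. In some cases some of the parts may turn out empty, but the total data written will always add up to the expected data.

If your data is not particularly large, asking the IO to produce exactly one file is a reasonable request. It is supported in TextIO and AvroIO via .withoutSharding(), but is not yet supported in XmlIO. Please feel free to file a feature request for it.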
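To make the naming pattern from the Javadoc concrete, here is a minimal sketch in plain Java (no Beam dependency) that prints the file names one would expect for a given number of bundles. The class ShardNames and its methods are hypothetical helpers written for this illustration, not part of Beam, and the five-digit, zero-based numbering is an assumption read off the {filenamePrefix}-0000i-of-0000n.xml pattern quoted in the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Apache Beam: it only mimics the
// {filenamePrefix}-0000i-of-0000n.xml pattern quoted from the Javadoc.
public class ShardNames {

    // Builds the name of one shard, assuming zero-based, five-digit indices.
    public static String shardName(String prefix, int index, int total, String suffix) {
        return String.format(Locale.ROOT, "%s-%05d-of-%05d%s", prefix, index, total, suffix);
    }

    // Lists the names of all shards for a job that was split into `total` bundles.
    public static List<String> allShardNames(String prefix, int total, String suffix) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            names.add(shardName(prefix, i, total, suffix));
        }
        return names;
    }

    public static void main(String[] args) {
        // With 3 bundles, the prefix from the question yields 3 output files.
        for (String name : allShardNames("gs://balajee_test/book_output", 3, ".xml")) {
            System.out.println(name);
        }
    }
}
```

Under these assumptions, three bundles give book_output-00000-of-00003.xml through book_output-00002-of-00003.xml, which matches the behavior the question reports: one file per bundle rather than a single book_output.xml.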

Multiple files generated when writing XML via Apache Beam
