Question

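I am reading in text files with 50k rows of data, where each row represents a complete record.

Our NiFi flow uses SplitText to handle the file in batches of 1000 rows. (I'm told this was set up before my time because of memory issues.)

Is it possible to have PutFile execute immediately? I want each file to be written out by PutFile as soon as its split is done, rather than sitting in the queue until all 50k+ rows have been processed. Waiting seems rather pointless if the file is being split up anyway.

I have been reading the documentation, but I cannot find whether this is by design and not configurable.

I would appreciate any documentation or guidance that helps me answer this or configure my flow.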

Answer 1

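TL;DR: A workaround is to use multiple SplitText processors, for example the first one splitting into 10k rows and the second splitting into 1000 rows. The first 10k rows will then be split into 10 flow files and sent downstream while the second SplitText is processing the next 10k rows.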
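
To make the arithmetic concrete, here is a minimal plain-Groovy sketch (not NiFi code) that mimics the two-stage split on an in-memory list. The helper name `splitLines`, the sample data, and the chunk sizes are assumptions for illustration only; in a real flow each SplitText is configured through its Line Split Count property rather than through code.

```groovy
// Hypothetical helper: cut a list of lines into chunks of at most `size` lines,
// roughly what one SplitText pass does to a single flow file.
List<List<String>> splitLines(List<String> lines, int size) {
    lines.collate(size)
}

// Stand-in for the 50k-line input file.
List<String> input = (1..50000).collect { "record-${it}".toString() }

// Stage 1: 50,000 lines become 5 chunks of 10,000 lines.
List<List<String>> stage1 = splitLines(input, 10000)

// Stage 2 runs once per stage-1 chunk, so its 10 outputs can move on
// while later stage-1 chunks are still waiting to be split.
List<List<List<String>>> stage2 = stage1.collect { chunk -> splitLines(chunk, 1000) }

println "stage 1 chunks: ${stage1.size()}"
println "stage 2 chunks per stage-1 chunk: ${stage2[0].size()}"
println "total 1000-line chunks: ${stage2.sum { it.size() }}"
```

The point of the sketch is only that the second stage works on one 10k chunk at a time, so downstream processors see output after each chunk instead of after the whole file.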

EDIT: Adding another workaround, a Groovy script to be used in InvokeScriptedProcessor:

// Explicit imports for the NiFi types this script depends on.
import org.apache.nifi.components.PropertyDescriptor
import org.apache.nifi.components.ValidationContext
import org.apache.nifi.components.ValidationResult
import org.apache.nifi.logging.ComponentLog
import org.apache.nifi.processor.*
import org.apache.nifi.processor.exception.ProcessException
import org.apache.nifi.processor.io.OutputStreamCallback

class GroovyProcessor implements Processor {
    def REL_SUCCESS = new Relationship.Builder().name("success").description('FlowFiles that were successfully processed are routed here').build()
    def REL_FAILURE = new Relationship.Builder().name("failure").description('FlowFiles that were not successfully processed are routed here').build()
    def REL_ORIGINAL = new Relationship.Builder().name("original").description('After processing, the original incoming FlowFiles are routed here').build()
    ComponentLog log

    void initialize(ProcessorInitializationContext context) { log = context.logger }
    Set<Relationship> getRelationships() { return [REL_FAILURE, REL_SUCCESS, REL_ORIGINAL] as Set }
    Collection<ValidationResult> validate(ValidationContext context) { null }
    PropertyDescriptor getPropertyDescriptor(String name) { null }
    void onPropertyModified(PropertyDescriptor descriptor, String oldValue, String newValue) { }
    List<PropertyDescriptor> getPropertyDescriptors() { null }
    String getIdentifier() { null }

    void onTrigger(ProcessContext context, ProcessSessionFactory sessionFactory) throws ProcessException {
        // session1 owns the incoming flow file; session2 is committed once per line,
        // so each output leaves the processor without waiting for the rest.
        def session1 = sessionFactory.createSession()
        def session2 = sessionFactory.createSession()
        try {
            def inFlowFile = session1.get()
            if (!inFlowFile) return
            def inputStream = session1.read(inFlowFile)
            inputStream.eachLine { line ->
                // Created in session2, so the output has no lineage link to inFlowFile.
                def outFlowFile = session2.create()
                outFlowFile = session2.write(outFlowFile, { outputStream ->
                    outputStream.write(line.bytes)
                } as OutputStreamCallback)
                session2.transfer(outFlowFile, REL_SUCCESS)
                // Committing here releases this one-line flow file downstream immediately.
                session2.commit()
            }
            inputStream.close()
            // Only after every line has been emitted is the original routed onward.
            session1.transfer(inFlowFile, REL_ORIGINAL)
            session1.commit()
        } catch (final Throwable t) {
            log.error('{} failed to process due to {}; rolling back session', [this, t] as Object[])
            // Lines already committed through session2 stay downstream; only uncommitted work is undone.
            session2.rollback(true)
            session1.rollback(true)
            throw t
        }
    }
}

processor = new GroovyProcessor()

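For completeness:

The Split processors were designed to support the Split/Merge pattern. In order to merge the splits back together later, each of them needs the same "parent ID" as well as the total count.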
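
As an illustration of why the count matters, the sketch below models split flow files as plain Groovy maps carrying `fragment.identifier`, `fragment.index` and `fragment.count`, the attribute names that SplitText writes and that MergeContent's Defragment strategy reads. The `makeSplits` and `canMerge` helpers and the sample values are hypothetical; this is a simplified model, not NiFi code.

```groovy
// Hypothetical model: each split is a map of attributes standing in for a flow file.
List<Map<String, String>> makeSplits(String parentId, int total) {
    (1..total).collect { i ->
        ['fragment.identifier': parentId,
         'fragment.index'     : i.toString(),
         'fragment.count'     : total.toString()]
    }
}

// Hypothetical check: a group can be merged only when every fragment
// announced by fragment.count has arrived for the same parent ID.
boolean canMerge(List<Map<String, String>> arrived) {
    if (arrived.isEmpty()) return false
    def ids = arrived.collect { it['fragment.identifier'] }.unique()
    def expected = arrived[0]['fragment.count']
    // A missing count means the splitter emitted before it knew the total.
    if (ids.size() != 1 || expected == null) return false
    arrived.size() == expected.toInteger()
}

def splits = makeSplits('parent-1', 50)
// Same splits, but emitted before the total was known (no fragment.count).
def early = splits.collect { m -> m.findAll { k, v -> k != 'fragment.count' } }

println canMerge(splits)           // all 50 fragments present
println canMerge(splits.take(49))  // one fragment still missing
println canMerge(early)            // total never recorded
```

A splitter that releases flow files before it has read the whole input is in the position of the `early` case: it cannot stamp the total on the fragments it has already sent.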

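If you send out flow files before you have split everything, you won't know the total count and won't be able to merge them back together later. Also, if something goes wrong during split processing, you may want to "roll back" the operation, rather than having some flow files already downstream and the rest sent to failure.

In order to send out some flow files before all processing is finished, you have to "commit the process session". This prevents you from doing the things above, and it creates a break in the provenance for the incoming flow file, because that file has to be committed/transferred in the session that originally took it in. All subsequent commits need newly created flow files, which breaks the provenance/lineage chain.

Although there is an open Jira for this (NIFI-2878), there has been some dissent on the mailing lists and in pull requests about adding this feature to processors that accept input (i.e. non-source processors). NiFi's framework is fairly transactional, and this kind of feature runs against that.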

Apache NiFi: when using SplitText on large files, how can I make PutFile write files out immediately?
