Java: split a stream into a stream of streams by a predicate

Time: 2018-03-27 23:57:22

Tags: java split java-stream lazy-evaluation predicate

I have hundreds of large (6 GB) gzipped log files that I read using GZIPInputStreams and want to parse. Suppose each one has the following format:

Start of log entry 1
    ...some log details
    ...some log details
    ...some log details
Start of log entry 2
    ...some log details
    ...some log details
    ...some log details
Start of log entry 3
    ...some log details
    ...some log details
    ...some log details

I stream the gzipped file contents line by line through BufferedReader.lines(). The stream looks like this:

[
    "Start of log entry 1",
    "    ...some log details",
    "    ...some log details",
    "    ...some log details",
    "Start of log entry 2",
    "    ...some log details",
    "    ...some log details",
    "    ...some log details",
    "Start of log entry 3",
    "    ...some log details",
    "    ...some log details",
    "    ...some log details",
]
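For reference, a minimal sketch of how such a lazy line stream might be opened from a gzipped file. The class name GzipLines and its lines method are hypothetical, and UTF-8 is an assumed encoding; the question does not show this part.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not from the question.
class GzipLines {
    // Opens a gzipped text file as a lazy stream of lines (UTF-8 is assumed).
    // Closing the returned stream closes the underlying reader.
    static Stream<String> lines(Path gzFile) {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(
                    new GZIPInputStream(Files.newInputStream(gzFile)),
                    StandardCharsets.UTF_8));
            return reader.lines().onClose(() -> {
                try {
                    reader.close();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Used in a try-with-resources block, for example try (Stream<String> lines = GzipLines.lines(path)) { ... }, the file is decompressed only as far as the stream is consumed.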

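The start of each log entry can be identified by the predicate line -> line.startsWith("Start of log entry"). I would like to transform this Stream<String> into a Stream<Stream<String>> according to this predicate. Each "substream" should start when the predicate is true and collect lines while the predicate is false, until the next time the predicate is true, which marks the end of this substream and the start of the next. The result would look like this: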

[
    [
        "Start of log entry 1",
        "    ...some log details",
        "    ...some log details",
        "    ...some log details",
    ],
    [
        "Start of log entry 2",
        "    ...some log details",
        "    ...some log details",
        "    ...some log details",
    ],
    [
        "Start of log entry 3",
        "    ...some log details",
        "    ...some log details",
        "    ...some log details",
    ],
]

From there, I can take each substream and map it through new LogEntry(Stream<String> logLines), so that related log lines are aggregated into LogEntry objects.
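The LogEntry class itself is not shown anywhere in this thread. A minimal sketch of what it might look like, assuming the first line is the header and the remaining lines are details (the fields, accessors, and toString format are all hypothetical):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical LogEntry: the question does not show this class.
class LogEntry {
    private final String header;        // the "Start of log entry" line
    private final List<String> details; // the lines that follow it, trimmed

    LogEntry(Stream<String> logLines) {
        List<String> lines = logLines.collect(Collectors.toList());
        if (lines.isEmpty()) {
            throw new IllegalArgumentException("a log entry needs at least one line");
        }
        this.header = lines.get(0);
        this.details = lines.stream()
                .skip(1)
                .map(String::trim)
                .collect(Collectors.toList());
    }

    String header() { return header; }

    List<String> details() { return details; }

    @Override
    public String toString() {
        return header + " " + details;
    }
}
```

Note that the constructor consumes the substream it is given, so only the lines of one entry are held in memory at a time.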

Here is a rough idea of how that would look:

import java.io.*;
import java.nio.charset.*;
import java.util.*;
import java.util.function.*;
import java.util.stream.*;

import static java.lang.System.out;

class Untitled {
    static final String input = 
        "Start of log entry 1\n" +
        "    ...some log details\n" +
        "    ...some log details\n" +
        "    ...some log details\n" +
        "Start of log entry 2\n" +
        "    ...some log details\n" +
        "    ...some log details\n" +
        "    ...some log details\n" +
        "Start of log entry 3\n" +
        "    ...some log details\n" +
        "    ...some log details\n" +
        "    ...some log details";

    static final Predicate<String> isLogEntryStart = line -> line.startsWith("Start of log entry"); 

    public static void main(String[] args) throws Exception {
        try (ByteArrayInputStream gzipInputStream
        = new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8)); // mock for fileInputStream based gzipInputStream
             InputStreamReader inputStreamReader = new InputStreamReader( gzipInputStream ); 
             BufferedReader reader = new BufferedReader( inputStreamReader )) {

            reader.lines()
                .splitByPredicate(isLogEntryStart) // <--- What witchcraft should go here?
                .map(LogEntry::new)
                .forEach(out::println);
        }
    }
}

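Constraint: I have hundreds of these large files to process in parallel (but only a single sequential stream per file), which makes loading them fully into memory (e.g. by storing them as a List<String> lines) infeasible.

Any help is appreciated!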

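2 Answers:

Answer 0 (score: 2)

I think the main problem is that you are reading line by line and trying to create a LogEntry instance out of lines, instead of reading block by block (where a block may span many lines).

For this, you can use Scanner.findAll (available since Java 9) with a proper regex: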

String input =
        "Start of log entry 1\n"        +
        "    ...some log details 1.1\n" +
        "    ...some log details 1.2\n" +
        "    ...some log details 1.3\n" +
        "Start of log entry 2\n"        +
        "    ...some log details 2.1\n" +
        "    ...some log details 2.2\n" +
        "    ...some log details 2.3\n" +
        "Start of log entry 3\n"        +
        "    ...some log details 3.1\n" +
        "    ...some log details 3.2\n" +
        "    ...some log details 3.3";

try (ByteArrayInputStream gzip = 
         new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8));
     InputStreamReader reader = new InputStreamReader(gzip);
     Scanner scanner = new Scanner(reader)) {

    String START = "Start of log entry \\d+";
    Pattern pattern = Pattern.compile(
            START + "(?<=" + START + ").*?(?=" + START + "|$)", 
            Pattern.DOTALL);

    scanner.findAll(pattern)
            .map(MatchResult::group)
            .map(s -> s.split("\\R"))
            .map(LogEntry::new)
            .forEach(System.out::println);

} catch (IOException e) {
    throw new UncheckedIOException(e);
}

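So, this works by lazily finding matches in the Scanner instance. Scanner.findAll returns a Stream<MatchResult>, and MatchResult.group() returns the matched String. We then split this string on line breaks (\\R). This returns a String[] in which each element is one line. Then, assuming LogEntry has a constructor that accepts a String[] argument, we transform each of these arrays into a LogEntry instance. Finally, assuming LogEntry overrides toString(), we print each LogEntry instance to the output.

It is worth mentioning that the Scanner only starts its work when forEach is invoked on the stream.

A note apart is the regex we use to match log entries in the input. I'm not an expert in the regex world, so I'm almost sure there is quite some room for improvement here. First of all, we use Pattern.DOTALL so that . matches not only common characters but also line breaks. Then there is the actual regex. The idea is that it matches and consumes Start of log entry \\d+, then uses a look-behind against Start of log entry \\d+, then consumes characters from the input in a non-greedy manner (this is the .*? part), and finally looks ahead to check whether there is another occurrence of Start of log entry \\d+ or whether the end of the input has been reached. If you want to dig into this subject, please refer to this amazing article about regular expressions.

If you're not on Java 9+, I don't know of any similar alternative. What you could do, though, is create a custom Spliterator that wraps the Spliterator of the stream returned by BufferedReader.lines() and adds the desired parsing behavior to it. Then you would need to create a new Stream out of this Spliterator. Not a trivial task at all...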
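To see what the pattern captures, here is a small sketch that runs it with a plain Matcher over an in-memory String. It is illustrative only: it does not meet the memory constraint from the question, and the class name EntryRegexDemo is made up. It also drops the look-behind, on the assumption that a look-behind for START is redundant immediately after START has been consumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the answer.
class EntryRegexDemo {
    static final String START = "Start of log entry \\d+";

    // Consume START, then lazily consume anything (DOTALL lets '.' cross
    // line breaks) until the next START or the end of the input is ahead.
    static final Pattern ENTRY = Pattern.compile(
            START + ".*?(?=" + START + "|$)", Pattern.DOTALL);

    // Returns each matched entry as an array of its lines.
    static List<String[]> entries(String input) {
        List<String[]> result = new ArrayList<>();
        Matcher matcher = ENTRY.matcher(input);
        while (matcher.find()) {
            result.add(matcher.group().split("\\R"));
        }
        return result;
    }
}
```

Each match ends just before the next entry header, so the trailing line break belongs to the match; split("\\R") discards the resulting trailing empty string.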
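A minimal sketch of what such a Spliterator could look like on Java 8. GroupingSpliterator and its split method are hypothetical names, and this is one possible design rather than a definitive implementation. It emits each group as a List<T> instead of a Stream<T>, so one entry at a time is buffered in memory; any lines before the first match form a group of their own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.function.Predicate;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical helper, not part of the JDK: groups consecutive elements
// into lists, starting a new list whenever the predicate matches.
class GroupingSpliterator<T> extends Spliterators.AbstractSpliterator<List<T>> {
    private final Spliterator<T> source;
    private final Predicate<? super T> startsGroup;
    private List<T> current = new ArrayList<>(); // group being built
    private List<T> finished;                    // completed group waiting to be emitted

    GroupingSpliterator(Spliterator<T> source, Predicate<? super T> startsGroup) {
        super(Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL);
        this.source = source;
        this.startsGroup = startsGroup;
    }

    @Override
    public boolean tryAdvance(Consumer<? super List<T>> action) {
        // Pull elements until one of them closes the current group.
        while (finished == null && source.tryAdvance(this::accept)) { }
        if (finished == null) {          // source exhausted: flush what is left
            if (current.isEmpty()) return false;
            finished = current;
            current = new ArrayList<>();
        }
        List<T> group = finished;
        finished = null;
        action.accept(group);
        return true;
    }

    // A matching element closes the group built so far and opens a new one.
    private void accept(T element) {
        if (startsGroup.test(element) && !current.isEmpty()) {
            finished = current;
            current = new ArrayList<>();
        }
        current.add(element);
    }

    static <T> Stream<List<T>> split(Stream<T> stream, Predicate<? super T> startsGroup) {
        return StreamSupport.stream(
                new GroupingSpliterator<>(stream.spliterator(), startsGroup), false)
                .onClose(stream::close);
    }
}
```

With this in place, something like GroupingSpliterator.split(reader.lines(), isLogEntryStart).map(List::stream) would yield the Stream<Stream<String>> the question asks for.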

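Answer 1 (score: 1)

Frederico's answer is probably the nicest way to solve this particular problem. Following his last thought about a custom Spliterator, I'll add an adapted version of my answer to a similar question, where I proposed using a custom iterator to create a chunked stream. This approach also works for other streams that are not created by input readers.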

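import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class StreamSplitter<T>
    implements Iterator<Stream<T>>
{
    private final Iterator<T>  incoming;
    private final Predicate<T> startOfNewEntry;
    private T                  nextLine;
    private boolean            hasNextLine; // true while nextLine holds an unconsumed element

    public static <T> Stream<Stream<T>> streamOf(Stream<T> incoming, Predicate<T> startOfNewEntry)
    {
        Iterable<Stream<T>> iterable = () -> new StreamSplitter<>(incoming, startOfNewEntry);
        return StreamSupport.stream(iterable.spliterator(), false);
    }

    private StreamSplitter(Stream<T> stream, Predicate<T> startOfNewEntry)
    {
        this.incoming = stream.iterator();
        this.startOfNewEntry = startOfNewEntry;
        advance(); // read one element ahead
    }

    // Moves the look-ahead to the next element, or marks the input as exhausted.
    private void advance()
    {
        hasNextLine = incoming.hasNext();
        nextLine = hasNextLine ? incoming.next() : null;
    }

    @Override
    public boolean hasNext()
    {
        return hasNextLine;
    }

    @Override
    public Stream<T> next()
    {
        if (!hasNextLine)
            throw new NoSuchElementException();

        // Collect the current line and everything up to the next entry start.
        // Tracking exhaustion explicitly avoids looping forever when the last
        // entry consists of a start line only.
        List<T> nextEntrysLines = new ArrayList<>();
        do
        {
            nextEntrysLines.add(nextLine);
            advance();
        } while (hasNextLine && !startOfNewEntry.test(nextLine));

        return nextEntrysLines.stream();
    }
}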

Example

public static void main(String[] args)
{
    Stream<String> flat = Stream.of("Start of log entry 1",
                                    "    ...some log details",
                                    "    ...some log details",
                                    "Start of log entry 2",
                                    "    ...some log details",
                                    "    ...some log details",
                                    "Start of log entry 3",
                                    "    ...some log details",
                                    "    ...some log details");

    StreamSplitter.streamOf(flat, line -> line.matches("Start of log entry.*"))
                  .forEach(logEntry -> {
                      System.out.println("------------------");
                      logEntry.forEach(System.out::println);
                  });
}

// Output
// ------------------
// Start of log entry 1
//     ...some log details
//     ...some log details
// ------------------
// Start of log entry 2
//     ...some log details
//     ...some log details
// ------------------
// Start of log entry 3
//     ...some log details
//     ...some log details

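The iterator always looks one line ahead. As soon as that line is the start of a new entry, it wraps the previous entry in a stream and returns it as next. The factory method streamOf turns this iterator into a stream, to be used as in the example given above.

I changed the split condition from a regex to a Predicate, so you can specify more complicated conditions with the help of multiple regexes, if conditions, and so on.

Note that I only tested it with the example data above, so I don't know how it behaves with more complicated, erroneous, or empty input.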