I have hundreds of large (6 GB) gzipped log files that I read with GZIPInputStream and want to parse. Suppose each one has the following format:
Start of log entry 1
...some log details
...some log details
...some log details
Start of log entry 2
...some log details
...some log details
...some log details
Start of log entry 3
...some log details
...some log details
...some log details
I stream the gzipped file contents line by line through BufferedReader.lines(). The stream looks like this:
[
"Start of log entry 1",
" ...some log details",
" ...some log details",
" ...some log details",
"Start of log entry 2",
" ...some log details",
" ...some log details",
" ...some log details",
"Start of log entry 3",
" ...some log details",
" ...some log details",
" ...some log details",
]
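The start of each log entry can be identified by the predicate line -> line.startsWith("Start of log entry"). I would like to transform this Stream<String> into a Stream<Stream<String>> according to this predicate. Each "sub-stream" should begin when the predicate is true and collect lines while the predicate is false, until the predicate is true again, which marks the end of that sub-stream and the beginning of the next. The result would look like this: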
[
[
"Start of log entry 1",
" ...some log details",
" ...some log details",
" ...some log details",
],
[
"Start of log entry 2",
" ...some log details",
" ...some log details",
" ...some log details",
],
[
"Start of log entry 3",
" ...some log details",
" ...some log details",
" ...some log details",
],
]
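From there, I can take each sub-stream and map it through new LogEntry(Stream<String> logLines), so that the related log lines are aggregated into LogEntry objects.
Here is a rough idea of how that would look: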
import java.io.*;
import java.nio.charset.*;
import java.util.*;
import java.util.function.*;
import java.util.stream.*;

import static java.lang.System.out;

class Untitled {
    static final String input =
        "Start of log entry 1\n" +
        "    ...some log details\n" +
        "    ...some log details\n" +
        "    ...some log details\n" +
        "Start of log entry 2\n" +
        "    ...some log details\n" +
        "    ...some log details\n" +
        "    ...some log details\n" +
        "Start of log entry 3\n" +
        "    ...some log details\n" +
        "    ...some log details\n" +
        "    ...some log details";

    static final Predicate<String> isLogEntryStart = line -> line.startsWith("Start of log entry");

    public static void main(String[] args) throws Exception {
        try (ByteArrayInputStream gzipInputStream
                 = new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8)); // mock for fileInputStream based gzipInputStream
             InputStreamReader inputStreamReader = new InputStreamReader( gzipInputStream );
             BufferedReader reader = new BufferedReader( inputStreamReader )) {

            reader.lines()
                .splitByPredicate(isLogEntryStart) // <--- What witchcraft should go here?
                .map(LogEntry::new)
                .forEach(out::println);
        }
    }
}
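Constraint: I have hundreds of these large files to process in parallel (but only a single sequential stream per file), which makes loading them completely into memory (e.g. by storing them as a List<String> lines) infeasible.
Any help is appreciated!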
Answer 0 (score: 2)
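I think the main problem is that you are reading line by line and trying to create a LogEntry instance out of individual lines, instead of reading block by block (where a block may span many lines).
For this, you can use Scanner.findAll (available since Java 9) with a suitable regex: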
String input =
    "Start of log entry 1\n" +
    "    ...some log details 1.1\n" +
    "    ...some log details 1.2\n" +
    "    ...some log details 1.3\n" +
    "Start of log entry 2\n" +
    "    ...some log details 2.1\n" +
    "    ...some log details 2.2\n" +
    "    ...some log details 2.3\n" +
    "Start of log entry 3\n" +
    "    ...some log details 3.1\n" +
    "    ...some log details 3.2\n" +
    "    ...some log details 3.3";

try (ByteArrayInputStream gzip =
         new ByteArrayInputStream(input.getBytes(StandardCharsets.UTF_8));
     InputStreamReader reader = new InputStreamReader(gzip);
     Scanner scanner = new Scanner(reader)) {

    String START = "Start of log entry \\d+";
    Pattern pattern = Pattern.compile(
        START + "(?<=" + START + ").*?(?=" + START + "|$)",
        Pattern.DOTALL);

    scanner.findAll(pattern)
        .map(MatchResult::group)
        .map(s -> s.split("\\R"))
        .map(LogEntry::new)
        .forEach(System.out::println);

} catch (IOException e) {
    throw new UncheckedIOException(e);
}
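So, this works by lazily finding matches in the Scanner instance. Scanner.findAll returns a Stream<MatchResult>, and MatchResult.group() returns the matched String. We then split this string on line breaks (\\R). This returns a String[] in which each element is one line. Then, assuming LogEntry has a constructor that accepts a String[] argument, we transform each of these arrays into a LogEntry instance. Finally, assuming LogEntry overrides toString(), we print each LogEntry instance to the output.
It is worth mentioning that the Scanner only starts its work when forEach is invoked on the stream.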
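The walkthrough above assumes a LogEntry class with a String[] constructor and an overridden toString(), neither of which appears in the question. A minimal sketch of what such a class might look like follows; the field names and the toString() format are purely hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical LogEntry: the first line is treated as the header,
// every remaining line as a detail line.
final class LogEntry {
    private final String header;
    private final List<String> details;

    LogEntry(String[] lines) {
        if (lines.length == 0) {
            throw new IllegalArgumentException("a log entry needs at least a header line");
        }
        this.header = lines[0];
        // Copy the detail lines so the entry does not share the caller's array.
        this.details = Collections.unmodifiableList(
            new ArrayList<>(Arrays.asList(lines).subList(1, lines.length)));
    }

    String header() {
        return header;
    }

    List<String> details() {
        return details;
    }

    @Override
    public String toString() {
        return "LogEntry{header='" + header + "', detailCount=" + details.size() + "}";
    }
}
```

Keeping only the parsed fields, rather than the raw matched text, is what lets each block be garbage-collected once its entry has been processed.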
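A separate note concerns the regex used to match the log entries in the input. I'm no regex expert, so I'm almost sure there is considerable room for improvement here. First, we use Pattern.DOTALL so that . matches not only ordinary characters but also line terminators. Then there is the actual regex. The idea is that it matches and consumes Start of log entry \\d+, then uses a look-behind against Start of log entry \\d+, then consumes characters from the input in a non-greedy manner (this is the .*? part), and finally looks ahead to check whether another occurrence of Start of log entry \\d+ follows or the end of the input has been reached. If you want to dig deeper into the subject, please refer to this amazing article about regular expressions.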
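To see how the lazy match and the look-ahead interact, the following sketch runs a simplified variant of the pattern over a small in-memory string with a plain Matcher instead of Scanner.findAll. Two things differ from the code above and are my own assumptions: the look-behind is left out (the header has already been consumed at that point, so the look-behind should not change what is matched), and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EntryRegexDemo {
    private static final String START = "Start of log entry \\d+";

    // Header, then lazily as little as possible, up to (but not including)
    // the next header or the end of the input.
    static final Pattern ENTRY =
        Pattern.compile(START + ".*?(?=" + START + "|$)", Pattern.DOTALL);

    // Returns each matched block as one String, in input order.
    static List<String> entries(String text) {
        List<String> blocks = new ArrayList<>();
        Matcher matcher = ENTRY.matcher(text);
        while (matcher.find()) {
            blocks.add(matcher.group());
        }
        return blocks;
    }

    public static void main(String[] args) {
        String text = "Start of log entry 1\n a\nStart of log entry 2\n b";
        for (String block : entries(text)) {
            System.out.println("[" + block.trim() + "]");
        }
    }
}
```

Because the look-ahead does not consume the next header, that header is still available as the start of the following match. Note that a detail line which itself contains the text "Start of log entry" followed by a number would also end the current block.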
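If you are not on Java 9+, I don't know of any similar alternative. What you could do, though, is create a custom Spliterator that wraps the Spliterator of the stream returned by BufferedReader.lines() and adds the desired parsing behavior to it. Then you would need to create a new Stream out of this Spliterator. Not a trivial task at all...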
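As a rough illustration of the Spliterator idea, here is a minimal sketch that works on Java 8. It is not a tested production implementation: it wraps the source stream's iterator rather than its Spliterator, emits each entry as a List<String> instead of a nested Stream, and the class name EntrySpliterator and the method split are invented for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.function.Predicate;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Groups consecutive lines into entries; only the entry currently being
// built (plus one look-ahead line) is held in memory.
final class EntrySpliterator extends Spliterators.AbstractSpliterator<List<String>> {
    private final Iterator<String> lines;
    private final Predicate<String> isEntryStart;
    private String pending; // first line of the next entry, or null if none is buffered

    private EntrySpliterator(Iterator<String> lines, Predicate<String> isEntryStart) {
        super(Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL);
        this.lines = lines;
        this.isEntryStart = isEntryStart;
    }

    @Override
    public boolean tryAdvance(Consumer<? super List<String>> action) {
        List<String> entry = new ArrayList<>();
        if (pending != null) {
            entry.add(pending);
            pending = null;
        }
        while (lines.hasNext()) {
            String line = lines.next();
            if (isEntryStart.test(line) && !entry.isEmpty()) {
                pending = line; // belongs to the following entry
                break;
            }
            entry.add(line);
        }
        if (entry.isEmpty()) {
            return false; // source exhausted
        }
        action.accept(entry);
        return true;
    }

    // Sequential stream of entries; closing it also closes the source stream.
    static Stream<List<String>> split(Stream<String> source, Predicate<String> isEntryStart) {
        return StreamSupport
            .stream(new EntrySpliterator(source.iterator(), isEntryStart), false)
            .onClose(source::close);
    }
}
```

With a reader in hand, a call such as EntrySpliterator.split(reader.lines(), isLogEntryStart) could then be followed by a map to LogEntry. Any lines that appear before the first header end up in an entry of their own, which may or may not be what you want.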
Answer 1 (score: 1)
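Frederico's answer is probably the nicest way to solve this particular problem. Following his last thought about a custom Spliterator, I'll add an adapted version of my answer to a similar question, where I proposed using a custom iterator to create a chunked stream. This approach also works for streams that are not created by input readers.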
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

public class StreamSplitter<T>
        implements Iterator<Stream<T>>
{
    private final Iterator<T> incoming;
    private final Predicate<T> startOfNewEntry;
    // Look-ahead element: first line of the next entry, or null when the input is exhausted.
    private T nextLine;

    public static <T> Stream<Stream<T>> streamOf(Stream<T> incoming, Predicate<T> startOfNewEntry)
    {
        Iterable<Stream<T>> iterable = () -> new StreamSplitter<>(incoming, startOfNewEntry);
        return StreamSupport.stream(iterable.spliterator(), false);
    }

    private StreamSplitter(Stream<T> stream, Predicate<T> startOfNewEntry)
    {
        this.incoming = stream.iterator();
        this.startOfNewEntry = startOfNewEntry;
        if (incoming.hasNext())
            nextLine = incoming.next();
    }

    @Override
    public boolean hasNext()
    {
        return nextLine != null;
    }

    @Override
    public Stream<T> next()
    {
        if (nextLine == null)
            throw new NoSuchElementException();
        List<T> nextEntrysLines = new ArrayList<>();
        nextEntrysLines.add(nextLine);
        nextLine = null; // stays null if the input ends inside this entry
        while (incoming.hasNext())
        {
            T line = incoming.next();
            if (startOfNewEntry.test(line))
            {
                nextLine = line; // first line of the following entry
                break;
            }
            nextEntrysLines.add(line);
        }
        return nextEntrysLines.stream();
    }
}
Example
public static void main(String[] args)
{
    Stream<String> flat = Stream.of("Start of log entry 1",
                                    "    ...some log details",
                                    "    ...some log details",
                                    "Start of log entry 2",
                                    "    ...some log details",
                                    "    ...some log details",
                                    "Start of log entry 3",
                                    "    ...some log details",
                                    "    ...some log details");

    StreamSplitter.streamOf(flat, line -> line.matches("Start of log entry.*"))
                  .forEach(logEntry -> {
                      System.out.println("------------------");
                      logEntry.forEach(System.out::println);
                  });
}
// Output
// ------------------
// Start of log entry 1
// ...some log details
// ...some log details
// ------------------
// Start of log entry 2
// ...some log details
// ...some log details
// ------------------
// Start of log entry 3
// ...some log details
// ...some log details
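The iterator always looks one line ahead. As soon as that line is the beginning of a new entry, it wraps the previous entry in a stream and returns it from next. The factory method streamOf turns this iterator into a stream, to be used as in the example above.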
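I changed the split condition from a regex to a Predicate, so you can specify more complicated conditions with the help of multiple regexes, if-conditions, and so on.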
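As a small illustration of such a combined condition, the sketch below joins a plain prefix check with a regex through Predicate.or. The second header format (a timestamp followed by BEGIN) is invented for this example and does not come from the question.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

final class EntryStartPredicates {
    // Hypothetical alternative header, e.g. "2019-01-31 12:00:00 BEGIN ...".
    private static final Pattern TIMESTAMP_HEADER =
        Pattern.compile("^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2} BEGIN");

    // True for lines that start an entry in either of the two formats.
    static Predicate<String> entryStart() {
        Predicate<String> plainHeader = line -> line.startsWith("Start of log entry");
        return plainHeader.or(TIMESTAMP_HEADER.asPredicate());
    }
}
```

The resulting predicate can be passed as the second argument of streamOf in place of the lambda used in the example.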
Note that I only tested it with the example data above, so I don't know how it behaves with more complicated, erroneous, or empty input.