Question

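I have a text file that contains URLs and email addresses. I need to extract all of them from the file. Each URL and email address may occur more than once, but the result should not contain duplicates. I can extract all URLs with the following code: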

Files.lines(filePath)
    .map(urlPattern::matcher)
    .filter(Matcher::find)
    .map(Matcher::group)
    .distinct();

I can extract all email addresses with the following code:

Files.lines(filePath)
    .map(emailPattern::matcher)
    .filter(Matcher::find)
    .map(Matcher::group)
    .distinct();

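Can I extract all URLs and email addresses while reading the stream returned by Files.lines(filePath) only once? Something like splitting the stream of lines into a stream of URLs and a stream of emails.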

Answer 1

You can use the partitioningBy collector, though it is still not a very elegant solution.

Map<Boolean, List<String>> map = Files.lines(filePath)
        .filter(str -> urlPattern.matcher(str).matches() ||
                       emailPattern.matcher(str).matches())
        .distinct()
        .collect(Collectors.partitioningBy(str -> urlPattern.matcher(str).matches()));
List<String> urls = map.get(true);
List<String> emails = map.get(false);

If you don't want to apply the regexp twice, you can use an intermediate pair object (for example, SimpleEntry):

public static String classify(String str) {
    return urlPattern.matcher(str).matches() ? "url" : 
        emailPattern.matcher(str).matches() ? "email" : null;
}

Map<String, Set<String>> map = Files.lines(filePath)
        .map(str -> new AbstractMap.SimpleEntry<>(classify(str), str))
        .filter(e -> e.getKey() != null)
        .collect(Collectors.groupingBy(e -> e.getKey(),
            Collectors.mapping(e -> e.getValue(), Collectors.toSet())));

With my free StreamEx library, the last step is shorter:

Map<String, Set<String>> map = StreamEx.of(Files.lines(filePath))
        .mapToEntry(str -> classify(str), Function.identity())
        .nonNullKeys()
        .grouping(Collectors.toSet());

Answer 2

You can perform the matching inside the collect operation:

Map<String,Set<String>> map=Files.lines(filePath)
    .collect(HashMap::new,
        (hm,line)-> {
            Matcher m=emailPattern.matcher(line);
            if(m.matches())
              hm.computeIfAbsent("mail", x->new HashSet<>()).add(line);
            else if(m.usePattern(urlPattern).matches())
              hm.computeIfAbsent("url", x->new HashSet<>()).add(line);
        },
        (m1,m2)-> m2.forEach((k,v)->m1.merge(k, v,
                                     (s1,s2)->{s1.addAll(s2); return s1;}))
    );
Set<String> mail=map.get("mail"), url=map.get("url");

Note that this can easily be adapted to find multiple matches within a line:

Map<String,Set<String>> map=Files.lines(filePath)
    .collect(HashMap::new,
        (hm,line)-> {
            Matcher m=emailPattern.matcher(line);
            while(m.find())
              hm.computeIfAbsent("mail", x->new HashSet<>()).add(m.group());
            m.usePattern(urlPattern).reset();
            while(m.find())
              hm.computeIfAbsent("url", x->new HashSet<>()).add(m.group());
        },
        (m1,m2)-> m2.forEach((k,v)->m1.merge(k, v,
                                     (s1,s2)->{s1.addAll(s2); return s1;}))
    );
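On newer Java versions, the same single-pass idea might also be written with Collectors.teeing, which feeds each element to two downstream collectors and then merges their results. This is a different technique from the collect call in this answer, not a restatement of it. The sketch below assumes Java 12 or later; the class name TeeingSplit, the helper matches, and both regular expressions are hypothetical stand-ins, since the question does not show the real patterns.

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class TeeingSplit {
    // Simplified, hypothetical patterns; the question does not show the real ones.
    static final Pattern URL = Pattern.compile("https?://\\S+");
    static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");

    // All matches of the pattern within one line (Matcher.results() needs Java 9+).
    static Stream<String> matches(Pattern p, String line) {
        return p.matcher(line).results().map(MatchResult::group);
    }

    // Feeds every line to two collectors in one pass (Collectors.teeing needs Java 12+).
    static Map<String, Set<String>> split(Stream<String> lines) {
        return lines.collect(Collectors.teeing(
                Collectors.flatMapping((String line) -> matches(URL, line), Collectors.toSet()),
                Collectors.flatMapping((String line) -> matches(EMAIL, line), Collectors.toSet()),
                (urls, emails) -> Map.of("url", urls, "mail", emails)));
    }
}
```

With a file, this would be called as TeeingSplit.split(Files.lines(filePath)), ideally inside a try-with-resources block so the underlying file handle is closed.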

Answer 3

Since you can't reuse a Stream, I think the only option is to do it manually.

Files.lines(filePath).forEach(s -> /** match and sort into two lists */ );

If there is another solution, though, I'd be glad to learn about it!
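As a rough illustration of the manual approach, the forEach body could collect matches into two sets. This is only a sketch: the class name ManualSplit, the helper addMatches, and both regular expressions are hypothetical, and sets are used instead of lists so that duplicates are dropped as the question requires.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

class ManualSplit {
    // Simplified, hypothetical patterns; the question does not show the real ones.
    static final Pattern URL = Pattern.compile("https?://\\S+");
    static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");

    final Set<String> urls = new LinkedHashSet<>();
    final Set<String> emails = new LinkedHashSet<>();

    // Adds every match of the pattern found in the line to the target set.
    static void addMatches(Pattern p, String line, Set<String> target) {
        Matcher m = p.matcher(line);
        while (m.find()) {
            target.add(m.group());
        }
    }

    // Consumes the stream once. The sets are not thread-safe,
    // so this is meant for sequential streams only.
    static ManualSplit of(Stream<String> lines) {
        ManualSplit result = new ManualSplit();
        lines.forEach(line -> {
            addMatches(URL, line, result.urls);
            addMatches(EMAIL, line, result.emails);
        });
        return result;
    }
}
```

The side effects inside forEach are what make this "manual": it works, but it gives up the ability to run the stream in parallel without extra synchronization.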

Answer 4

The overall question should be: why do you want to stream only once?

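Extracting the URLs and extracting the emails are different operations and should therefore be handled in their own stream operations. Even if the underlying stream source contains hundreds of thousands of records, the time spent iterating is negligible compared with the mapping and filtering operations.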

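The only thing you should consider as a possible performance issue is the IO operation. The cleanest solution is therefore to read the file only once and then stream over the resulting collection twice: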

List<String> allLines = Files.readAllLines(filePath);
allLines.stream() ... // here do the URLs
allLines.stream() ... // here do the emails

Of course, this requires some memory.
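One way the two elided pipelines might be filled in is a single helper that is called once per pattern. This is a sketch under assumptions: the class name TwoPasses, the method extract, and both regular expressions are hypothetical, and Matcher.results() requires Java 9 or later.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class TwoPasses {
    // Simplified, hypothetical patterns; the question does not show the real ones.
    static final Pattern URL = Pattern.compile("https?://\\S+");
    static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");

    // One pass over the in-memory lines for a single pattern;
    // collecting into a set removes duplicates and keeps first-seen order.
    static Set<String> extract(List<String> lines, Pattern p) {
        return lines.stream()
                .flatMap(line -> p.matcher(line).results())
                .map(MatchResult::group)
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }
}
```

After List<String> allLines = Files.readAllLines(filePath), the calls would be TwoPasses.extract(allLines, TwoPasses.URL) and TwoPasses.extract(allLines, TwoPasses.EMAIL), so the file is read once and the list is traversed twice.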
