Efficiently counting the number of lines in a file

Time: 2014-12-18 01:57:28

Tags: powershell

I'm trying to count the lines in a not-so-small text file (several MB). The answers I found here suggest this:

(Get-Content foo.txt | Measure-Object -Line).Lines

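This works, but performance is poor. I suspect the whole file is being loaded into memory instead of being streamed line by line.

I wrote a test program in Java to compare the performance: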

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Scanner;
import java.util.concurrent.TimeUnit;
import java.util.function.ToLongFunction;
import java.util.stream.Stream;

public class LineCounterPerformanceTest {
    public static void main(final String... args) {
        if (args.length > 0) {
            final String path = args[0];
            measure(LineCounterPerformanceTest::java, path);
            measure(LineCounterPerformanceTest::powershell, path);
        } else {
            System.err.println("Missing path.");
            System.exit(-1);
        }
    }

    private static long java(final String path) throws IOException {
        System.out.println("Java");
        try (final Stream<String> lines = Files.lines(Paths.get(path))) {
            return lines.count();
        }
    }

    private static long powershell(final String path) throws IOException, InterruptedException {
        System.out.println("Powershell");
        final Process ps = new ProcessBuilder("powershell", String.format("(Get-Content '%s' | Measure-Object -Line).Lines", path)).start();
        if (ps.waitFor(1, TimeUnit.MINUTES) && ps.exitValue() == 0) {
            try (final Scanner scanner = new Scanner(ps.getInputStream())) {
                return scanner.nextLong();
            }
        }
        throw new IOException("Timeout or error.");
    }

    private static <T, U extends T> void measure(final ExceptionalToLongFunction<T> function, final U value) {
        final long start = System.nanoTime();
        final long result = function.unchecked().applyAsLong(value);
        final long end = System.nanoTime();
        System.out.printf("Result: %d%n", result);
        System.out.printf("Elapsed time (ms): %,.6f%n%n", (end - start) / 1_000_000.);
    }

    @FunctionalInterface
    private static interface ExceptionalToLongFunction<T> {
        long applyAsLong(T value) throws Exception;

        default ToLongFunction<T> unchecked() {
            return (value) -> {
                try {
                    return applyAsLong(value);
                } catch (final Exception ex) {
                    throw new RuntimeException(ex);
                }
            };
        }
    }
}

The plain Java solution is about 80 times faster.

Is there a built-in way to do this with comparable performance? I'm on PowerShell 4.0, if that matters.

3 answers:

Answer 0 (score: 4)

See if this is faster than your current approach:

$count = 0
Get-Content foo.txt -ReadCount 2000 |
    ForEach-Object { $count += $_.Count }

$count
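
The speed-up here comes from -ReadCount: instead of sending one string per line down the pipeline, Get-Content sends arrays of up to 2000 lines, so the script block runs once per batch rather than once per line. A minimal sketch of that behavior, using a throwaway temp file (the file and its 5000 generated lines are made up for illustration):

```powershell
# Build a small sample file; the path and contents are hypothetical.
$path = [System.IO.Path]::GetTempFileName()
1..5000 | ForEach-Object { "line $_" } | Set-Content -Path $path

$batches = 0
$count = 0
Get-Content -Path $path -ReadCount 2000 | ForEach-Object {
    $batches++            # one pipeline object per batch
    $count += $_.Count    # each object is an array of up to 2000 lines
}

"Batches: $batches"
"Lines:   $count"

Remove-Item -Path $path
```

On this sample it should report 3 batches (2000 + 2000 + 1000) and 5000 lines. The best batch size probably depends on the file, so 2000 is a starting point to tune, not a fixed rule.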

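Answer 1 (score: 1)

You can use a StreamReader for this kind of thing. I'm not sure how its speed compares with your Java code, but my understanding is that the ReadLine method only loads one line at a time.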

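$StreamReader = New-Object System.IO.StreamReader($File)
$LineCount = 0

try
{
    # ReadLine() returns $null at end of file; empty lines come back as "".
    while ($null -ne $StreamReader.ReadLine())
    {
        $LineCount++
    }
}
finally
{
    # Release the file handle even if reading fails.
    $StreamReader.Close()
}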
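
One thing to watch with the StreamReader approach: .NET resolves relative paths against the process working directory, which PowerShell does not update when you change location, so a relative $File can point at the wrong place. A sketch of the same loop wrapped in a reusable function (the name Get-LineCount is made up for this example) that resolves the path first:

```powershell
function Get-LineCount {
    param([Parameter(Mandatory = $true)][string]$Path)

    # Resolve against the current PowerShell location to get a full file system path.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $reader = New-Object System.IO.StreamReader($fullPath)
    try {
        $count = 0
        # ReadLine returns $null only at end of stream; empty lines come back as "".
        while ($null -ne $reader.ReadLine()) { $count++ }
        $count
    }
    finally {
        $reader.Dispose()
    }
}

# Usage on a throwaway file (hypothetical contents, including one empty line).
$tmp = [System.IO.Path]::GetTempFileName()
Set-Content -Path $tmp -Value 'first', '', 'third'
$lines = Get-LineCount -Path $tmp
"Lines: $lines"
Remove-Item -Path $tmp
```

The sample file has three lines, one of them empty, and the function should count all three.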

Answer 2 (score: 0)

For GB+ files with lines more than 900 characters long, switch is faster.

$count = 0; switch -File $filepath {default { ++$count }}
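
Since the relative speed of these approaches seems to depend on file size and line length, it may be worth timing them on your own data. A rough sketch using Measure-Command, run here against a generated temp file (the 20000-line file and the labels are arbitrary placeholders; substitute a real path for meaningful numbers):

```powershell
# Generate a sample file; size and contents are arbitrary placeholders.
$path = [System.IO.Path]::GetTempFileName()
[System.IO.File]::WriteAllLines($path, [string[]](1..20000 | ForEach-Object { "sample line $_" }))

# Each entry maps a label to a script block that returns the line count.
$approaches = [ordered]@{
    'Measure-Object' = { (Get-Content $path | Measure-Object -Line).Lines }
    'ReadCount 2000' = { $n = 0; Get-Content $path -ReadCount 2000 | ForEach-Object { $n += $_.Count }; $n }
    'StreamReader'   = {
        $r = New-Object System.IO.StreamReader($path)
        try { $n = 0; while ($null -ne $r.ReadLine()) { $n++ }; $n } finally { $r.Dispose() }
    }
    'switch -File'   = { $n = 0; switch -File $path { default { ++$n } }; $n }
}

$counts = @{}
foreach ($name in $approaches.Keys) {
    $block = $approaches[$name]
    # Measure-Command discards the block's output, so the count is stored in $counts.
    $elapsed = Measure-Command { $counts[$name] = & $block }
    '{0,-15} {1,8} lines {2,10:N1} ms' -f $name, $counts[$name], $elapsed.TotalMilliseconds
}

Remove-Item -Path $path
```

All four approaches should agree on the count for this file. The timings will vary by machine and by run, so repeat the measurement a few times before drawing conclusions.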