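Number of lines in a file in Java

Time: 2009-01-17 08:59:06

Tags: java large-files line-numbers

I work with huge data files, and sometimes I only need to know the number of lines in them. Usually I open them and read them line by line until I reach the end of the file.

I was wondering if there is a smarter way to do that.

19 answers: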

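Answer 0: (score: 227)

This is the fastest version I have found so far, about 6 times faster than readLines. On a 150MB log file this takes 0.35 seconds, versus 2.40 seconds when using readLines(). Just for fun, Linux's wc -l command takes 0.15 seconds.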

public static int countLinesOld(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean empty = true;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
        }
        return (count == 0 && !empty) ? 1 : count;
    } finally {
        is.close();
    }
}
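Edit, 9 1/2 years later: I have practically no Java experience, but I tried to benchmark this code against the LineNumberReader solution below anyway, since it bothered me that nobody had done it. It seems that my solution is faster, especially for large files, although it takes a few runs before the optimizer does a decent job. I've played a bit with the code and produced a new version that is consistently the fastest: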

public static int countLinesNew(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];

        int readChars = is.read(c);
        if (readChars == -1) {
            // bail out if nothing to read
            return 0;
        }

        // make it easy for the optimizer to tune this loop
        int count = 0;
        while (readChars == 1024) {
            for (int i=0; i<1024;) {
                if (c[i++] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }

        // count remaining characters
        while (readChars != -1) {
            for (int i=0; i<readChars; ++i) {
                if (c[i] == '\n') {
                    ++count;
                }
            }
            readChars = is.read(c);
        }

        return count == 0 ? 1 : count;
    } finally {
        is.close();
    }
}

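Benchmark results for a 1.3GB text file, y axis in seconds. I performed 100 runs with the same file and measured each run with System.nanoTime(). You can see that countLinesOld has a few outliers and countLinesNew has none. It is only a bit faster, but the difference is statistically significant. LineNumberReader is clearly slower.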

Benchmark Plot
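
The measurement procedure described above (repeated runs over the same file, each timed with System.nanoTime()) might be sketched roughly as follows. This is not the code that produced the plot: the class LineCountBenchmark, the LineCounter interface, and the timeRuns helper are hypothetical names, and countNewlines is only a simplified stand-in for the methods being compared.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical timing harness; not the code used for the plot above.
class LineCountBenchmark {

    // Hypothetical interface so that any counting method can be plugged in.
    interface LineCounter {
        int count(Path file) throws IOException;
    }

    // Counts '\n' bytes, in the spirit of countLinesOld but without its special cases.
    static int countNewlines(Path file) throws IOException {
        try (InputStream is = new BufferedInputStream(Files.newInputStream(file))) {
            byte[] buf = new byte[1024];
            int count = 0;
            int n;
            while ((n = is.read(buf)) != -1) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n') {
                        count++;
                    }
                }
            }
            return count;
        }
    }

    // Runs the counter `runs` times and returns the elapsed nanoseconds of each run.
    static long[] timeRuns(LineCounter counter, Path file, int runs) throws IOException {
        long[] nanos = new long[runs];
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            counter.count(file);
            nanos[r] = System.nanoTime() - start;
        }
        return nanos;
    }
}
```

Something like timeRuns(LineCountBenchmark::countNewlines, file, 100) would yield one sample per run. The first few samples are likely to be slower until the JIT has warmed up, which matches the observation in the edit above.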

Answer 1: (score: 195)

I have implemented another solution to the problem, which I found more efficient at counting lines:

try
(
   FileReader       input = new FileReader("input.txt");
   LineNumberReader count = new LineNumberReader(input);
)
{
   while (count.skip(Long.MAX_VALUE) > 0)
   {
      // Loop just in case the file is > Long.MAX_VALUE or skip() decides to not read the entire file
   }

   result = count.getLineNumber() + 1;                                    // +1 because line index starts at 0
}

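Answer 2: (score: 28)

The accepted answer has an off-by-one error for multi-line files that don't end in a newline. A one-line file ending in a newline returns 1, but a two-line file with no trailing newline also returns 1. Here is an implementation of the accepted solution that fixes this. The endsWithoutNewLine check is wasteful for everything except the final read, but its cost should be trivial compared to the overall function.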

public int count(String filename) throws IOException {
    InputStream is = new BufferedInputStream(new FileInputStream(filename));
    try {
        byte[] c = new byte[1024];
        int count = 0;
        int readChars = 0;
        boolean endsWithoutNewLine = false;
        while ((readChars = is.read(c)) != -1) {
            for (int i = 0; i < readChars; ++i) {
                if (c[i] == '\n')
                    ++count;
            }
            endsWithoutNewLine = (c[readChars - 1] != '\n');
        }
        if(endsWithoutNewLine) {
            ++count;
        } 
        return count;
    } finally {
        is.close();
    }
}
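
To see the difference concretely, the two counting rules can be applied to in-memory bytes instead of a file. This is only an illustrative sketch: the class TrailingNewlineDemo and its method names are made up, and the logic is a simplified restatement of the accepted answer's rule and of the fix above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class comparing the two counting rules on in-memory data.
class TrailingNewlineDemo {

    // Rule of the accepted answer: count '\n', but report 1 for
    // non-empty data that contains no '\n' at all.
    static int acceptedRule(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if (b == '\n') {
                count++;
            }
        }
        return (count == 0 && data.length > 0) ? 1 : count;
    }

    // Rule of the fix: count '\n', plus one more if the data
    // does not end with '\n'.
    static int fixedRule(byte[] data) {
        int count = 0;
        for (byte b : data) {
            if (b == '\n') {
                count++;
            }
        }
        if (data.length > 0 && data[data.length - 1] != '\n') {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] twoLines = "one\ntwo".getBytes(StandardCharsets.US_ASCII);
        System.out.println(acceptedRule(twoLines)); // prints 1
        System.out.println(fixedRule(twoLines));    // prints 2
    }
}
```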

Answer 3: (score: 20)

With Java 8, you can use streams:

try (Stream<String> lines = Files.lines(path, Charset.defaultCharset())) {
  long numOfLines = lines.count();
  ...
}
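
For completeness, a self-contained version of the snippet above might look like the sketch below. The class name StreamLineCount is hypothetical, and UTF-8 is an assumption about the file's encoding; Files.lines can fail with an UncheckedIOException if the bytes are not valid in the chosen charset.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical wrapper around Files.lines(...).count().
class StreamLineCount {

    // Assumes the file is UTF-8 encoded; pass a different Charset if it is not.
    static long countLines(Path path) throws IOException {
        // try-with-resources closes the underlying file once counting is done
        try (Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
            return lines.count();
        }
    }
}
```

Line boundaries follow the same rules as BufferedReader.readLine(), so a last line without a trailing newline is still counted.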

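Answer 4: (score: 12)

The answer with the count() method above gave me wrong line counts when a file had no newline at the end: it failed to count the last line.

This method works better for me: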

public int countLines(String filename) throws IOException {
    LineNumberReader reader  = new LineNumberReader(new FileReader(filename));
    int cnt = 0;
    String lineRead = "";
    while ((lineRead = reader.readLine()) != null) {}

    cnt = reader.getLineNumber();
    reader.close();
    return cnt;
}

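Answer 5: (score: 8)

I know this is an old question, but the accepted solution didn't quite match what I needed. So I refined it to accept various line terminators (rather than just line feed) and to use a specified character encoding (rather than ISO-8859-n). All in one method (refactor as appropriate):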

public static long getLinesCount(String fileName, String encodingName) throws IOException {
    long linesCount = 0;
    File file = new File(fileName);
    FileInputStream fileIn = new FileInputStream(file);
    try {
        Charset encoding = Charset.forName(encodingName);
        Reader fileReader = new InputStreamReader(fileIn, encoding);
        int bufferSize = 4096;
        Reader reader = new BufferedReader(fileReader, bufferSize);
        char[] buffer = new char[bufferSize];
        int prevChar = -1;
        int readCount = reader.read(buffer);
        while (readCount != -1) {
            for (int i = 0; i < readCount; i++) {
                int nextChar = buffer[i];
                switch (nextChar) {
                    case '\r': {
                        // The current line is terminated by a carriage return or by a carriage return immediately followed by a line feed.
                        linesCount++;
                        break;
                    }
                    case '\n': {
                        if (prevChar == '\r') {
                            // The current line is terminated by a carriage return immediately followed by a line feed.
                            // The line has already been counted.
                        } else {
                            // The current line is terminated by a line feed.
                            linesCount++;
                        }
                        break;
                    }
                }
                prevChar = nextChar;
            }
            readCount = reader.read(buffer);
        }
        if (prevChar != -1) {
            switch (prevChar) {
                case '\r':
                case '\n': {
                    // The last line is terminated by a line terminator.
                    // The last line has already been counted.
                    break;
                }
                default: {
                    // The last line is terminated by end-of-file.
                    linesCount++;
                }
            }
        }
    } finally {
        fileIn.close();
    }
    return linesCount;
}

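This solution is comparable in speed to the accepted solution, about 4% slower in my tests (though timing tests in Java are notoriously unreliable).

Answer 6: (score: 4)

I tested the above methods for counting lines. Here are my observations for the different methods, as tested on my system.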

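File size: 1.6 GB. Methods:

  1. Using Scanner: approx. 35 s
  2. Using BufferedReader: approx. 5 s
  3. Using Java 8: approx. 5 s
  4. Using LineNumberReader: approx. 5 s

Moreover, the Java 8 approach seems quite handy: Files.lines(Paths.get(filePath), Charset.defaultCharset()).count() [return type: long]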
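
The answer does not show the code it timed. As a rough guess at what the BufferedReader variant from the list could look like, here is a minimal sketch; the class name BufferedReaderLineCount and the choice to accept any Reader are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical sketch of the "BufferedReader" method from the list above.
class BufferedReaderLineCount {

    // Reads the source line by line and counts how many lines were returned.
    // The reader (and therefore the source) is closed when counting finishes.
    static long countLines(Reader source) throws IOException {
        try (BufferedReader reader = new BufferedReader(source)) {
            long count = 0;
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        }
    }
}
```

For a file, it could be called as countLines(Files.newBufferedReader(path)), which decodes the file as UTF-8.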

Answer 7: (score: 4)

/**
 * Count file rows.
 *
 * @param file file
 * @return file row count
 * @throws IOException
 */
public static long getLineCount(File file) throws IOException {

    try (Stream<String> lines = Files.lines(file.toPath())) {
        return lines.count();
    }
}

Tested on JDK8_u31. But the performance is indeed slow compared to this method:

/**
 * Count file rows.
 *
 * @param file file
 * @return file row count
 * @throws IOException
 */
public static long getLineCount(File file) throws IOException {

    try (BufferedInputStream is = new BufferedInputStream(new FileInputStream(file), 1024)) {

        byte[] c = new byte[1024];
        boolean empty = true,
                lastEmpty = false;
        long count = 0;
        int read;
        while ((read = is.read(c)) != -1) {
            for (int i = 0; i < read; i++) {
                if (c[i] == '\n') {
                    count++;
                    lastEmpty = true;
                } else if (lastEmpty) {
                    lastEmpty = false;
                }
            }
            empty = false;
        }

        if (!empty) {
            if (count == 0) {
                count = 1;
            } else if (!lastEmpty) {
                count++;
            }
        }

        return count;
    }
}

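Tested and very fast.

Answer 8: (score: 3)

I concluded that wc -l's method of counting newlines is fine, but it returns non-intuitive results on files where the last line doesn't end with a newline.

@er.vikas's solution, which is based on LineNumberReader but adds one to the line count, returns non-intuitive results on files where the last line does end with a newline.

I therefore made an algorithm which handles the following: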

@Test
public void empty() throws IOException {
    assertEquals(0, count(""));
}

@Test
public void singleNewline() throws IOException {
    assertEquals(1, count("\n"));
}

@Test
public void dataWithoutNewline() throws IOException {
    assertEquals(1, count("one"));
}

@Test
public void oneCompleteLine() throws IOException {
    assertEquals(1, count("one\n"));
}

@Test
public void twoCompleteLines() throws IOException {
    assertEquals(2, count("one\ntwo\n"));
}

@Test
public void twoLinesWithoutNewlineAtEnd() throws IOException {
    assertEquals(2, count("one\ntwo"));
}

@Test
public void aFewLines() throws IOException {
    assertEquals(5, count("one\ntwo\nthree\nfour\nfive\n"));
}

And it looks like this:

static long countLines(InputStream is) throws IOException {
    try(LineNumberReader lnr = new LineNumberReader(new InputStreamReader(is))) {
        char[] buf = new char[8192];
        int n, previousN = -1;
        //Read will return at least one byte, no need to buffer more
        while((n = lnr.read(buf)) != -1) {
            previousN = n;
        }
        int ln = lnr.getLineNumber();
        if (previousN == -1) {
            //No data read at all, i.e file was empty
            return 0;
        } else {
            char lastChar = buf[previousN - 1];
            if (lastChar == '\n' || lastChar == '\r') {
                //Ending with newline, deduct one
                return ln;
            }
        }
        //normal case, return line number + 1
        return ln + 1;
    }
}
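
The tests in this answer call count(String), which the answer does not show. Presumably it wraps the string in a stream and delegates to countLines. A hypothetical adapter along those lines is sketched below; the class name CountLinesTestSupport and the count helper are assumptions, and countLines is repeated from the answer only so that the block compiles on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.LineNumberReader;
import java.nio.charset.StandardCharsets;

// Hypothetical test support class; only count(String) is new.
class CountLinesTestSupport {

    // Hypothetical adapter: feeds the test string to countLines as a byte stream.
    static long count(String text) throws IOException {
        return countLines(new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
    }

    // Same logic as countLines in the answer.
    static long countLines(InputStream is) throws IOException {
        try (LineNumberReader lnr = new LineNumberReader(new InputStreamReader(is))) {
            char[] buf = new char[8192];
            int n, previousN = -1;
            while ((n = lnr.read(buf)) != -1) {
                previousN = n;
            }
            int ln = lnr.getLineNumber();
            if (previousN == -1) {
                return 0; // nothing was read: empty input
            }
            char lastChar = buf[previousN - 1];
            if (lastChar == '\n' || lastChar == '\r') {
                return ln; // trailing terminator: the last line is already counted
            }
            return ln + 1; // last line has no terminator
        }
    }
}
```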

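If you want intuitive results, you may use this. If you just want wc -l compatibility, simply use @er.vikas's solution, but don't add one to the result, and retry the skip: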

try(LineNumberReader lnr = new LineNumberReader(new FileReader(new File("File1")))) {
    while(lnr.skip(Long.MAX_VALUE) > 0){};
    return lnr.getLineNumber();
}

Answer 9: (score: 3)

A straightforward way using Scanner:

static void lineCounter (String path) throws IOException {

        int lineCount = 0, commentsCount = 0;

        Scanner input = new Scanner(new File(path));
        while (input.hasNextLine()) {
            String data = input.nextLine();

            if (data.startsWith("//")) commentsCount++;

            lineCount++;
        }

        System.out.println("Line Count: " + lineCount + "\t Comments Count: " + commentsCount);
    }

Answer 10: (score: 2)

How about using the Process class from within Java code, and then reading the output of the command?

Process p = Runtime.getRuntime().exec("wc -l " + yourfilename);
p.waitFor();

BufferedReader b = new BufferedReader(new InputStreamReader(p.getInputStream()));
String line = "";
int lineCount = 0;
while ((line = b.readLine()) != null) {
    System.out.println(line);
    // wc -l prints "<count> <filename>", so parse only the first token
    lineCount = Integer.parseInt(line.trim().split("\\s+")[0]);
}

I need to try it out. Will post the results.
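
A somewhat more defensive sketch of the same idea is shown below. It assumes a wc executable is available on the PATH (so it will not work on a stock Windows system), and the class and method names are hypothetical. Passing the arguments as a list avoids problems with spaces in the file name, and only the first token of the output is parsed because wc -l prints the count followed by the file name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.file.Path;

// Hypothetical helper that shells out to `wc -l`; assumes wc is on the PATH.
class WcLineCount {

    // Extracts the leading number from a line such as "  42 some/file.txt".
    static long parseWcOutput(String line) {
        String trimmed = line.trim();
        int end = 0;
        while (end < trimmed.length() && !Character.isWhitespace(trimmed.charAt(end))) {
            end++;
        }
        return Long.parseLong(trimmed.substring(0, end));
    }

    static long countWithWc(Path file) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("wc", "-l", file.toString())
                .redirectErrorStream(true) // merge stderr so it cannot block the child
                .start();
        String first;
        try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            first = r.readLine(); // read the output before waiting for exit
        }
        int exit = p.waitFor();
        if (exit != 0 || first == null) {
            throw new IOException("wc failed: " + first);
        }
        return parseWcOutput(first);
    }
}
```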

Answer 11: (score: 1)

It seems that there are a few different approaches you can take with LineNumberReader.

I did this:

int lines = 0;

FileReader input = new FileReader(fileLocation);
LineNumberReader count = new LineNumberReader(input);

String line = count.readLine();

if(count.ready())
{
    while(line != null) {
        lines = count.getLineNumber();
        line = count.readLine();
    }
    
    lines+=1;
}
    
count.close();

System.out.println(lines);

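Even more simply, you can use the Java BufferedReader lines() method to return a stream of the elements, and then use the Stream count() method to count all of the elements. Then simply add one to the output to get the number of lines in the text file.

For example: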

FileReader input = new FileReader(fileLocation);
LineNumberReader count = new LineNumberReader(input);

int lines = (int)count.lines().count() + 1;
    
count.close();

System.out.println(lines);

Answer 12: (score: 1)

This funny solution actually works really well!

public static int countLines(File input) throws IOException {
    try (InputStream is = new FileInputStream(input)) {
        int count = 1;
        for (int aChar = 0; aChar != -1;aChar = is.read())
            count += aChar == '\n' ? 1 : 0;
        return count;
    }
}

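Answer 13: (score: 1)

If you don't have any index structures, you won't get around reading the complete file. But you can optimize it by avoiding reading it line by line and using a regex to match all line terminators.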
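
As a rough illustration of the regex idea, the sketch below counts line terminators in text that is already in memory. The class name and the pattern are assumptions, not something this answer specifies. It counts terminators rather than lines, so a final line without a terminator is not included, and a chunked version would need extra care for a \r\n pair split across two chunks.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: count line terminators with a regex.
class RegexLineCount {

    // "\r\n" is listed first so that a Windows line ending counts once, not twice.
    private static final Pattern TERMINATOR = Pattern.compile("\\r\\n|\\r|\\n");

    static long countTerminators(CharSequence text) {
        Matcher m = TERMINATOR.matcher(text);
        long count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }
}
```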

Answer 14: (score: 0)

Best optimized code for multi-line files that have no newline ('\n') character at EOF.

/**
 * 
 * @param filename
 * @return
 * @throws IOException
 */
public static int countLines(String filename) throws IOException {
    int count = 0;
    boolean empty = true;
    FileInputStream fis = null;
    InputStream is = null;
    try {
        fis = new FileInputStream(filename);
        is = new BufferedInputStream(fis);
        byte[] c = new byte[1024];
        int readChars = 0;
        boolean isLine = false;
        while ((readChars = is.read(c)) != -1) {
            empty = false;
            for (int i = 0; i < readChars; ++i) {
                if ( c[i] == '\n' ) {
                    isLine = false;
                    ++count;
                }else if(!isLine && c[i] != '\n' && c[i] != '\r'){   //Case to handle line count where no New Line character present at EOF
                    isLine = true;
                }
            }
        }
        if(isLine){
            ++count;
        }
    }catch(IOException e){
        e.printStackTrace();
    }finally {
        if(is != null){
            is.close();    
        }
        if(fis != null){
            fis.close();    
        }
    }
    LOG.info("count: "+count);
    return (count == 0 && !empty) ? 1 : count;
}

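Answer 15: (score: 0)

The only way to know how many lines there are in a file is to count them. You can of course create a metric from your data that gives you the average length of a line, then take the file size and divide it by that average length, but that won't be accurate.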
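
The estimate described here could be sketched as follows, assuming that the first part of the file is representative of the rest. All names are hypothetical, and the result is only an approximation, as the answer says.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: estimate the line count from a sample of the file.
class LineCountEstimate {

    // fileSize / (average line length in the sample); returns -1 if the
    // sample contains no newline and therefore gives no basis for an estimate.
    static long estimate(long fileSizeBytes, long sampleBytes, long sampleNewlines) {
        if (sampleBytes <= 0 || sampleNewlines <= 0) {
            return -1;
        }
        double avgLineLength = (double) sampleBytes / sampleNewlines;
        return Math.round(fileSizeBytes / avgLineLength);
    }

    // Samples up to sampleSize bytes from the start of the file.
    static long estimateLines(Path file, int sampleSize) throws IOException {
        byte[] buf = new byte[sampleSize];
        int total = 0;
        try (InputStream is = Files.newInputStream(file)) {
            int n;
            while (total < sampleSize && (n = is.read(buf, total, sampleSize - total)) != -1) {
                total += n;
            }
        }
        long newlines = 0;
        for (int i = 0; i < total; i++) {
            if (buf[i] == '\n') {
                newlines++;
            }
        }
        return estimate(Files.size(file), total, newlines);
    }
}
```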

Answer 16: (score: 0)

Scanner with a regex:

public int getLineCount() {
    Scanner fileScanner = null;
    int lineCount = 0;
    Pattern lineEndPattern = Pattern.compile("(?m)$");  
    try {
        fileScanner = new Scanner(new File(filename)).useDelimiter(lineEndPattern);
        while (fileScanner.hasNext()) {
            fileScanner.next();
            ++lineCount;
        }   
    }catch(FileNotFoundException e) {
        e.printStackTrace();
        return lineCount;
    }
    fileScanner.close();
    return lineCount;
}

Not timed.

Answer 17: (score: 0)

On Unix-based systems, use the wc command on the command line.

Answer 18: (score: -2)

If you use this

public int countLines(String filename) throws IOException {
    LineNumberReader reader  = new LineNumberReader(new FileReader(filename));
    int cnt = 0;
    String lineRead = "";
    while ((lineRead = reader.readLine()) != null) {}

    cnt = reader.getLineNumber(); 
    reader.close();
    return cnt;
}

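you can't run it on a large number of rows, such as 100K rows, because the value returned by reader.getLineNumber is an int. You need a long data type to handle the maximum number of rows.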
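
For reference, an int can hold values up to Integer.MAX_VALUE (2,147,483,647), so overflow only becomes a concern for files with more lines than that. If that is a real possibility, a variant with a long counter could be sketched as follows; the class and method names are hypothetical, and this version counts '\n' bytes rather than using LineNumberReader.

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical sketch: newline counting with a long counter.
class LongLineCount {

    // Counts '\n' bytes in the stream; the caller is responsible for closing it.
    static long countNewlines(InputStream in) throws IOException {
        byte[] buf = new byte[8192];
        long count = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                if (buf[i] == '\n') {
                    count++;
                }
            }
        }
        return count;
    }
}
```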