Java: Insert a newline after every 309th character

Asked: 2010-08-03 16:09:59

Tags: java split newline

Let me preface this by saying that I'm new to Java.

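I have a file that contains a single line. The file is about 200MB. I need to insert a newline after every 309th character. I believe I have code that does this correctly, but I keep running into out-of-memory errors. I've tried increasing the heap space, to no avail.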

Is there a less memory-intensive way to handle this?

BufferedReader r = new BufferedReader(new FileReader(fileName));

String line;

while ((line=r.readLine()) != null) {
  System.out.println(line.replaceAll("(.{309})", "$1\n"));
}

7 answers:

Answer 0 (score: 15)

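Your code has two problems:

  1. You're loading the whole file into memory at once, since it is a single line, so you need at least 200MB of heap space; and

  2. Using a regular expression like that to add newlines is a really inefficient approach. A straightforward coded solution will be an order of magnitude faster.

    Both of these problems are easily fixed.

    Use a FileReader and a FileWriter to load 309 characters at a time, append a newline and write them out.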

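    Update: added tests of both character-by-character and buffered reading. The buffered read actually adds a lot of complexity, because you need to cater for the possible (but usually very rare) case where read() returns fewer characters than you asked for even though there are still characters left to read.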

    First, the simple version:

    private static void charRead(boolean verifyHash) {
      Reader in = null;
      Writer out = null;
      long start = System.nanoTime();
      long wrote = 0;
      MessageDigest md = null;
      try {
        if (verifyHash) {
          md = MessageDigest.getInstance("SHA1");
        }
        in = new BufferedReader(new FileReader(IN_FILE));
        out = new BufferedWriter(new FileWriter(CHAR_FILE));
        int count = 0;
        for (int c = in.read(); c != -1; c = in.read()) {
          if (verifyHash) {
            md.update((byte) c);
          }
          out.write(c);
          wrote++;
          if (++count >= COUNT) {
            if (verifyHash) {
              md.update((byte) '\n');
            }
            out.write("\n");
            wrote++;
            count = 0;
          }
        }
      } catch (IOException e) {
        throw new RuntimeException(e);
      } catch (NoSuchAlgorithmException e) {
        throw new RuntimeException(e);
      } finally {
        safeClose(in);
        safeClose(out);
        long end = System.nanoTime();
        System.out.printf("Created %s size %,d in %,.3f seconds. Hash: %s%n",
            CHAR_FILE, wrote, (end - start) / 1000000000.0d, hash(md, verifyHash));
      }
    }
    

    And the "block" version:

    private static void blockRead(boolean verifyHash) {
      Reader in = null;
      Writer out = null;
      long start = System.nanoTime();
      long wrote = 0;
      MessageDigest md = null;
      try {
        if (verifyHash) {
          md = MessageDigest.getInstance("SHA1");
        }
        in = new BufferedReader(new FileReader(IN_FILE));
        out = new BufferedWriter(new FileWriter(BLOCK_FILE));
        char[] buf = new char[COUNT + 1]; // leave a space for the newline
        int lastRead = in.read(buf, 0, COUNT); // read in 309 chars at a time
        while (lastRead != -1) { // end of file
          // technically less than 309 characters may have been read
          // this is very unusual but possible so we need to keep
          // reading until we get all the characters we want
          int totalRead = lastRead;
          while (totalRead < COUNT) {
            lastRead = in.read(buf, totalRead, COUNT - totalRead);
            if (lastRead == -1) {
              break;
            } else {
              // advance by the number of characters actually read
              totalRead += lastRead;
            }
          }
    
          // if we get -1, it'll eventually signal an exit but first
          // we must write any characters we have read
          // note: a newline is appended only after a full block of COUNT
          // characters; a shorter trailing block is written without one,
          // which matches charRead()
          if (totalRead == COUNT) {
            buf[totalRead++] = '\n';
          }
          if (totalRead > 0) {
            out.write(buf, 0, totalRead);
            if (verifyHash) {
              md.update(new String(buf, 0, totalRead).getBytes("UTF-8"));
            }
            wrote += totalRead;
          }
    
          // don't try and read again if we've already hit EOF
          if (lastRead != -1) {
            lastRead = in.read(buf, 0, COUNT);
          }
        }
      } catch (IOException e) {
        throw new RuntimeException(e);
      } catch (NoSuchAlgorithmException e) {
        throw new RuntimeException(e);
      } finally {
        safeClose(in);
        safeClose(out);
        long end = System.nanoTime();
        System.out.printf("Created %s size %,d in %,.3f seconds. Hash: %s%n",
            CHAR_FILE, wrote, (end - start) / 1000000000.0d, hash(md, verifyHash));
      }
    }
    

    And a method to create the test file:

    private static void createFile() {
      Writer out = null;
      long start = System.nanoTime();
      try {
        out = new BufferedWriter(new FileWriter(IN_FILE));
        Random r = new Random();
        for (int i = 0; i < SIZE; i++) {
          out.write(CHARS[r.nextInt(CHARS.length)]);
        }
      } catch (IOException e) {
        throw new RuntimeException(e);
      } finally {
        safeClose(out);
        long end = System.nanoTime();
        System.out.printf("Created %s size %,d in %,.3f seconds%n",
          IN_FILE, SIZE, (end - start) / 1000000000.0d);
      }
    }
    

    These all assume:

    private static final int SIZE = 200000000;
    private static final int COUNT = 309;
    private static final char[] CHARS;
    private static final char[] BYTES = new char[]{'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'};
    private static final String IN_FILE = "E:\\temp\\in.dat";
    private static final String CHAR_FILE = "E:\\temp\\char.dat";
    private static final String BLOCK_FILE = "E:\\temp\\block.dat";
    
    static {
      char[] chars = new char[1000];
      int nchars = 0;
      for (char c = 'a'; c <= 'z'; c++) {
        chars[nchars++] = c;
        chars[nchars++] = Character.toUpperCase(c);
      }
      for (char c = '0'; c <= '9'; c++) {
        chars[nchars++] = c;
      }
      chars[nchars++] = ' ';
      CHARS = new char[nchars];
      System.arraycopy(chars, 0, CHARS, 0, nchars);
    }
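
The listings above call safeClose(...) and hash(...), which the answer does not show. The following is only a sketch of what they might look like: the class name TestHelpers, the HEX table and the toHex method are assumptions made for illustration, not the author's code. The output format ("0x" followed by hex, or "(not calculated)") is taken from the results reported in this answer. In the original program these would presumably be static methods in the same class as charRead and blockRead, alongside the java.io, java.security and java.util imports those methods need.

```java
import java.io.Closeable;
import java.io.IOException;
import java.security.MessageDigest;

// Hypothetical helpers; names other than safeClose and hash are assumptions.
final class TestHelpers {
  private static final char[] HEX = "0123456789abcdef".toCharArray();

  // Close a stream quietly; null is accepted so this is safe in a finally block.
  static void safeClose(Closeable c) {
    if (c != null) {
      try {
        c.close();
      } catch (IOException ignored) {
        // nothing useful can be done if close() itself fails
      }
    }
  }

  // Render bytes as "0x" followed by lowercase hex digits.
  static String toHex(byte[] bytes) {
    StringBuilder sb = new StringBuilder("0x");
    for (byte b : bytes) {
      sb.append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
    }
    return sb.toString();
  }

  // Return the digest as hex, or a placeholder when hashing was skipped.
  static String hash(MessageDigest md, boolean verifyHash) {
    if (!verifyHash || md == null) {
      return "(not calculated)";
    }
    return toHex(md.digest());
  }
}
```

Swallowing the exception in safeClose keeps the finally blocks short, at the cost of hiding a failed close on the writer; a stricter version could rethrow it.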
    

    Running this test:

    public static void main(String[] args) {
      if (!new File(IN_FILE).exists()) {
        createFile();
      }
      charRead(true);
      charRead(true);
      charRead(false);
      charRead(false);
      blockRead(true);
      blockRead(true);
      blockRead(false);
      blockRead(false);
    }
    

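    gives this result (Intel Q9450, Windows 7 64-bit, 8GB RAM, test run on a 7200rpm 1.5TB drive). Every row is labelled char.dat because blockRead also prints CHAR_FILE; the first four rows are the charRead runs and the last four are the blockRead runs: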

    Created E:\temp\char.dat size 200,647,249 in 29.690 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
    Created E:\temp\char.dat size 200,647,249 in 18.177 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
    Created E:\temp\char.dat size 200,647,249 in 7.911 seconds. Hash: (not calculated)
    Created E:\temp\char.dat size 200,647,249 in 7.867 seconds. Hash: (not calculated)
    Created E:\temp\char.dat size 200,647,249 in 8.018 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
    Created E:\temp\char.dat size 200,647,249 in 7.949 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
    Created E:\temp\char.dat size 200,647,249 in 3.958 seconds. Hash: (not calculated)
    Created E:\temp\char.dat size 200,647,249 in 3.909 seconds. Hash: (not calculated)
    

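    Conclusion: the SHA1 hash verification is really expensive, which is why I ran versions with and without it. Basically, after warming up, the "efficient" version is only about 2x as fast. I guess by that point the file is effectively in memory.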

    If I reverse the order of the block and char reads, the result is:

    Created E:\temp\char.dat size 200,647,249 in 8.071 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
    Created E:\temp\char.dat size 200,647,249 in 8.087 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
    Created E:\temp\char.dat size 200,647,249 in 4.128 seconds. Hash: (not calculated)
    Created E:\temp\char.dat size 200,647,249 in 3.918 seconds. Hash: (not calculated)
    Created E:\temp\char.dat size 200,647,249 in 18.020 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
    Created E:\temp\char.dat size 200,647,249 in 17.953 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
    Created E:\temp\char.dat size 200,647,249 in 7.879 seconds. Hash: (not calculated)
    Created E:\temp\char.dat size 200,647,249 in 8.016 seconds. Hash: (not calculated)
    

    Interestingly, the character-by-character version takes a much bigger initial hit on the first read of the file.

    So, as usual, it's a choice between efficiency and simplicity.

Answer 1 (score: 2)

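Open the file and read one character at a time, writing each character out to wherever it needs to go. Keep a counter, and each time the counter reaches 309, write out a newline and reset the counter to zero.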
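
The approach described above could be sketched roughly as follows. The class name CounterWrap and the methods wrap and wrapString are made up for this illustration; for a real 200MB file you would pass a BufferedReader over a FileReader and a BufferedWriter over a FileWriter, with a width of 309.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical sketch of the read-one-character-and-count approach.
final class CounterWrap {
  // Copy in to out, writing '\n' after every width-th character.
  static void wrap(Reader in, Writer out, int width) throws IOException {
    int count = 0;
    for (int c = in.read(); c != -1; c = in.read()) {
      out.write(c);
      if (++count == width) {
        out.write('\n');
        count = 0; // start counting the next line
      }
    }
  }

  // Convenience wrapper for trying the logic on a small in-memory string.
  static String wrapString(String s, int width) {
    StringWriter out = new StringWriter();
    try {
      wrap(new StringReader(s), out, width);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    // "abcdefgh" with width 3 becomes three lines: abc, def, gh
    System.out.println(wrapString("abcdefgh", 3));
  }
}
```

Only one character and one counter are held at a time, so memory use does not depend on the file size.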

Answer 2 (score: 2)

Read into a byte array of length 309, then write out the bytes that were read:

   import java.io.*;



   public class Test {
      public static void main(String[] args) throws Exception  {
         InputStream in = null;
         byte[] chars = new byte[309];
         try   {
            in = new FileInputStream(args[0]);
            int read = 0;

            while((read = in.read(chars)) != -1)   {
               System.out.write(chars, 0, read);
               System.out.println("");
            }
         }finally {
            if(in != null)  {
               in.close();
            }
         }
      }

   }

Answer 3 (score: 1)

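Not sure how good this solution is, but you can always just read it character by character.

  1. Read in 309 characters and write them to a file. Not sure whether you can do this in one go or whether you have to do it one character at a time
  2. Once the 309th character has been output, write a newline to the file
  3. Repeat

For example (based on an example found online):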

    FileInputStream fis = new FileInputStream(file);
    int b;
    int counter = 0;
    // read() returns -1 at end of file
    while ((b = fis.read()) != -1) {
       char current = (char) b; // assumes one byte per character
       counter++;
       // output current to file
       if (counter % 309 == 0) {
          // output newline character
       }
    }
    

Answer 4 (score: 1)

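Don't use a BufferedReader; it keeps a large chunk of the underlying file in memory. Use the FileReader directly, and then use the read() method to get just as much data as you need: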

FileReader reader = new FileReader(fileName);
char[] buffer = new char[309];
int charsRead = 0;

while ((charsRead = reader.read(buffer, 0, buffer.length)) == buffer.length)
{
    System.out.println(new String(buffer));
}
if (charsRead > 0)
{
     // print any trailing chars
     System.out.println(new String(buffer, 0, charsRead));
}

Answer 5 (score: 0)

Wrap the file reader in a BufferedReader, then keep looping, reading 309 characters at a time.

Something like this (untested):

BufferedReader r = new BufferedReader(new FileReader("yourfile.txt"), 1024);
boolean done = false;
char[] buffer = new char[309];
while(!done)
{
   int read = r.read(buffer,0,309);
   if(read > 0)
   {
     //write buffer to destination, appending newline
   }
   else
   {
        done = true;
   }
}
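
The loop above leaves the write step as a comment and assumes each read() fills the buffer. A fuller sketch, under the assumption that a short final chunk should not get a trailing newline, might look like the following. The names BlockWrap, fill, wrap and wrapString are invented for this example, and the FilterReader that returns at most two characters per call is only there to exercise the short-read path.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

// Hypothetical sketch of block reading that tolerates short reads.
final class BlockWrap {
  // Read until buf holds len chars or EOF is hit; returns the count read (0 at EOF).
  static int fill(Reader in, char[] buf, int len) throws IOException {
    int total = 0;
    while (total < len) {
      int n = in.read(buf, total, len - total);
      if (n == -1) {
        break;
      }
      total += n; // advance by what was actually read
    }
    return total;
  }

  // Copy in to out in blocks of width chars, adding '\n' after each full block.
  static void wrap(Reader in, Writer out, int width) throws IOException {
    char[] buf = new char[width];
    int n;
    while ((n = fill(in, buf, width)) > 0) {
      out.write(buf, 0, n);
      if (n == width) {
        out.write('\n'); // assumption: no newline after a short final block
      }
    }
  }

  // Try the logic on a string, through a reader that returns at most 2 chars per call.
  static String wrapString(String s, int width) {
    Reader slow = new FilterReader(new StringReader(s)) {
      @Override
      public int read(char[] b, int off, int len) throws IOException {
        return super.read(b, off, Math.min(len, 2));
      }
    };
    StringWriter out = new StringWriter();
    try {
      wrap(slow, out, width);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toString();
  }
}
```

Calling BlockWrap.wrapString("abcdefgh", 3) yields the lines abc, def and gh even though no single read returns a full block.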

Answer 6 (score: 0)

You could change your program to:

 BufferedReader r = new BufferedReader(new FileReader(fileName));
 char[] data = new char[309];
 int n;

 // print only the characters actually read; println adds the newline
 while ((n = r.read(data, 0, 309)) > 0) {
     System.out.println(new String(data, 0, n));
 }

This is off the top of my head and untested.