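Let me start by saying that I'm new to Java.
I have a file that contains a single line. The file is about 200MB. I need to insert a newline character after every 309th character. I believe I have code that does this correctly, but I keep running into out-of-memory errors. I have tried increasing the heap space, to no avail.
Is there a less memory-intensive way of handling this?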
BufferedReader r = new BufferedReader(new FileReader(fileName));
String line;
while ((line = r.readLine()) != null) {
    System.out.println(line.replaceAll("(.{309})", "$1\n"));
}
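Answer 0 (score: 15)
Your code has two problems:
You are loading the whole file into memory at once, since it is a single line, so you need at least 200MB of heap space; and
Using a regular expression like that to add the newlines is a really inefficient approach. A straightforward code solution will be an order of magnitude faster.
Both of these problems are easily fixed.
Use a FileReader and a FileWriter to load 309 characters at a time, append a newline, and write that out.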
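Update: added a test of both character-by-character and buffered reading. The buffered read actually adds a lot of complexity, because you need to cater for the possible (but generally very rare) situation where read() returns fewer characters than you asked for even though there are still characters left to read.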
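The short-read case described above can be isolated in a small helper. The following is a minimal sketch and not part of the original answer; the class and method names (`ReadHelper`, `fill`) are hypothetical. It keeps calling `Reader.read(char[], int, int)` until the buffer is full or the stream ends, and returns how many characters were actually stored.

```java
import java.io.IOException;
import java.io.Reader;

public class ReadHelper {
    // Hypothetical helper: fills buf as far as possible and returns the number
    // of chars stored. A result smaller than buf.length means end of stream.
    public static int fill(Reader in, char[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n == -1) {
                break; // end of stream
            }
            total += n; // advance by the number of chars actually read
        }
        return total;
    }
}
```

With a helper like this, the calling loop only has to distinguish a full block, a shorter final block, and an empty result.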
First the simple version:
private static void charRead(boolean verifyHash) {
    Reader in = null;
    Writer out = null;
    long start = System.nanoTime();
    long wrote = 0;
    MessageDigest md = null;
    try {
        if (verifyHash) {
            md = MessageDigest.getInstance("SHA1");
        }
        in = new BufferedReader(new FileReader(IN_FILE));
        out = new BufferedWriter(new FileWriter(CHAR_FILE));
        int count = 0;
        for (int c = in.read(); c != -1; c = in.read()) {
            if (verifyHash) {
                md.update((byte) c);
            }
            out.write(c);
            wrote++;
            if (++count >= COUNT) {
                if (verifyHash) {
                    md.update((byte) '\n');
                }
                out.write("\n");
                wrote++;
                count = 0;
            }
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    } catch (NoSuchAlgorithmException e) {
        throw new RuntimeException(e);
    } finally {
        safeClose(in);
        safeClose(out);
        long end = System.nanoTime();
        System.out.printf("Created %s size %,d in %,.3f seconds. Hash: %s%n",
                CHAR_FILE, wrote, (end - start) / 1000000000.0d, hash(md, verifyHash));
    }
}
And the "block" version:
private static void blockRead(boolean verifyHash) {
    Reader in = null;
    Writer out = null;
    long start = System.nanoTime();
    long wrote = 0;
    MessageDigest md = null;
    try {
        if (verifyHash) {
            md = MessageDigest.getInstance("SHA1");
        }
        in = new BufferedReader(new FileReader(IN_FILE));
        out = new BufferedWriter(new FileWriter(BLOCK_FILE));
        char[] buf = new char[COUNT + 1]; // leave a space for the newline
        int lastRead = in.read(buf, 0, COUNT); // read in 309 chars at a time
        while (lastRead != -1) { // -1 signals end of file
            // technically fewer than 309 characters may have been read
            // this is very unusual but possible so we need to keep
            // reading until we get all the characters we want
            int totalRead = lastRead;
            while (totalRead < COUNT) {
                lastRead = in.read(buf, totalRead, COUNT - totalRead);
                if (lastRead == -1) {
                    break;
                } else {
                    totalRead += lastRead; // advance by the number of chars actually read
                }
            }
            // if we get -1, it'll eventually signal an exit but first
            // we must write any characters we have read
            // note: a newline is appended only after a full block of
            // COUNT characters; a shorter trailing block is written
            // without one, which may not be what you want
            if (totalRead == COUNT) {
                buf[totalRead++] = '\n';
            }
            if (totalRead > 0) {
                out.write(buf, 0, totalRead);
                if (verifyHash) {
                    md.update(new String(buf, 0, totalRead).getBytes("UTF-8"));
                }
                wrote += totalRead;
            }
            // don't try and read again if we've already hit EOF
            if (lastRead != -1) {
                lastRead = in.read(buf, 0, COUNT);
            }
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    } catch (NoSuchAlgorithmException e) {
        throw new RuntimeException(e);
    } finally {
        safeClose(in);
        safeClose(out);
        long end = System.nanoTime();
        System.out.printf("Created %s size %,d in %,.3f seconds. Hash: %s%n",
                BLOCK_FILE, wrote, (end - start) / 1000000000.0d, hash(md, verifyHash));
    }
}
And a method to create a test file:
private static void createFile() {
    Writer out = null;
    long start = System.nanoTime();
    try {
        out = new BufferedWriter(new FileWriter(IN_FILE));
        Random r = new Random();
        for (int i = 0; i < SIZE; i++) {
            out.write(CHARS[r.nextInt(CHARS.length)]);
        }
    } catch (IOException e) {
        throw new RuntimeException(e);
    } finally {
        safeClose(out);
        long end = System.nanoTime();
        System.out.printf("Created %s size %,d in %,.3f seconds%n",
                IN_FILE, SIZE, (end - start) / 1000000000.0d);
    }
}
These all assume:
private static final int SIZE = 200000000;
private static final int COUNT = 309;
private static final char[] CHARS;
private static final char[] BYTES = new char[]{'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'};
private static final String IN_FILE = "E:\\temp\\in.dat";
private static final String CHAR_FILE = "E:\\temp\\char.dat";
private static final String BLOCK_FILE = "E:\\temp\\block.dat";

static {
    // build the alphabet used for the test file: a-z, A-Z, 0-9 and space
    char[] chars = new char[1000];
    int nchars = 0;
    for (char c = 'a'; c <= 'z'; c++) {
        chars[nchars++] = c;
        chars[nchars++] = Character.toUpperCase(c);
    }
    for (char c = '0'; c <= '9'; c++) {
        chars[nchars++] = c;
    }
    chars[nchars++] = ' ';
    CHARS = new char[nchars];
    System.arraycopy(chars, 0, CHARS, 0, nchars);
}
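The listings also call two helpers, `safeClose` and `hash`, whose bodies are not shown. A plausible sketch follows. Both bodies are assumptions reconstructed from how the methods are called and from the output format (`0x…` or `(not calculated)`), not the author's code, and the class name `Helpers` is hypothetical.

```java
import java.io.Closeable;
import java.io.IOException;
import java.security.MessageDigest;

public class Helpers {
    // Hex digits, mirroring the BYTES constant in the listing above.
    private static final char[] BYTES = "0123456789abcdef".toCharArray();

    // Assumed behavior: close quietly, ignoring null and close failures.
    public static void safeClose(Closeable c) {
        if (c != null) {
            try {
                c.close();
            } catch (IOException e) {
                // nothing useful can be done at this point
            }
        }
    }

    // Assumed behavior: render the digest as 0x-prefixed lowercase hex,
    // or a placeholder when hashing was switched off.
    public static String hash(MessageDigest md, boolean verifyHash) {
        if (!verifyHash || md == null) {
            return "(not calculated)";
        }
        StringBuilder sb = new StringBuilder("0x");
        for (byte b : md.digest()) {
            sb.append(BYTES[(b >> 4) & 0x0f]).append(BYTES[b & 0x0f]);
        }
        return sb.toString();
    }
}
```

Accepting `Closeable` lets one method serve both the `Reader` and the `Writer`, since both implement that interface.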
Running this test:
public static void main(String[] args) {
    if (!new File(IN_FILE).exists()) {
        createFile();
    }
    charRead(true);
    charRead(true);
    charRead(false);
    charRead(false);
    blockRead(true);
    blockRead(true);
    blockRead(false);
    blockRead(false);
}
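gives this result (Intel Q9450, Windows 7 64-bit, 8GB RAM, test run on a 7200rpm 1.5TB drive). The first four lines are the charRead runs and the last four are the blockRead runs; every line is labelled char.dat because the run that produced this output printed CHAR_FILE in both methods: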
Created E:\temp\char.dat size 200,647,249 in 29.690 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 18.177 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 7.911 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 7.867 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 8.018 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 7.949 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 3.958 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 3.909 seconds. Hash: (not calculated)
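Conclusion: the SHA1 hash verification is really expensive, which is why I ran versions both with and without it. Basically, after warm-up the "efficient" block version is only about 2x as fast. I guess by this time the file is effectively in memory.
If I reverse the order so the block reads run before the char reads, the result is: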
Created E:\temp\char.dat size 200,647,249 in 8.071 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 8.087 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 4.128 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 3.918 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 18.020 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 17.953 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 7.879 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 8.016 seconds. Hash: (not calculated)
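It's interesting that the character-by-character version takes a much bigger initial hit on the first read of the file.
So, as usual, it's a choice between efficiency and simplicity.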
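Answer 1 (score: 2)
Open it and read one character at a time, writing each character to wherever it needs to go. Keep a counter, and every time the counter reaches the line width, write out a newline and reset the counter to zero.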
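A minimal sketch of that counter approach is below. It is an illustration rather than code from this answer; the class name `LineBreaker`, the method name `insertBreaks`, and the `width` parameter are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

public class LineBreaker {
    // Copies in to out, writing '\n' after every `width` characters.
    // Only one character is held at a time, so memory use is constant.
    public static void insertBreaks(Reader in, Writer out, int width) throws IOException {
        int counter = 0;
        int c;
        while ((c = in.read()) != -1) {
            out.write(c);
            if (++counter == width) {
                out.write('\n');
                counter = 0;
            }
        }
    }
}
```

For a real file you would probably pass a `BufferedReader` wrapped around a `FileReader` and a `BufferedWriter` wrapped around a `FileWriter`, with a width of 309, so that the single-character calls do not each hit the disk.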
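Answer 2 (score: 2)
Read into a byte array of length 309, then write out the bytes that were read:
import java.io.*;

public class Test {
    public static void main(String[] args) throws Exception {
        InputStream in = null;
        byte[] chars = new byte[309];
        try {
            in = new FileInputStream(args[0]);
            int read;
            // readNBytes (Java 9+) keeps reading until the array is full or EOF,
            // so a short read cannot put a newline in the wrong place
            while ((read = in.readNBytes(chars, 0, chars.length)) > 0) {
                System.out.write(chars, 0, read);
                System.out.println();
            }
        } finally {
            if (in != null) {
                in.close();
            }
        }
    }
}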
Answer 3 (score: 1)
I'm not sure how good this solution is, but you can always read the file character by character.
For example (using this site):
FileInputStream fis = new FileInputStream(file);
int current;
int counter = 0;
// read() returns -1 at end of file
while ((current = fis.read()) != -1) {
    counter++;
    // output (char) current to file
    if (counter % 309 == 0) {
        // output newline character
    }
}
fis.close();
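Answer 4 (score: 1)
Don't use BufferedReader.readLine() here; with a single-line file it has to hold the whole line in memory. You can use FileReader directly and call its read() method to get just the data you need:
FileReader reader = new FileReader(fileName);
char[] buffer = new char[309];
int filled = 0;
int charsRead;
// read() may return fewer chars than requested, so keep filling the buffer
while ((charsRead = reader.read(buffer, filled, buffer.length - filled)) != -1)
{
    filled += charsRead;
    if (filled == buffer.length)
    {
        System.out.println(new String(buffer));
        filled = 0;
    }
}
if (filled > 0)
{
    // print any trailing chars
    System.out.println(new String(buffer, 0, filled));
}
reader.close();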
Answer 5 (score: 0)
Wrap the FileReader in a BufferedReader, then keep looping, reading 309 characters at a time.
Something like this (untested):
BufferedReader r = new BufferedReader(new FileReader("yourfile.txt"), 1024);
boolean done = false;
char[] buffer = new char[309];
while (!done)
{
    int read = r.read(buffer, 0, 309);
    if (read > 0)
    {
        // write the first `read` chars of buffer to the destination, appending a newline
    }
    else
    {
        done = true;
    }
}
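Answer 6 (score: 0)
You could change your program to:
BufferedReader r = new BufferedReader(new FileReader(fileName));
char[] data = new char[309];
int n;
while ((n = r.read(data, 0, data.length)) > 0) {
    // print only the chars actually read; println supplies the newline
    System.out.println(new String(data, 0, n));
}
r.close();
This is off the top of my head and untested.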