Question

My application has the following code, which does two things:

Parses a file containing 'n' records.

Makes two web service calls for each record in the file.

 public static List<String> parseFile(String fileName) {
   List<String> idList = new ArrayList<String>();
   try {
     BufferedReader cfgFile = new BufferedReader(new FileReader(new File(fileName)));
     String line = null;
     cfgFile.readLine();
     while ((line = cfgFile.readLine()) != null) {
       if (!line.trim().equals("")) {
         String [] fields = line.split("\\|"); 
         idList.add(fields[0]);
       } 
     } 
     cfgFile.close();
   } catch (IOException e) {
     System.out.println(e+" Unexpected File IO Error.");
   }
   return idList;
 }

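When I try to parse a file with 1 million records, the Java process fails after processing a certain amount of data, and I get a java.lang.OutOfMemoryError: Java heap space error. I can partly work out that the Java process stops because of the huge amount of data it is given. Please suggest how I should handle data this large.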

EDIT: Does this part of the code, new BufferedReader(new FileReader(new File(fileName)));, parse the whole file, and is it affected by the size of the file?

Answer 1

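The problem is that you are accumulating all the data in the list. The best way to approach this is to process it in a streaming fashion. That means not accumulating all the IDs in the list, but either calling your web service for each line, or accumulating a smaller buffer and then making the call.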

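Opening the file and creating the BufferedReader has no impact on memory consumption, because the bytes from the file are read (more or less) line by line. The problem is at this point in the code: idList.add(fields[0]);. As you keep accumulating data from every line of the file into the list, the list grows roughly in proportion to the file.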

Your code should do something like this:

 while ((line = cfgFile.readLine()) != null) {
   if (!line.trim().equals("")) {
     String [] fields = line.split("\\|"); 
     callToRemoteWebService(fields[0]);
   } 
 }
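The "smaller buffer" variant mentioned above can be sketched as follows. This is a minimal illustration, not the asker's actual code: the class name BatchingParser, the method processInBatches, the batch size, and the sink callback (standing in for the web service call) are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class BatchingParser {

    // Reads pipe-delimited lines, discards the first line (as the question's
    // code does), and hands the first field of each non-empty line to `sink`
    // in batches of at most `batchSize`. Returns the number of IDs handled.
    // `sink` is a hypothetical stand-in for the web service call.
    static int processInBatches(Reader source, int batchSize, Consumer<List<String>> sink)
            throws IOException {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int total = 0;
        List<String> batch = new ArrayList<>(batchSize);
        try (BufferedReader in = new BufferedReader(source)) {
            in.readLine(); // discard the first line
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                // A limit of 2 guarantees at least one element, even for a line like "|".
                batch.add(line.split("\\|", 2)[0]);
                total++;
                if (batch.size() == batchSize) {
                    sink.accept(batch);
                    // Start a fresh buffer so the previous one can be garbage collected.
                    batch = new ArrayList<>(batchSize);
                }
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(batch); // flush the final, partially filled batch
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        String data = "id|name\n1|a\n2|b\n\n3|c\n";
        int n = processInBatches(new StringReader(data), 2,
                ids -> System.out.println("calling service with " + ids));
        System.out.println(n + " ids processed");
    }
}
```

With this shape, memory use is bounded by the batch size rather than by the number of lines in the file.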

Answer 2

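Increase the Java heap size using the -Xms and -Xmx options. If they are not set explicitly, the JVM sets the heap size to an ergonomically chosen default, which in your case is not enough. Read this paper to learn more about memory tuning in the JVM: http://www.oracle.com/technetwork/java/javase/tech/memorymanagement-whitepaper-1-150020.pdf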
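To check which maximum heap size the JVM actually ended up with, a small program like the one below can help. It is only a sketch: the class name HeapInfo is made up for the example, and the sizes in a command such as java -Xms256m -Xmx1g HeapInfo are illustrative values, not a recommendation for your workload.

```java
class HeapInfo {

    // Maximum amount of memory the JVM will attempt to use for the heap, in MiB.
    // Runtime.maxMemory() reflects the effective -Xmx setting (or the default).
    static long maxHeapMiB() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    public static void main(String[] args) {
        System.out.println("Max heap: " + maxHeapMiB() + " MiB");
    }
}
```

Running it once without options and once with -Xmx set shows whether the flag is being picked up by the process you are starting.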

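EDIT: Another way is to do this in a producer-consumer fashion to take advantage of parallel processing. The general idea is to create a producer thread that reads the file and queues tasks for processing, and consumer threads that consume them. A very rough sketch (for illustration purposes) is the following: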

// blocking queue holding the tasks to be executed
final SynchronousQueue<Callable<Void>> queue = // ...

// reads the file and submits tasks for processing
final Runnable producer = new Runnable() {
  public void run() {
    BufferedReader in = null;
    try {
      in = new BufferedReader(new FileReader(new File(fileName)));
      String line = null;
      while ((line = in.readLine()) != null) {
        if (!line.trim().equals("")) {
          // must be final to be captured by the anonymous class below
          final String[] fields = line.split("\\|");
          // this will block if there are no consumer threads available to process it...
          queue.put(new Callable<Void>() {
            public Void call() {
              process(fields);
              return null;
            }
          });
        }
      }
    } catch (IOException e) {
      System.out.println(e + " Unexpected File IO Error.");
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      // close the buffered reader here...
    }
  }
};

// Consumes the tasks submitted by the producer. Consumers can be pooled
// for parallel processing.
final Runnable consumer = new Runnable() {
  public void run() {
    try {
      while (true) {
        // this method blocks if there are no items left for processing in the queue...
        Callable<Void> task = queue.take();
        task.call();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } catch (Exception e) {
      // call() is declared to throw Exception; a failing task stops this consumer
      System.out.println(e + " Task failed.");
    }
  }
};

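Of course, you have to write the code that manages the lifecycle of the consumer and producer threads. The right way to do this would be to implement it using an Executor.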
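One possible shape for the Executor-based version is sketched below. It is an assumption-laden illustration rather than the answer's own implementation: the class name ExecutorPipeline, the run method, the worker count, the queue capacity, and the process callback are all invented for the example. It uses a ThreadPoolExecutor with a bounded queue and CallerRunsPolicy, so that when the queue is full the reading thread runs the task itself instead of piling up more work in memory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

class ExecutorPipeline {

    // Reads pipe-delimited lines and submits one task per non-empty line to a
    // fixed-size pool. `process` is a hypothetical stand-in for the per-record
    // work (for example, the web service calls). Returns the number of tasks
    // submitted; all of them have finished when this method returns normally.
    static int run(Reader source, int workers, int queueCapacity, Consumer<String[]> process)
            throws IOException, InterruptedException {
        ExecutorService pool = new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),    // bounded: limits pending tasks
                new ThreadPoolExecutor.CallerRunsPolicy()); // full queue: reader runs the task
        int submitted = 0;
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue;
                }
                final String[] fields = line.split("\\|");
                pool.execute(() -> process.accept(fields));
                submitted++;
            }
        } finally {
            pool.shutdown(); // stop accepting tasks, let queued ones finish
            pool.awaitTermination(1, TimeUnit.HOURS); // generous illustrative timeout
        }
        return submitted;
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger calls = new AtomicInteger();
        int n = run(new StringReader("1|a\n2|b\n3|c\n"), 2, 4,
                fields -> calls.incrementAndGet());
        System.out.println(n + " submitted, " + calls.get() + " processed");
    }
}
```

The bounded queue matters here: an executor with an unbounded queue would simply move the memory problem from the list into the task queue.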

Answer 3

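If you want to work with large data, you have two options:

Use a heap large enough to hold all the data. This will "work" for a while, but if your data size is unbounded, it will eventually fail.
Process the data incrementally. Keep only part of the data (of bounded size) in memory at any one time. This is the ideal solution, since it scales to any amount of data.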
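As a small illustration of the second option, the sketch below computes an aggregate over a file without ever holding all of its lines. It assumes Java 8 or later and uses Files.lines, which populates its stream lazily as it is consumed; the class name IncrementalCount and the choice to skip the first line (mirroring the question's code) are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Stream;

class IncrementalCount {

    // Counts the non-empty records in a file, skipping the first line.
    // Lines are read lazily, so only a small part of the file is in memory
    // at any one time regardless of the file's size.
    static long countIds(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.skip(1)
                        .filter(l -> !l.trim().isEmpty())
                        .count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("ids", ".txt");
        try {
            Files.write(tmp, Arrays.asList("id|name", "1|a", "", "2|b"));
            System.out.println(countIds(tmp) + " records");
        } finally {
            Files.delete(tmp);
        }
    }
}
```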

Heap size problem - memory management using Java
