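I am testing how Java handles a large file (10,000,100 lines).
I wrote some code that reads the file, starts a specified number of threads (at most the number of CPU cores), and prints the lines of the file to standard output.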
The Main class is as follows:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class Main
{
    public static void main(String[] args) throws IOException, InterruptedException
    {
        // args[0]: path of the input file; args[1]: number of threads, or "MAX" for one per CPU core
        int maxThread;
        List<String> linesForWorker = new ArrayList<String>();
        if ("MAX".equals(args[1]))
            maxThread = Runtime.getRuntime().availableProcessors();
        else
            maxThread = Integer.parseInt(args[1]);
        ExecutorService executor = Executors.newFixedThreadPool(maxThread);
        String readLine;
        Thread.sleep(1000L);
        long startTime = System.nanoTime();
        try (BufferedReader br = new BufferedReader(new FileReader(args[0])))
        {
            // Check for end of file before using the line, so null is never added to a batch
            while ((readLine = br.readLine()) != null)
            {
                if ("X".equals(readLine))
                {
                    // A line containing only "X" closes the current batch; the worker gets its own copy
                    executor.execute(new WorkerThread(new ArrayList<String>(linesForWorker)));
                    linesForWorker.clear(); // Wrote to avoid storing a list with ALL the lines of the file in memory
                }
                else
                {
                    linesForWorker.add(readLine);
                }
            }
            // Note: lines after the last "X" are never submitted to the pool
        }
        executor.shutdown();
        if (executor.awaitTermination(1L, TimeUnit.HOURS))
            System.out.println("END\n\n");
        long endTime = System.nanoTime();
        long durationInNano = endTime - startTime;
        System.out.println("Duration in hours:" + TimeUnit.NANOSECONDS.toHours(durationInNano));
        System.out.println("Duration in minutes:" + TimeUnit.NANOSECONDS.toMinutes(durationInNano));
        System.out.println("Duration in seconds:" + TimeUnit.NANOSECONDS.toSeconds(durationInNano));
        System.out.println("Duration in milliseconds:" + TimeUnit.NANOSECONDS.toMillis(durationInNano));
    }
}
The WorkerThread class is structured as follows:
import java.util.List; // only needed if WorkerThread lives in its own source file

class WorkerThread implements Runnable
{
    private List<String> linesToPrint;
    public WorkerThread(List<String> linesToPrint) { this.linesToPrint = linesToPrint; }
    @Override
    public void run()
    {
        for (String lineToPrint : this.linesToPrint)
        {
            // Prefix each line with the name of the pool thread that prints it
            System.out.println(Thread.currentThread().getName() + ": " + lineToPrint);
        }
        this.linesToPrint = null; // Wrote to help garbage collector know I don't need the object anymore
    }
}
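I ran the application with both "1" and "MAX" (the number of CPU cores, 4 in my case) as the maximum number of threads for the FixedThreadPool, and observed the following:
With a single thread in the FixedThreadPool, the execution time is about 40 minutes.
With 4 threads in the FixedThreadPool, the execution time is about 44 minutes.
Can someone explain this strange (at least to me) behaviour? Why doesn't multithreading help here?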
P.S. My machine has an SSD.
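A scaled-down way to reproduce the comparison without the full 10,000,100-line file is sketched below. It is only a sketch under assumptions: the batches are generated in memory instead of being read from a file, and the names PoolTimingSketch and timeRun are hypothetical, not part of the original program. The worker body mirrors WorkerThread by printing every line, prefixed with the thread name, to one shared PrintStream.

```java
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical harness, not part of the original program.
class PoolTimingSketch
{
    /**
     * Prints `batches` batches of `linesPerBatch` generated lines through a fixed
     * pool of `threads` threads, all writing to the same stream `out`.
     * Returns the elapsed time in nanoseconds.
     */
    static long timeRun(int threads, int batches, int linesPerBatch, PrintStream out)
    {
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int b = 0; b < batches; b++)
        {
            // Assumption: batches are built in memory rather than read from a file
            final List<String> batch = new ArrayList<String>();
            for (int i = 0; i < linesPerBatch; i++)
                batch.add("batch " + b + " line " + i);
            executor.execute(() -> {
                for (String line : batch)
                    out.println(Thread.currentThread().getName() + ": " + line);
            });
        }
        executor.shutdown();
        try
        {
            if (!executor.awaitTermination(1L, TimeUnit.HOURS))
                throw new IllegalStateException("workers did not finish in time");
        }
        catch (InterruptedException e)
        {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args)
    {
        int cores = Runtime.getRuntime().availableProcessors();
        for (int threads : new int[] { 1, cores })
        {
            long nanos = timeRun(threads, 1000, 100, System.out);
            // Timings go to stderr so they are not mixed into the printed lines
            System.err.println(threads + " thread(s): " + TimeUnit.NANOSECONDS.toMillis(nanos) + " ms");
        }
    }
}
```

Running it with standard output redirected to a file and again with output going to the terminal would show how much of the measured time depends on where the lines end up, independently of the thread count.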
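EDIT: I changed the code so that each thread now creates a file and writes its batch of lines to that file on the SSD. The execution time has dropped to about 5 s, but I still see that the single-threaded version of the program runs in about 5292 ms, while the multithreaded (4 threads) version runs in about 5773 ms.
Why does the multithreaded version still take longer? Could it be that each thread, even though it writes to its own "personal" file, has to wait for the other threads to release the SSD before it can access it and write?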
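The modified worker from the edit is not shown in the post. A minimal sketch of what such a worker could look like is given here, assuming one output file per batch; the class name FileWorkerThread and the outputFile parameter are hypothetical and not taken from the original code.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical replacement for WorkerThread: writes a batch to its own file
// instead of printing it to standard output.
class FileWorkerThread implements Runnable
{
    private final List<String> linesToWrite;
    private final Path outputFile; // assumption: the caller picks a distinct path per batch

    FileWorkerThread(List<String> linesToWrite, Path outputFile)
    {
        this.linesToWrite = linesToWrite;
        this.outputFile = outputFile;
    }

    @Override
    public void run()
    {
        // One buffered writer per batch; the file is created or truncated, then closed automatically
        try (BufferedWriter writer = Files.newBufferedWriter(outputFile, StandardCharsets.UTF_8))
        {
            for (String line : linesToWrite)
            {
                writer.write(line);
                writer.newLine();
            }
        }
        catch (IOException e)
        {
            // Runnable.run() cannot throw checked exceptions, so wrap the failure
            throw new UncheckedIOException(e);
        }
    }
}
```

In Main, the call inside the "X" branch would then become something like executor.execute(new FileWorkerThread(new ArrayList<String>(linesForWorker), Paths.get("batch-" + batchNumber + ".txt"))), where batchNumber is a hypothetical counter incremented for each batch.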