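This is for a custom UDTF used in a Hive query. CreateLogTable is the UDTF class, which I am using as a temp for testing. I create one thread for each file to be downloaded from Amazon S3, and I wait for a thread to become available before assigning the next file to a thread.
Main test logic: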
    CreateLogTable CLT = new CreateLogTable();
    int numThreads = 2;
    int index = 0;
    DownloadFileThread[] dlThreads = new DownloadFileThread[numThreads];
    for (S3ObjectSummary oSummary : bucketKeys.getObjectSummaries()) {
        while (dlThreads[index] != null && dlThreads[index].isAlive()) {
            index += 1;
            index = index % numThreads;
        }
        dlThreads[index] = new DownloadFileThread(CLT, getBucket(oSummary.getBucketName() + "/"
                + oSummary.getKey()), getFile(oSummary.getKey()), index);
        dlThreads[index].start();
        index += 1;
        index = index % numThreads;
    }
Thread class (run() method):
    try {
        System.out.println("Creating thread " + this.threadnum);
        this.fileObj = this.S3CLIENT.getObject(new GetObjectRequest(this.filePath, this.fileName));
        this.fileIn = new Scanner(new GZIPInputStream(this.fileObj.getObjectContent()));
        while (this.fileIn.hasNext()) {
            this.parent.forwardToTable(fileIn.nextLine());
        }
        System.out.println("Finished " + this.threadnum);
    } catch (Throwable e) {
        System.out.println("Downloading of " + this.fileName + " failed.");
    }
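The while loop before the thread creation should keep looping until it finds either a null slot or a dead thread. At that point it exits the loop, and a new thread is created and started. Since I log to the console, I can observe this process, but the output is unexpected: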
Creating thread 0
Creating thread 1
Creating thread 0
Creating thread 1
Creating thread 0
Creating thread 1
Creating thread 0
...
Creating thread 1
Creating thread 0
Creating thread 1
Finished 0
Finished 1
Finished 1
Finished 0
Finished 1
Finished 1
...
Finished 0
Finished 1
Finished 0
Finished 1
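The above is only the first few lines of the output. The problem is that more than two threads are created before any thread has finished its task.
Why does this happen, and how can I fix it?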
Answer 0 (score: 3)
Try taking a look at this example:
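    public static void main(String[] args) {
        // Pool of 5 worker threads; extra tasks wait in the executor's queue.
        ExecutorService executor = Executors.newFixedThreadPool(5);
        for (int i = 0; i < 10; i++) {
            // WorkerThread is a Runnable implementation that is not shown here.
            Runnable worker = new WorkerThread("" + i);
            executor.execute(worker);
        }
        // Stop accepting new tasks; tasks already submitted still run.
        executor.shutdown();
        // Spin until every submitted task has finished.
        while (!executor.isTerminated()) {
        }
        System.out.println("Finished all threads");
    }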
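This is a fixed-size thread pool, written here with Java 8 (ExecutorService itself has been available since Java 5). Building the pool with Executors is very simple and a pretty straightforward way to go.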
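The empty while loop in the example keeps a CPU core busy while it waits. A minimal sketch of one alternative, assuming it is acceptable to block the calling thread, is to wait with awaitTermination. The class name PoolShutdownSketch, the helper runAll, and the one-minute timeout are made up for illustration, and the tasks only increment a counter instead of doing real work.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolShutdownSketch {
    // Hypothetical helper: runs taskCount trivial tasks on a pool of
    // poolSize threads and returns how many of them completed.
    static int runAll(int poolSize, int taskCount) {
        ExecutorService executor = Executors.newFixedThreadPool(poolSize);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            executor.execute(completed::incrementAndGet);
        }
        executor.shutdown(); // no new tasks; queued tasks still run
        try {
            // Block until the pool drains instead of spinning on isTerminated().
            // The one-minute limit is an arbitrary choice for this sketch.
            if (!executor.awaitTermination(1, TimeUnit.MINUTES)) {
                executor.shutdownNow();
            }
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("Finished " + runAll(5, 10) + " tasks");
    }
}
```

With real download tasks the timeout would need to be chosen to match how long the work can legitimately take.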
Answer 1 (score: 3)
I cut the code down to this test case:
    public class ThreadTest {
        private static class SleepThread extends Thread {
            private final int index;
            SleepThread(int ii) { index = ii; }
            @Override
            public void run() {
                System.out.println("Creating thread " + this.index);
                try {
                    Thread.sleep(5_000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                System.out.println("Finished " + this.index);
            }
        }
        public static void main(String[] args) {
            int numThreads = 2;
            int index = 0;
            SleepThread[] dlThreads = new SleepThread[numThreads];
            for (int ii = 0; ii < 10; ++ii) {
                while (dlThreads[index] != null && dlThreads[index].isAlive()) {
                    index += 1;
                    index = index % numThreads;
                }
                dlThreads[index] = new SleepThread(index);
                dlThreads[index].start();
                index += 1;
                index = index % numThreads;
            }
        }
    }
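With Sun JDK 1.7.0_75, running it produces the result you would expect: two threads start, they exit after five seconds, two more threads start, and so on.
The next thing I would suspect is that your JVM's Thread.isAlive() implementation does not return true immediately after the thread is started, although that seems contrary to the documentation of the Thread class.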
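A quick way to check that suspicion on a given JVM is to probe isAlive() directly. The following is only a sketch: the class name IsAliveSketch and the helper probe are invented for illustration, and a CountDownLatch holds the worker open so that it cannot finish before the first check.

```java
import java.util.concurrent.CountDownLatch;

public class IsAliveSketch {
    // Hypothetical helper: returns { isAlive() right after start(),
    // isAlive() after join() } for a single worker thread.
    static boolean[] probe() {
        CountDownLatch release = new CountDownLatch(1);
        Thread worker = new Thread(() -> {
            try {
                release.await(); // keep running until the caller has checked
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        boolean aliveAfterStart = worker.isAlive();
        release.countDown(); // let the worker finish
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        boolean aliveAfterJoin = worker.isAlive();
        return new boolean[] { aliveAfterStart, aliveAfterJoin };
    }

    public static void main(String[] args) {
        boolean[] result = probe();
        System.out.println("alive after start(): " + result[0]);
        System.out.println("alive after join():  " + result[1]);
    }
}
```

According to the Thread documentation, a thread is alive once it has been started and has not yet terminated, so the first value should be true and the second false. Any other result would point at the runtime environment rather than at the polling loop.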
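Answer 2 (score: 1)
The reason the above code did not work properly is that something quirky happens when isAlive() is called.
For some reason, no matter what state the thread is in, isAlive() always returns false for me. As a result, more and more threads are created, and they replace the old threads in the array, dlThreads.
I solved this by creating a custom isWorking() method, which simply returns a boolean telling whether the thread's run() method is still in progress. Here is what the Thread class looks like now: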
    //this.isWorking initialized to true during instantiation
    @Override
    public void run() {
        try {
            System.out.println("Creating thread " + this.threadnum + " for " + filePath + "/" + fileName);
            this.fileObj = this.S3CLIENT.getObject(new GetObjectRequest(this.filePath, this.fileName));
            this.fileIn = new Scanner(new GZIPInputStream(this.fileObj.getObjectContent()));
            while (this.fileIn.hasNext()) {
                this.parent.forwardToTable(fileIn.nextLine());
            }
            System.out.println("Finished " + this.threadnum);
            this.isWorking = false;
        } catch (Throwable e) {
            System.out.println("Downloading of " + this.fileName + " failed.");
            e.printStackTrace();
            this.isWorking = false;
        }
    }
    public boolean isWorking(){
        return this.isWorking;
    }
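One caveat with a hand-rolled flag like this is that it is written by the worker thread and read by the thread that polls it. The declaration of isWorking is not shown above, so the following is only a sketch under the assumption that the field can be declared volatile. FlaggedWorker and demo are hypothetical names, and a plain Runnable stands in for the S3 download. Resetting the flag in a finally block covers both the success and the failure path in one place.

```java
public class FlaggedWorker extends Thread {
    // Assumption: volatile, so the polling thread is guaranteed to see
    // the update made by the worker thread.
    private volatile boolean working = true;
    private final Runnable body; // stand-in for the download and parsing

    FlaggedWorker(Runnable body) {
        this.body = body;
    }

    @Override
    public void run() {
        try {
            body.run();
        } finally {
            working = false; // cleared whether body succeeded or threw
        }
    }

    public boolean isWorking() {
        return working;
    }

    // Hypothetical helper: runs body to completion and returns the flag
    // value before start() and after join().
    static boolean[] demo(Runnable body) {
        FlaggedWorker worker = new FlaggedWorker(body);
        boolean before = worker.isWorking();
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new boolean[] { before, worker.isWorking() };
    }

    public static void main(String[] args) {
        boolean[] flags = demo(() -> System.out.println("working..."));
        System.out.println("before: " + flags[0] + ", after: " + flags[1]);
    }
}
```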
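However, after implementing this and being satisfied that my multithreaded script worked, I switched to using an Executor as other users suggested, which slightly improved performance and made the code cleaner.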
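The Executor-based version is not shown in this thread. As a rough sketch of what such a rewrite might look like, the following caps concurrency with a fixed pool and records the highest number of tasks that ran at the same time. Everything here is illustrative: BoundedDownloadSketch and maxConcurrent are invented names, and a short sleep stands in for downloading and parsing one S3 object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedDownloadSketch {
    // Hypothetical helper: runs fileCount simulated downloads on poolSize
    // threads and returns the peak number of downloads in flight at once.
    static int maxConcurrent(int poolSize, int fileCount) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        AtomicInteger inFlight = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();
        List<Callable<Void>> tasks = new ArrayList<>();
        for (int i = 0; i < fileCount; i++) {
            tasks.add(() -> {
                int now = inFlight.incrementAndGet();
                peak.accumulateAndGet(now, Math::max);
                try {
                    Thread.sleep(20); // stand-in for the S3 download and parsing
                } finally {
                    inFlight.decrementAndGet();
                }
                return null;
            });
        }
        try {
            pool.invokeAll(tasks); // blocks until every task has completed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("Peak concurrent downloads: " + maxConcurrent(2, 10));
    }
}
```

Because the pool owns exactly poolSize threads, the reported peak cannot exceed that number, which is the guarantee the original polling loop was trying to provide by hand.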