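I'm writing a Google App Engine Java application that fetches a large (1 GB) XML file, splits it into its repeated nodes, and writes the contents of each node to a Cloud SQL database.
Fetching and splitting are much faster than the data extraction and DB writes, so I'm trying to parallelize that last part by running several threads. The problem is that the processing threads (when run in parallel) stop writing to the database after a few minutes and do not respond to interrupt requests.
I have a main Import class; my Fetcher, Splitter, and Processor are inner classes that implement Runnable, and I create and start the threads from a method of the main class. The Fetcher writes to a PipedOutputStream; the Splitter reads from the connected PipedInputStream and writes the individual XML nodes as strings to an ArrayBlockingQueue.
I have a ProcessMonitor thread which tells me that, by the time the writes stop, the pipe and the queue are full or filling up fast, which suggests the problem lies in the final processor threads. When I run only one thread for the last step, the process seems to succeed every time (I killed it about 1/3 of the way through because it takes forever).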
public void runImport() {
    // Pipes for fetch->split, queue for split->process
    PipedOutputStream pipedOut = new PipedOutputStream();
    PipedInputStream pipedIn = new PipedInputStream(2000000);
    ArrayBlockingQueue<String> myQueue = new ArrayBlockingQueue<>(100);

    // Monitor resources
    Thread monitorThread = ThreadManager
            .createBackgroundThread(new Monitor(myQueue, pipedOut, pipedIn));
    monitorThread.start();

    Thread fetchThread = ThreadManager
            .createBackgroundThread(new Fetcher(pipedOut));
    Thread splitThread = ThreadManager
            .createBackgroundThread(new QueueWriter(myQueue, pipedOut, pipedIn));

    // Get xpaths for column values
    DbHelper db = new DbHelper();
    Map<String, String> columnXpaths = db
            .getMap("select columnName, xpath from xpath");

    // Create threads for processing row xml strings from queue
    Thread[] insertThreads = new Thread[5];
    for (int i = 0; i < 5; i++) {
        try {
            insertThreads[i] = ThreadManager
                    .createBackgroundThread(new QueueProcessor(myQueue, columnXpaths));
        } catch (Exception ex) {
            // exception is silently ignored; the slot stays null
        }
    }

    // Start threads fetching, splitting, and writing to queue
    fetchThread.start();
    splitThread.start();
    for (Thread t : insertThreads) {
        if (t != null)
            t.start();
    }
}
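The Monitor class itself is not shown above. As a rough illustration of the kind of check it performs, here is a minimal sketch that reports how many bytes are buffered in the pipe and how full the queue is. The class name `PipelineMonitor`, the `snapshot` helper, and the polling interval are hypothetical and not taken from the real code.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical stand-in for the Monitor used in runImport().
public class PipelineMonitor implements Runnable {
    private final ArrayBlockingQueue<String> queue;
    private final PipedInputStream pipedIn;
    private final long intervalMillis;

    public PipelineMonitor(ArrayBlockingQueue<String> queue,
                           PipedInputStream pipedIn, long intervalMillis) {
        this.queue = queue;
        this.pipedIn = pipedIn;
        this.intervalMillis = intervalMillis;
    }

    /** Builds one status line: bytes buffered in the pipe and queue fill level. */
    public static String snapshot(ArrayBlockingQueue<String> queue,
                                  PipedInputStream pipedIn) {
        int buffered;
        try {
            buffered = pipedIn.available();
        } catch (IOException e) {
            buffered = -1; // available() declares IOException; report it as unknown
        }
        int size = queue.size();
        int capacity = size + queue.remainingCapacity(); // approximate under concurrency
        return "pipe=" + buffered + "B queue=" + size + "/" + capacity;
    }

    @Override
    public void run() {
        try {
            // Poll until interrupted, so the monitor itself can always be stopped.
            while (!Thread.currentThread().isInterrupted()) {
                System.out.println(snapshot(queue, pipedIn));
                Thread.sleep(intervalMillis);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

A pipe reading close to its buffer size together with a queue at capacity points to the consumers having stalled rather than the producers.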
Here is my XML splitter. Splitter.split() uses an XMLEventWriter, which I flush and offer() to the queue at the matching end element.
class QueueWriter implements Runnable {
    private final ArrayBlockingQueue<String> queue;
    private final PipedOutputStream pipedOut;
    private final PipedInputStream pipedIn;

    QueueWriter(ArrayBlockingQueue<String> q, PipedOutputStream po,
            PipedInputStream pi) {
        this.queue = q;
        this.pipedOut = po;
        this.pipedIn = pi;
    }

    public void run() {
        try {
            // Connect the read side to the Fetcher's output, then unzip and split
            pipedIn.connect(pipedOut);
            InputStream inputStream = new GZIPInputStream(pipedIn);
            Splitter.split(inputStream, "repeated_node", queue, queueSize);
        } catch (Exception ex) {
            // exception is silently ignored
        }
    }
}
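Splitter.split() is not shown. The following is a minimal sketch of the approach described (an XMLEventWriter per repeated node, flushed and handed to the queue at the matching end element), assuming the standard javax.xml.stream (StAX) API. The class name `SplitterSketch` and the three-argument signature are hypothetical, and it uses put() where the description above says offer(): offer() on a full ArrayBlockingQueue returns false immediately without enqueuing, whereas put() waits for space.

```java
import java.io.InputStream;
import java.io.StringWriter;
import java.util.concurrent.BlockingQueue;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical stand-in for Splitter; not the original implementation.
public class SplitterSketch {

    /** Streams the input and queues each nodeName element as its own XML string. */
    public static int split(InputStream in, String nodeName, BlockingQueue<String> queue)
            throws XMLStreamException, InterruptedException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
        StringWriter buffer = null;
        XMLEventWriter writer = null; // non-null only while inside a repeated node
        int depth = 0;                // element nesting depth inside the current node
        int count = 0;

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (writer == null) {
                // Outside a repeated node: wait for the next matching start element.
                if (event.isStartElement() && nodeName.equals(
                        event.asStartElement().getName().getLocalPart())) {
                    buffer = new StringWriter();
                    writer = outputFactory.createXMLEventWriter(buffer);
                    writer.add(event);
                    depth = 1;
                }
                continue;
            }
            writer.add(event);
            if (event.isStartElement()) {
                depth++;
            } else if (event.isEndElement() && --depth == 0) {
                // Matching end element reached: flush and hand the node over.
                writer.flush();
                writer.close();
                queue.put(buffer.toString()); // blocks while the queue is full
                writer = null;
                count++;
            }
        }
        reader.close();
        return count;
    }
}
```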
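Here is the processor. Note that when I comment out the entire consume method except for a line at the top that prints the string to the output, the result is the same (it runs successfully with a single thread and fails with several).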
class QueueProcessor implements Runnable {
    private final ArrayBlockingQueue<String> queue;
    private final Map<String, String> xpaths;
    private final DbHelper db;

    QueueProcessor(ArrayBlockingQueue<String> q, Map<String, String> m) {
        this.queue = q;
        this.xpaths = m;
        this.db = new DbHelper();
    }

    public void run() {
        try {
            while (true) {
                // Busy-wait with a short sleep until a node is available
                String node;
                while ((node = queue.poll()) == null) {
                    Thread.sleep(100);
                }
                consume(node);
            }
        } catch (Exception ex) {
            // exception is silently ignored; the thread ends here
        }
    }

    void consume(String xmlString) {
        // Evaluate each column's xpath against the node and insert one row
        DocHelper myHelper = new DocHelper(xmlString);
        Map<String, String> columnValues = new HashMap<String, String>();
        for (Map.Entry<String, String> entry : xpaths.entrySet()) {
            columnValues.put(entry.getKey(),
                    myHelper.getXpathValue(entry.getValue()));
        }
        db.insert("mytable", columnValues);
    }
}
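I've tried varying the pipe buffer size between 100k and 6M. I've tried queue sizes from 1 to 2000. I've also tried a ConcurrentLinkedQueue and a synchronized queue instead of the ArrayBlockingQueue. I've tried put/take and remove/add instead of poll/offer. I've tried fairness on the ArrayBlockingQueue. I've tried using executors instead of creating and starting the threads directly. Some combinations fail faster, but the only thing that reliably determines whether it succeeds or fails is multithreading the last step.
Any insight into why this happens would be greatly appreciated.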
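For reference, here is a minimal sketch of what the put/take plus executor variant could look like, with failures logged instead of swallowed so that a dying worker becomes visible. The class name `WorkerPoolSketch`, the `drain` method, and the `POISON` end-of-input marker are hypothetical. It uses a plain `Executors` pool to stay self-contained; on App Engine the threads would have to come from ThreadManager, as in runImport().

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical illustration; not the code from the application.
public class WorkerPoolSketch {
    // End-of-input marker, compared by identity so it cannot collide with real data.
    public static final String POISON = new String("__END__");

    /** Runs `workers` consumers until POISON is seen; returns how many nodes succeeded. */
    public static int drain(BlockingQueue<String> queue, int workers,
                            Consumer<String> handler) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicInteger processed = new AtomicInteger();
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                try {
                    while (true) {
                        String node = queue.take(); // blocks; throws if interrupted
                        if (node == POISON) {
                            queue.put(POISON);      // pass the marker on to the other workers
                            return;
                        }
                        try {
                            handler.accept(node);
                            processed.incrementAndGet();
                        } catch (RuntimeException e) {
                            // Log and continue, so one bad node does not silently end the worker.
                            System.err.println(Thread.currentThread().getName()
                                    + " failed on a node: " + e);
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and exit
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
        return processed.get();
    }
}
```

In this sketch the producer would put `WorkerPoolSketch.POISON` on the queue after the last node, and the handler would wrap a call such as consume(node).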