Question

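I'm writing a Google App Engine Java application that fetches a large (1 GB) XML file, splits it into its repeating nodes, and writes the contents of each node to a Cloud SQL database.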

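The fetch and split are much faster than the data extraction / db write, so I'm trying to parallelize that part by running multiple threads. The problem is that the processing threads (when run in parallel) stop writing to the database after a few minutes and don't respond to interrupt requests.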

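I have a main Import class, and my Fetcher, Splitter and Processor are inner classes that implement Runnable; I create and start the threads from a method of the main class. The Fetcher writes to a PipedOutputStream, and the Splitter reads from the connected PipedInputStream and writes the individual xml nodes as strings to an ArrayBlockingQueue.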

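I have a ProcessMonitor thread that tells me that, when the writes stop, the pipe and the queue are full or filling fast, which suggests the problem is in the final processor threads. When I run only one thread for the last step, the process appears to succeed every time (I killed it when it was about 1/3 done because it takes forever).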

public void runImport() {

    // Pipes for fetch->split, queue for split->process
    PipedOutputStream pipedOut = new PipedOutputStream();
    PipedInputStream pipedIn = new PipedInputStream(2000000);
    ArrayBlockingQueue<String> myQueue = new ArrayBlockingQueue<>(100);

    // Monitor resources
    Thread monitorThread = ThreadManager
            .createBackgroundThread(new Monitor(myQueue, pipedOut, pipedIn));
    monitorThread.start();

    Thread fetchThread = ThreadManager.createBackgroundThread(new Fetcher(
            pipedOut));

    Thread splitThread = ThreadManager
            .createBackgroundThread(new QueueWriter(myQueue, pipedOut,
                    pipedIn));

    // Get xpaths for column values
    DbHelper db = new DbHelper();
    Map<String, String> columnXpaths = db
            .getMap("select columnName, xpath from xpath");

    // Create threads for processing row xml strings from queue
    Thread[] insertThreads = new Thread[5];

    for (int i = 0; i < 5; i++) {
        try {
            insertThreads[i] = ThreadManager
                    .createBackgroundThread(new QueueProcessor(myQueue,
                            columnXpaths));
        } catch (Exception ex) {
        }
    }

    // Start threads fetching, splitting, and writing to queue
    fetchThread.start();
    splitThread.start();
    for (Thread t : insertThreads) {
        if (t != null)
            t.start();
    }
}
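
The Monitor class passed to the first background thread above is not shown. A minimal sketch of what such a monitor might look like follows; the class name PipelineMonitor, the snapshot() method and the reporting interval are all assumptions for illustration, not the original code. It reports how many bytes are buffered in the pipe and how full the queue is.

```java
import java.io.IOException;
import java.io.PipedInputStream;
import java.util.concurrent.BlockingQueue;

// Hypothetical stand-in for the Monitor class, which the question does not show.
class PipelineMonitor implements Runnable {

    private final BlockingQueue<String> queue;
    private final PipedInputStream pipedIn;
    private final long intervalMillis; // assumed reporting interval

    PipelineMonitor(BlockingQueue<String> queue, PipedInputStream pipedIn,
            long intervalMillis) {
        this.queue = queue;
        this.pipedIn = pipedIn;
        this.intervalMillis = intervalMillis;
    }

    // One status line: bytes waiting in the pipe and the queue fill level.
    String snapshot() throws IOException {
        int capacity = queue.size() + queue.remainingCapacity();
        return "pipe bytes buffered=" + pipedIn.available()
                + ", queue=" + queue.size() + "/" + capacity;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                System.out.println(snapshot());
                Thread.sleep(intervalMillis);
            }
        } catch (InterruptedException ex) {
            // Restore the flag so the owner of the thread can see the interrupt.
            Thread.currentThread().interrupt();
        } catch (IOException ex) {
            System.err.println("Monitor stopped: " + ex);
        }
    }
}
```

A pipe that stays full together with a queue that stays at capacity points at the consumers of the queue, which matches the observation described above.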

Here's my xml splitter. Splitter.split() uses an XMLEventWriter, which I flush and offer() to the queue at the proper end element.

class QueueWriter implements Runnable {

    private final ArrayBlockingQueue<String> queue;
    private final PipedOutputStream pipedOut;
    private final PipedInputStream pipedIn;

    QueueWriter(ArrayBlockingQueue<String> q, PipedOutputStream po,
            PipedInputStream pi) {
        this.queue = q;
        this.pipedOut = po;
        this.pipedIn = pi;
    }

    public void run() {
        try {
            pipedIn.connect(pipedOut);
            InputStream inputStream = new GZIPInputStream(pipedIn);
            Splitter.split(inputStream, "repeated_node", queue, queueSize);

        } catch (Exception ex) {
        }
    }
}
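
Splitter.split() itself is not shown. The sketch below is one way a StAX-based splitter of this kind could work, under several assumptions: the class name SplitterSketch is hypothetical, the target element is assumed never to be nested inside itself, and put() is used instead of offer() so that a full queue blocks the splitter rather than dropping a node.

```java
import java.io.InputStream;
import java.io.StringWriter;
import java.util.concurrent.BlockingQueue;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical splitter; the real Splitter class is not part of the question.
class SplitterSketch {

    // Copies each <nodeName>...</nodeName> subtree into its own string and
    // puts it on the queue. Returns the number of nodes queued.
    static int split(InputStream in, String nodeName, BlockingQueue<String> queue)
            throws XMLStreamException, InterruptedException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLOutputFactory outFactory = XMLOutputFactory.newInstance();
        StringWriter buffer = null;
        XMLEventWriter writer = null;
        int count = 0;

        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();

            // Start capturing when the target element opens.
            if (writer == null && event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals(nodeName)) {
                buffer = new StringWriter();
                writer = outFactory.createXMLEventWriter(buffer);
            }

            if (writer != null) {
                writer.add(event);
                // Stop capturing when the target element closes.
                if (event.isEndElement()
                        && event.asEndElement().getName().getLocalPart().equals(nodeName)) {
                    writer.flush();
                    writer.close();
                    queue.put(buffer.toString()); // blocks while the queue is full
                    writer = null;
                    count++;
                }
            }
        }
        reader.close();
        return count;
    }
}
```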

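Here's the processor. Note that when I comment out the entire consume method except for a line at the top that prints the string to output, the result is the same (it succeeds with a single thread and fails with several).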

class QueueProcessor implements Runnable {

    private final ArrayBlockingQueue<String> queue;
    private final Map<String, String> xpaths;
    private final DbHelper db;

    QueueProcessor(ArrayBlockingQueue<String> q, Map<String, String> m) {
        this.queue = q;
        this.xpaths = m;
        this.db = new DbHelper();
    }

    public void run() {
        try {
            while (true) {
                String node;
                while ((node = queue.poll()) == null) {
                    Thread.sleep(100);
                }
                consume(node);
            }
        } catch (Exception ex) {
        }
    }

    void consume(String xmlString) {

        DocHelper myHelper = new DocHelper(xmlString);
        Map<String, String> columnValues = new HashMap<String, String>();

        for (Map.Entry<String, String> entry : xpaths.entrySet()) {
            columnValues.put(entry.getKey(),
                    myHelper.getXpathValue(entry.getValue()));
        }

        db.insert("mytable", columnValues);
    }
}

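I've tried varying the pipe buffer size between 100k and 6M. I've tried queue sizes from 1 to 2000. I've also tried concurrentLinkedQueue and synchronizedQueue instead of arrayBlockingQueue. I've tried put / take and remove / add instead of poll / offer. I've tried fairness on the arrayBlockingQueue. I've tried using executors instead of creating and starting the threads directly. Some combinations fail faster than others, but the only thing that reliably determines whether it succeeds or fails is whether the last step is multithreaded.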
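
For reference, a put / take style consumer can be sketched as below. This is not the original QueueProcessor: the class name LoggingQueueProcessor and the Consumer<String> handler (standing in for consume()) are assumptions. Unlike the empty catch blocks in the code above, this version logs failures and restores the interrupt flag, so a worker that dies or is interrupted leaves a trace.

```java
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical variant of the processor that uses take() and logs failures.
class LoggingQueueProcessor implements Runnable {

    private static final Logger LOG =
            Logger.getLogger(LoggingQueueProcessor.class.getName());

    private final BlockingQueue<String> queue;
    private final Consumer<String> handler; // stands in for consume(xmlString)

    LoggingQueueProcessor(BlockingQueue<String> queue, Consumer<String> handler) {
        this.queue = queue;
        this.handler = handler;
    }

    @Override
    public void run() {
        try {
            while (true) {
                // take() blocks until a node is available and throws
                // InterruptedException when the thread is interrupted.
                String node = queue.take();
                try {
                    handler.accept(node);
                } catch (RuntimeException ex) {
                    // Log and keep going so one bad node does not kill the worker.
                    LOG.log(Level.SEVERE, "Failed to process node", ex);
                }
            }
        } catch (InterruptedException ex) {
            // Restore the flag and exit so the thread responds to interrupts.
            Thread.currentThread().interrupt();
            LOG.info("Processor interrupted, exiting");
        }
    }
}
```

Each worker would still need its own database connection or a thread-safe helper; this sketch only covers the queue handling and error reporting.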

Any insight into why this might be happening would be greatly appreciated.

Google App Engine backend background threads become unresponsive when run in parallel

0 answers: