Question

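I'm writing an ETL process that reads event-level data from a product database, transforms and aggregates it, and writes it to an analytics data warehouse. I'm using Clojure's core.async library to split this work into concurrently executing components. Here is what the main part of my code looks like right now: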

    (ns data-staging.main
        (:require [clojure.core.async :as async])
        (:use [clojure.core.match :only (match)]
              [data-staging.map-vecs]
              [data-staging.tables])
        (:gen-class))

    (def submissions (make-table "Submission" "Valid"))
    (def photos (make-table "Photo"))
    (def videos (make-table "Video"))
    (def votes (make-table "Votes"))

    ;; define channels used for sequential data processing
    (def chan-in (async/chan 100))
    (def chan-out (async/chan 100))

    (defn write-thread
        "infinitely loops between reading subsequent 10000 rows from
         table and outputting a vector of the rows (maps)
         into 'chan-in'"
        [table]
        (while true
            (let [next-rows (get-rows table)]
                (async/>!! chan-in next-rows)
                (set-max table (:max-id (last next-rows))))))

    (defn aggregator
        "takes output from 'chan-in' and aggregates it by coupon_id, date.
         then adds / drops any fields that are needed / not needed and inputs
         into 'chan-out'"
        []
        (while true
            (->>
                (async/<!! chan-in)
                aggregate
                (async/>!! chan-out))))

    (defn read-thread
        "reads data from 'chan-out' and inserts into Analytics DB"
        []
        (while true 
            (upsert (async/<!! chan-out))))

    (defn -main []
        (async/thread (write-thread submissions))
        (async/thread (write-thread photos))
        (async/thread (write-thread videos))
        (async/thread-call aggregator)
        (async/thread-call read-thread))

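As you can see, I'm putting each component on its own thread and using the blocking >!! call on the channels. It feels like the non-blocking >! call together with go routines might be a better fit for this use case, especially for the database reads, which spend most of their time performing I/O and waiting for new rows in the product DB. Is that the case, and if so, what would be the best way to implement it? I'm a little unclear on the tradeoffs between the two approaches and on how to use go routines effectively. Any other suggestions for improving the overall architecture would also be much appreciated!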

Answer 1

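Personally, I think your use of threads is probably the right call here. The non-blocking nature of go blocks comes from "parking", a special kind of pseudo-blocking used by core.async's state machine. Your database calls, however, genuinely block rather than putting the state machine into a parked state, so inside go blocks you would simply be blocking some of the threads in the core.async thread pool. It does depend on how long your synchronous calls take, so this is the kind of thing where benchmarks can be informative, but I strongly suspect that threads are the right approach.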
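A minimal sketch of the distinction, assuming a hypothetical slow-query function that stands in for a synchronous database call (simulated here with Thread/sleep). The blocking work runs on its own thread through async/thread, which returns a channel; a go block can then park on that channel with <! without holding a pool thread.

```clojure
(require '[clojure.core.async :as async :refer [<! <!! go]])

;; Hypothetical stand-in for a synchronous database read.
;; Thread/sleep really blocks the calling thread, as a blocking driver would.
(defn slow-query [table]
  (Thread/sleep 200)
  [{:table table :id 1} {:table table :id 2}])

;; async/thread runs the body on a separate thread (not the go-block pool)
;; and returns a channel that receives the body's result.
(defn query-chan [table]
  (async/thread (slow-query table)))

;; The go block parks on the result channel instead of blocking a pool thread.
(def row-count
  (go (count (<! (query-chan "Submission")))))

(println (<!! row-count))
```

Calling slow-query directly inside the go block would also work, but it would tie up one of the pool's threads for the full duration of the call.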

The one exception is your aggregator function. It looks to me like it could simply be folded into the definition of chan-out, as (def chan-out (map< aggregate chan-in)).
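Note that map< has since been deprecated in core.async in favour of transducers. A rough equivalent, assuming a placeholder aggregate function (the real one lives in the asker's own namespaces), attaches the transformation to the channel itself:

```clojure
(require '[clojure.core.async :as async])

;; Placeholder for the asker's real aggregate function: here it just
;; counts the rows in each batch.
(defn aggregate [rows]
  {:row-count (count rows)})

;; A buffered channel with a transducer: every batch put onto the channel
;; is passed through aggregate before a consumer takes it.
(def chan-out (async/chan 100 (map aggregate)))

(async/>!! chan-out [{:id 1} {:id 2} {:id 3}])
(println (async/<!! chan-out))
```

With this shape the table readers would put directly onto chan-out, so a separate chan-in is no longer needed. If the aggregation turns out to be CPU-heavy, async/pipeline between two channels may be a better fit.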

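For a general overview of go blocks versus threads, Martin Trojer wrote a good examination of the two approaches and of which one is faster in which situation. The Cliff's Notes version is that go blocks are good for adapting already-asynchronous libraries for use with core.async, while threads are good for building asynchronous processes out of synchronous parts. If your database had a callback-based API, for example, go blocks would be a clear win. Since it is synchronous, they are not a good fit.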
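To illustrate the callback case, here is a sketch that assumes a hypothetical callback-based client, fetch-rows-async, which is not part of any real driver. The adapter turns the callback into a channel with put!, and a go block then parks on that channel.

```clojure
(require '[clojure.core.async :as async :refer [<! <!! go chan put!]])

;; Hypothetical callback-based client: invokes on-result with the rows
;; from another thread. A daemon thread simulates the driver's I/O thread.
(defn fetch-rows-async [table on-result]
  (doto (Thread. (fn []
                   (Thread/sleep 100)
                   (on-result [{:table table :id 1}])))
    (.setDaemon true)
    (.start))
  nil)

;; Adapter: turn the callback into a channel. put! is asynchronous and can
;; be called from any thread, including the driver's callback thread.
(defn fetch-rows-chan [table]
  (let [c (chan 1)]
    (fetch-rows-async table (fn [rows]
                              (put! c rows)
                              (async/close! c)))
    c))

;; A go block can now park on the result without occupying a thread.
(def result (go (first (<! (fetch-rows-chan "Video")))))

(println (<!! result))
```

No thread sits idle while the simulated query is in flight, which is the situation in which go blocks pay off.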

Answer 2

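I think that in this ETL case it would be better to use the "go" macro to get non-blocking threads.

I wrote a very simple piece of code that implements the synchronized sequence of processes implied by the Extract, Transform and Load tasks.

Type the following code at your REPL: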

    (require '[clojure.core.async :as async :refer [<! >! <!! timeout chan alt! go]])

    (def output (chan))

    (defn extract [origin]
      (let [value-extracted (chan)
            value-transformed (chan)
            value-loaded (chan)]
        (go
          (<! (timeout (+ 100 (* 100 (rand-int 20))))) ; wait a little
          (>! value-extracted (str origin " > extracted  ")))
        (go
          (<! (timeout (+ 100 (* 100 (rand-int 20))))) ; wait a little
          (>! value-transformed (str (<! value-extracted) " > transformed ")))
        (go
          (<! (timeout (+ 100 (* 100 (rand-int 20))))) ; wait a little
          (>! value-loaded (str (<! value-transformed) " > loaded ")))
        (go
          (<! (timeout (+ 100 (* 100 (rand-int 20))))) ; wait a little
          (>! output [origin (<! value-loaded)]))))

    (go
      (loop [origins-already-loaded []]
        (let [[id message] (<! output)
              origins-updated (conj origins-already-loaded id)]
          (println message)
          (println origins-updated)
          (recur origins-updated))))

Then type at the REPL:

    (doseq [example (take 10 (range))] (extract example))

Sample output (the order varies from run to run, because each stage waits a random amount of time):

    1 > extracted   > transformed  > loaded 
    [1]
    7 > extracted   > transformed  > loaded 
    [1 7]
    0 > extracted   > transformed  > loaded 
    [1 7 0]
    8 > extracted   > transformed  > loaded 
    [1 7 0 8]
    3 > extracted   > transformed  > loaded 
    [1 7 0 8 3]
    6 > extracted   > transformed  > loaded 
    [1 7 0 8 3 6]
    2 > extracted   > transformed  > loaded 
    [1 7 0 8 3 6 2]
    5 > extracted   > transformed  > loaded 
    [1 7 0 8 3 6 2 5]
    9 > extracted   > transformed  > loaded 
    [1 7 0 8 3 6 2 5 9]
    4 > extracted   > transformed  > loaded 
    [1 7 0 8 3 6 2 5 9 4]

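UPDATE:
The bug that was fixed was the use of <!! (timeout (+ 100 (* 100 (rand-int 20)))) inside a since-removed function, "wait-a-while", which blocked the other go processes that were meant to be non-blocking.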
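The removed helper is no longer shown in the answer, so the following is only a guess at its shape. Because <! must appear lexically inside a go block, a helper function cannot park on the caller's behalf. One workaround is for the helper to return the timeout channel and let the calling go block park on it. The stage function is likewise a hypothetical name used for illustration.

```clojure
(require '[clojure.core.async :as async :refer [<! <!! go timeout]])

;; Hypothetical reconstruction of the removed helper. It returns the
;; timeout channel instead of blocking on it with <!!.
(defn wait-a-while []
  (timeout (+ 100 (* 100 (rand-int 20)))))

;; The calling go block parks on the returned channel, so the pool thread
;; is free to run other go blocks while this one waits.
(defn stage [label input]
  (go
    (<! (wait-a-while))
    (str input " > " label)))

(println (<!! (stage "extracted" "origin-0")))
```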

When to use non-blocking >! / threads and blocking >!! / goroutines with clojure core.async
