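I'm writing an ETL process that reads event-level data from a product database, transforms/aggregates it, and writes it to an analytics data warehouse. I'm using Clojure's core.async library to split these processes into concurrently executing components. Here is what the main part of my code looks like right now: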
(ns data-staging.main
  (:require [clojure.core.async :as async])
  (:use [clojure.core.match :only (match)]
        [data-staging.map-vecs]
        [data-staging.tables])
  (:gen-class))

(def submissions (make-table "Submission" "Valid"))
(def photos (make-table "Photo"))
(def videos (make-table "Video"))
(def votes (make-table "Votes"))

;; define channels used for sequential data processing
(def chan-in (async/chan 100))
(def chan-out (async/chan 100))

;; Note: a docstring must come before the argument vector; placed after it,
;; the string is just an expression that is evaluated and discarded.
(defn write-thread
  "infinitely loops between reading the subsequent 10000 rows from
  table and outputting a vector of the rows (maps)
  into 'chan-in'"
  [table]
  (while true
    (let [next-rows (get-rows table)]
      (async/>!! chan-in next-rows)
      (set-max table (:max-id (last next-rows))))))

(defn aggregator
  "takes output from 'chan-in' and aggregates it by coupon_id, date.
  then adds / drops any fields that are needed / not needed and inputs
  into 'chan-out'"
  []
  (while true
    (->>
      (async/<!! chan-in)
      aggregate
      (async/>!! chan-out))))

(defn read-thread
  "reads data from chan-out and inserts into Analytics DB"
  []
  (while true
    (upsert (async/<!! chan-out))))

(defn -main []
  (async/thread (write-thread submissions))
  (async/thread (write-thread photos))
  (async/thread (write-thread videos))
  (async/thread-call aggregator)
  (async/thread-call read-thread))
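As you can see, I'm putting each component on its own OS thread and using the blocking >!! call to put onto the channels. It feels like using the non-blocking >! calls together with go routines might be better for this use case, especially for the database reads, which spend most of their time doing I/O and waiting for new rows in the product DB. Is that the case, and if so, what would be the best way to implement it? I'm a little unclear on all the tradeoffs between the two approaches and on how to use go routines effectively. Any other suggestions on how to improve the overall architecture would also be much appreciated!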
Answer 0 (score: 17)
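Personally, I think your use of threads here is probably the right call. The magic non-blocking nature of go-blocks comes from "parking", a special kind of pseudo-blocking that core.async's state machine uses. But since your database calls genuinely block instead of putting the state machine into a parked state, you would just be blocking some of the threads in the core.async thread pool. It does depend on how long your synchronous calls take, so this is the sort of thing where benchmarks can be informative, but I strongly suspect threads are the right approach.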
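To illustrate the distinction, here is a minimal sketch in which the blocking call lives in `async/thread` while a go block only parks on its result. `blocking-fetch` and `fetch-chan` are hypothetical stand-ins, not part of the question's code, and the 50 ms sleep merely simulates a synchronous database read.

```clojure
(require '[clojure.core.async :as async :refer [<! <!! go]])

;; Hypothetical stand-in for a synchronous database read: it blocks the
;; calling thread for about 50 ms and then returns a vector of row maps.
(defn blocking-fetch [table]
  (Thread/sleep 50)
  [{:table table :id 1} {:table table :id 2}])

;; async/thread runs its body on a separate real thread and returns a
;; channel that receives the body's result, so the blocking call never
;; occupies a thread from the go-block pool.
(defn fetch-chan [table]
  (async/thread (blocking-fetch table)))

;; A go block can then park (rather than block) while it waits for the rows.
(def row-count
  (go (count (<! (fetch-chan "Photo")))))

(println (<!! row-count)) ; prints 2
```

The question's `write-thread` already follows the first half of this pattern; the go block is only worth adding where some downstream step is itself non-blocking.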
The one exception is your aggregator function. It looks to me like it could be folded into the definition of chan-out, as (def chan-out (map< aggregate chan-in)).
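Note that `map<` is marked as deprecated in current core.async releases in favor of transducers. A rough equivalent, assuming the same `chan-in`/`chan-out` names as the question, attaches a `(map aggregate)` transducer to `chan-out` and pipes `chan-in` into it. The `aggregate` function below is a hypothetical stand-in for the real one.

```clojure
(require '[clojure.core.async :as async])

;; Hypothetical stand-in for the question's aggregate function.
(defn aggregate [rows]
  {:row-count (count rows)})

(def chan-in (async/chan 100))

;; The (map aggregate) transducer is applied to every value put on chan-out.
;; A channel with a transducer must be buffered.
(def chan-out (async/chan 100 (map aggregate)))

;; pipe copies every value from chan-in to chan-out, and by default closes
;; chan-out when chan-in closes.
(async/pipe chan-in chan-out)

(async/>!! chan-in [{:id 1} {:id 2} {:id 3}])
(println (async/<!! chan-out)) ; prints {:row-count 3}
```

With this in place the separate aggregator thread is no longer needed.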
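For a general overview of go-blocks versus threads, Martin Trojer wrote a nice examination of the two approaches and which one is faster in which situation. The Cliff's Notes version is that go-blocks are good for adapting already-asynchronous libraries for use with core.async, while threads are good for making asynchronous processes out of synchronous parts. If your database had a callback-based API, for example, then go-blocks would be a definite win. But since it is synchronous, they are not a good fit.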
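For contrast, this is roughly what adapting a callback-based API would look like. `query-async` is a made-up client used purely for illustration (the question's database has no such API); the adapter `query-chan` is likewise hypothetical.

```clojure
(require '[clojure.core.async :as async :refer [<! <!! go]])

;; Hypothetical callback-based client: it returns immediately and later
;; invokes on-result with the rows from another thread.
(defn query-async [table on-result]
  (.start (Thread. (fn []
                     (Thread/sleep 50)
                     (on-result [{:table table :id 1}])))))

;; Adapter: the callback hands the rows to a channel with put!, which never
;; blocks the calling thread, then closes the channel.
(defn query-chan [table]
  (let [result (async/chan 1)]
    (query-async table (fn [rows]
                         (async/put! result rows)
                         (async/close! result)))
    result))

;; No thread is tied up while the query is in flight; the go block parks.
(def first-id
  (go (:id (first (<! (query-chan "Votes"))))))

(println (<!! first-id)) ; prints 1
```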
Answer 1 (score: 3)
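I think that in this ETL case it would be better to use the "go" macro to get non-blocking threads.
I wrote some very simple code that implements the synchronous sequence of steps implied by the Extract, Transform and Load tasks.
Enter the following code at your REPL: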
(require '[clojure.core.async :as async :refer [<! >! <!! timeout chan alt! go]])

(def output (chan))

(defn extract [origin]
  (let [value-extracted (chan)
        value-transformed (chan)
        value-loaded (chan)]
    (go
      (<! (timeout (+ 100 (* 100 (rand-int 20))))) ; wait a little
      (>! value-extracted (str origin " > extracted ")))
    (go
      (<! (timeout (+ 100 (* 100 (rand-int 20))))) ; wait a little
      (>! value-transformed (str (<! value-extracted) " > transformed ")))
    (go
      (<! (timeout (+ 100 (* 100 (rand-int 20))))) ; wait a little
      (>! value-loaded (str (<! value-transformed) " > loaded ")))
    (go
      (<! (timeout (+ 100 (* 100 (rand-int 20))))) ; wait a little
      (>! output [origin (<! value-loaded)]))))

(go
  (loop [origins-already-loaded []]
    (let [[id message] (<! output)
          origins-updated (conj origins-already-loaded id)]
      (println message)
      (println origins-updated)
      (recur origins-updated))))
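Then enter the following at the REPL. The lines after the call are its output; the order varies from run to run because of the random timeouts: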
(doseq [example (take 10 (range))] (extract example))
1 > extracted > transformed > loaded
[1]
7 > extracted > transformed > loaded
[1 7]
0 > extracted > transformed > loaded
[1 7 0]
8 > extracted > transformed > loaded
[1 7 0 8]
3 > extracted > transformed > loaded
[1 7 0 8 3]
6 > extracted > transformed > loaded
[1 7 0 8 3 6]
2 > extracted > transformed > loaded
[1 7 0 8 3 6 2]
5 > extracted > transformed > loaded
[1 7 0 8 3 6 2 5]
9 > extracted > transformed > loaded
[1 7 0 8 3 6 2 5 9]
4 > extracted > transformed > loaded
[1 7 0 8 3 6 2 5 9 4]
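Update:
The bug that was fixed was the use of (<!! (timeout (+ 100 (* 100 (rand-int 20))))) inside the since-removed function "wait-a-while", which blocked the other processes that should have been non-blocking.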
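As a small sketch of why the parking take matters here: the go blocks below all wait on a timeout with `<!`, so far more of them can be in flight than the small go-block thread pool has threads. The helper name `parked-wait` is made up for this illustration.

```clojure
(require '[clojure.core.async :as async :refer [<! <!! go timeout]])

;; Each go block parks on the timeout with <!, releasing its pool thread
;; while it waits, and then yields its id.
(defn parked-wait [id]
  (go
    (<! (timeout 200))
    id))

;; Start 100 waits at once, merge their result channels, and collect every
;; id into a vector once all of them have finished.
(def results
  (<!! (async/into [] (async/merge (mapv parked-wait (range 100))))))

(println (count results)) ; prints 100
```

Replacing `<!` with `<!!` inside `parked-wait` would hold a pool thread for the full duration of each wait, which is the kind of stall the update describes.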