I have a log file that is 1.6 GB in size and contains 2 million records. I read the contents of the log into a channel, perform some transformations, and write the results to another channel. Finally, I write the contents of the second channel to a file.
My code works and produces the expected results, but the whole operation takes about 45 seconds, which is too long. I need to reduce the time it takes.
(require '[clojure.core.async :refer [chan go go-loop >! <!! close!]])

(def reader-channel
  (delay
    (let [temp (chan)]
      (go
        (with-open [reader (clojure.java.io/reader "My_Big_Log")]
          (doseq [ln (line-seq reader)]
            (>! temp ln)))
        (close! temp))
      temp)))
(def writer-channel (chan))
(defn make-collection []
  (loop [my-coll []]
    (let [item (<!! @reader-channel)]
      (if (nil? item)
        my-coll
        (let [temp (re-find #"[a-z]+\.[a-z]+\.[a-z]+" item)]
          (recur (conj my-coll temp)))))))
(def transformed-collection
  (delay (partition-by identity
                       (remove nil? (sort (make-collection))))))
(defn transform []
  (go-loop [counter 0]
    (if (>= counter (count @transformed-collection))
      (do (close! writer-channel)
          (println "Goodbye"))
      (do (let [item (str "Referrer " (+ counter 1) ": "
                          (first (nth @transformed-collection counter)))]
            (>! writer-channel item))
          (let [item (str "Number of entries associated with this referrer: "
                          (count (nth @transformed-collection counter)))]
            (>! writer-channel item))
          (recur (inc counter))))))
(defn write-to-file []
  (with-open [wrtr (clojure.java.io/writer "Result.txt" :append true)]
    (loop []
      (when-let [temp (<!! writer-channel)]
        (.write wrtr (str temp "\n"))
        (recur)))))
Apologies for the indentation and formatting mistakes.
Answer 0 (score: 1)
transform performs several very expensive operations on every pass through its loop. count and nth on a lazy sequence each take O(n) time. Instead of using either of them, process the sequence lazily with first and next.
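A minimal sketch of that suggestion, reusing the question's transformed-collection and writer-channel (a hypothetical rewrite, not the answerer's own code):

(require '[clojure.core.async :refer [go-loop >! close!]])

;; Walk the partitioned sequence with first/next instead of indexing into it
;; with nth, and stop when it is exhausted instead of calling count on it.
(defn transform []
  (go-loop [counter 1
            groups  (seq @transformed-collection)]
    (if (nil? groups)
      (do (close! writer-channel)
          (println "Goodbye"))
      (let [group (first groups)]
        (>! writer-channel (str "Referrer " counter ": " (first group)))
        (>! writer-channel (str "Number of entries associated with this referrer: "
                                (count group)))
        (recur (inc counter) (next groups))))))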
Answer 1 (score: 1)
I'm not usually one to golf, but this does seem like it can be done with much less code. We want the frequency of each referrer, so we can just count them directly.
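A minimal sketch of that frequency count, assuming the question's file name and referrer regex:

(require '[clojure.java.io :as io])

;; One pass over the file: pull the referrer out of each line and count
;; occurrences. Space is proportional to the number of distinct referrers,
;; not to the number of lines.
(with-open [rdr (io/reader "My_Big_Log")]
  (frequencies
   (keep #(re-find #"[a-z]+\.[a-z]+\.[a-z]+" %) (line-seq rdr))))
;; => e.g. {"a.b.c" 400186, "c.d.e" 399667, ...}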
Counting referrers by building a list of all 2 million of them and then sorting and partitioning that list means carrying around a lot of data you don't need. Counting them directly works in O(distinct referrers) space rather than O(lines), which, depending on your log, may be a substantial reduction.
I'm also not sure why you are using core.async here. For a simple count like this it adds very little, and it makes it harder to see what is going on in the code.
Finally: just profile. Profiling will show you a lot of interesting things about your code that you may not have known.
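As a rough first cut at that, without any extra tooling, you could time just the read-plus-regex pass on its own (a sketch assuming the question's file name and pattern) to see how much of the 45 seconds goes to raw I/O and matching versus the later sorting and channel work:

;; Time only reading the file and applying the regex, with no channels or sort.
(time
 (with-open [rdr (clojure.java.io/reader "My_Big_Log")]
   (count (keep #(re-find #"[a-z]+\.[a-z]+\.[a-z]+" %) (line-seq rdr)))))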
Answer 2 (score: 0)
sort is slow, and count and nth on a lazy sequence are also expensive. You can avoid them (and all the intermediate sequences) with transducers. On my MBP, roughly 2 million records took about 5 seconds.
(require '[clojure.core.async :refer [chan pipe thread onto-chan <!!]]
         '[clojure.java.io :as io])

(defn transform [input-f output-f]
  (let [read-ch  (chan 1 (comp (map (partial re-find #"[a-z]+\.[a-z]+\.[a-z]+"))
                               ;; remove lines with no referrer
                               (remove nil?)
                               ;; the bag transducer is like a set, but with a
                               ;; counter per element, e.g. {"a.b.c" 1 "c.d.e" 3}
                               (bag)
                               ;; emit each map entry as a sequence element:
                               ;; (["a.b.c" 1] ["c.d.e" 3])
                               cat
                               ;; generate the output lines
                               (map-indexed (fn [i [x cnt]]
                                              [(str "Referrer " i ": " x)
                                               (str "Number of entries associated with this referrer: " cnt)]))
                               ;; flatten the output lines:
                               ;; (["l1" "l2"] ["l3" "l4"]) => ("l1" "l2" "l3" "l4")
                               cat))
        write-ch (chan)]
    ;; wire up read-ch to write-ch
    (pipe read-ch write-ch true)
    ;; spin up a thread to read all lines into read-ch
    (thread
      (with-open [reader (io/reader input-f)]
        (<!! (onto-chan read-ch (line-seq reader) true))))
    ;; write the counted lines to the output file
    (with-open [wtr (io/writer output-f)]
      (loop []
        (when-let [temp (<!! write-ch)]
          (.write wtr (str temp "\n"))
          (recur))))))
(time
 (transform "input.txt" "output.txt"))
;; => "Elapsed time: 5286.222668 msecs"
And here is the "one-shot" counting bag transducer I used:
(defn bag []
  (fn [rf]
    (let [state (volatile! nil)]
      (fn
        ;; init
        ([] (rf))
        ;; completion: flush the accumulated counts map downstream once
        ([result] (if @state
                    (try
                      (rf result @state)
                      (finally
                        (vreset! state nil)))
                    (rf result)))
        ;; step: bump the count for this input, emit nothing downstream yet
        ([result input]
         (vswap! state update input (fnil inc 0))
         result)))))
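A quick REPL check of bag on its own, composed with cat the same way the pipeline does (hypothetical input data); nothing is emitted per input, the whole counts map is flushed downstream once at completion:

(into [] (comp (bag) cat) ["a.b.c" "c.d.e" "a.b.c"])
;; => [["a.b.c" 2] ["c.d.e" 1]]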
And here is some sample output:
Referrer 0: h.i.j
Number of entries associated with this referrer: 399065
Referrer 1: k.l.m
Number of entries associated with this referrer: 400809
Referrer 2: a.b.c
Number of entries associated with this referrer: 400186
Referrer 3: c.d.e
Number of entries associated with this referrer: 399667
Referrer 4: m.n.o
Number of entries associated with this referrer: 400273