Question

I have a vector of maps that looks like the following, although each dataset may contain up to 100 maps:

data({ a:a b:"2" c:t}{ a:b b:"0" c:t}{ a:c b:"-4" c:t}{ a:d b:"100" c:t}{ a:e b:"50" c:t})

I need to compute the sum of the b values:

values(map :b data)
sum(reduce + (map read-string values)

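This gives the desired result, but it takes a long time to compute, about 1/10 of a second. I am doing this for several hundred thousand datasets, so the total processing time is considerable.

Can anyone suggest a more efficient or faster way of doing this?

Thanks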

Answer 1

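In Clojure 1.2.1, one tenth of your full 100,000-dataset scenario (that is, 10,000 datasets) completes in slightly over 1/10 of a second. The code is basically yours (which isn't really valid Clojure syntax, but we get the gist), yet it somehow runs up to 10,000 times faster.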

;; generate 10,000 datasets of 100 maps having 10 fields each

(def scenario-data
    (vec (repeatedly 10000
                     (fn [] (vec (repeatedly 100 (fn [] (zipmap
                                                            [:a :b :c :d :e :f :g :h :i :j]
                                                            (repeatedly (fn [] (str (- (rand-int 2000) 1000))))))))))))


;; now map the datasets into the reduced sums of the parsed :b fields of each dataset

(time (doall (map (fn [dataset] (reduce (fn [acc mp] (+ acc (Integer/parseInt (:b mp)))) 0 dataset))
                  scenario-data)))
"Elapsed time: 120.43267 msecs"
=> (2248 -6383 7890 ...)

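Because this scenario is quite memory-hungry (10,000 datasets ≈ 600 MB, with the whole computation using ~4 GB), I cannot run the 100,000-dataset scenario on my home computer. I can run it, however, if I don't keep the datasets in memory and instead map over a lazy sequence without holding on to its head.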

(time (doall (map (fn [dataset] (reduce (fn [acc mp] (+ acc (Integer/parseInt (:b mp)))) 0 dataset))
                  (repeatedly 100000
                              (fn [] (repeatedly 100 (fn [] (zipmap
                                                              [:a :b :c :d :e :f :g :h :i :j]
                                                              (repeatedly (fn [] (str (- (rand-int 2000) 1000))))))))))))
"Elapsed time: 30242.371308 msecs"
=> (-4975 -843 1560 ...)

The 100,000-dataset version took 30 seconds, and that includes all the time needed to generate the data. Using pmap cuts the time roughly in half (4 cores).
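The pmap variant is not shown in the answer. A minimal sketch of what it might look like, assuming the same dataset shape, is given here. The names sum-b and datasets are illustrative and not from the original code, and the data is far too small for any speed-up to show.

```clojure
;; Hypothetical helper: sum the parsed :b strings of one dataset.
(defn sum-b [dataset]
  (reduce (fn [acc m] (+ acc (Integer/parseInt (:b m)))) 0 dataset))

;; Illustrative data: two tiny datasets instead of 100,000 large ones.
(def datasets
  [[{:a "a" :b "2"} {:a "b" :b "0"} {:a "c" :b "-4"}]
   [{:a "d" :b "100"} {:a "e" :b "50"}]])

;; pmap is used like map but computes the elements on several threads.
;; doall forces the lazy result so that the work actually happens here.
(def sums (doall (pmap sum-b datasets)))
;; sums => (-2 150)
```

When this runs as a script rather than in a REPL, calling (shutdown-agents) at the end lets the JVM exit promptly, since pmap is built on futures.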

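Edit: On a machine with enough memory, creating the fully realized 100,000 datasets took 135 seconds. Running the summation code over them takes about 1500 ms. Using pmap reduces that to about 750 ms. The read-string version is about 3.5 times slower.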

TL;DR: Given enough memory, the algorithm you posted can run on a 100,000-dataset scenario in under 1 second.

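Please post your complete code, including how you read the datasets and keep them in memory, and make sure that this time both the syntax and the observations are accurate. This may be more of a memory problem, caused by not reading the datasets lazily from the source.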
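How the asker stores the datasets is unknown, so the following is only a sketch of what reading them lazily could look like. It assumes a hypothetical format with one dataset per line, each written as an EDN vector of maps. The names sample-input, sum-b-line and lazy-sums are made up for the illustration, and a string stands in for the real file.

```clojure
(require '[clojure.edn :as edn]
         '[clojure.java.io :as io])

;; Assumed (hypothetical) format: one dataset per line, as an EDN vector of maps.
(def sample-input
  (str "[{:b \"2\"} {:b \"0\"} {:b \"-4\"}]\n"
       "[{:b \"100\"} {:b \"50\"}]\n"))

;; Parse one line into a dataset and sum its :b strings.
(defn sum-b-line [line]
  (reduce (fn [acc m] (+ acc (Long/parseLong (:b m))))
          0
          (edn/read-string line)))

;; line-seq yields the lines lazily, so each dataset can be garbage-collected
;; once it has been summed; only the resulting sums are retained.
;; doall has to run inside with-open, before the reader is closed.
(def lazy-sums
  (with-open [rdr (io/reader (java.io.StringReader. sample-input))]
    (doall (map sum-b-line (line-seq rdr)))))
;; lazy-sums => (-2 150)
```

For a real file, the StringReader would be replaced by the file's path passed to io/reader.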

Answer 2

You could try using Integer/parseInt or Long/parseLong instead of the more general read-string.

[Edit]

A simple test with Clojure 1.5.1 shows parseInt to be about 10 times faster:

user=> (time (dotimes [n 100000] (read-string "10")))
"Elapsed time: 142.516849 msecs"
nil

user=> (time (dotimes [n 100000] (Integer/parseInt "10")))
"Elapsed time: 12.754187 msecs"
nil
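Applied to the question's data, the suggestion might look like the sketch below. The data is written as valid Clojure, with placeholder values for :a and :c, which are assumptions since the question's notation is informal. The name sum-of-b is illustrative.

```clojure
;; The example data from the question, rewritten as valid Clojure.
;; The values of :a and :c are placeholders; only :b matters for the sum.
(def data
  [{:a "a" :b "2"   :c true}
   {:a "b" :b "0"   :c true}
   {:a "c" :b "-4"  :c true}
   {:a "d" :b "100" :c true}
   {:a "e" :b "50"  :c true}])

;; Parse each :b string with Long/parseLong and add it to the accumulator,
;; in a single pass and without building an intermediate sequence.
(defn sum-of-b [dataset]
  (reduce (fn [acc m] (+ acc (Long/parseLong (:b m)))) 0 dataset))

(sum-of-b data) ;; => 148
```

Unlike read-string, Long/parseLong throws a NumberFormatException on anything that is not a plain integer, so this only fits if every :b value is an integer string.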

Answer 3

One possibility is to use reducers, which can run in parallel:

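(require '[clojure.core.reducers :as r])
;; r/reduce is sequential; r/fold is the reducers operation that can run in
;; parallel, and only over a vector (or map), hence the vec
(r/fold + (r/map read-string (vec values)))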

For small test cases this won't improve the runtime, but for large datasets it should.
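The reason small inputs see no benefit is that r/fold only splits work above a partition size, which defaults to 512 elements, so a single dataset of 100 maps is reduced on one thread. A minimal sketch with the partition size written out, assuming Clojure 1.5 or later and using made-up input data:

```clojure
(require '[clojure.core.reducers :as r])

;; Illustrative input: a vector of 10,000 numeric strings.
;; It has to be a vector, because fold falls back to a plain reduce on lazy seqs.
(def values (mapv str (range 10000)))

;; (r/fold n combinef reducef coll) splits the vector into chunks of about n
;; elements, reduces each chunk with reducef, and merges the partial sums
;; with combinef. Calling + with no arguments supplies the starting value 0.
(def total
  (r/fold 512
          +
          (fn [acc s] (+ acc (Long/parseLong s)))
          values))
;; total => 49995000
```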

Sum string values from a vector of maps in Clojure
