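I have a function that computes the frequencies of certain features in a text file and collates the data at the same time. Its output is thousands of frequency distributions stored in a persistent map. As a simple example:
    {"dogs" {"great dane" 2, "poodle" 4}, "cats" {"siamese" 1, "tom" 3}}
And the code that produces this: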
    (import 'java.util.concurrent.Executors)

    ;; defined first so that do-the-thing-1 can refer to it
    (defn deref-vals [species_map]
      (into {} (for [[species fdist] species_map] [species @fdist])))

    (defn do-the-thing-1 [lines species_list]
      ;; we know the full list of species beforehand so to avoid thread contention
      ;; for a single resource, make an atom for each species
      (let [resultdump (reduce #(assoc %1 %2 (atom {})) {} species_list)
            line-processor (fn [line]
                             (fn [] ; return a function that will do the work when invoked
                               (doseq [[species breed] (extract-pairs line)]
                                 (swap! ; increase the count for this species-breed pair
                                   (resultdump species)
                                   update-in [breed] #(+ 1 (or % 0))))))
            pool (Executors/newFixedThreadPool 4)]
        ;; queue up the tasks
        (doseq [future (.invokeAll pool (map line-processor lines))]
          (.get future))
        (.shutdown pool)
        (deref-vals resultdump)))
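For checking the output of the parallel version, a single-threaded sketch of the same counting step can be useful. The name `count-pairs` and the sample `pairs` vector are hypothetical; they stand in for whatever `extract-pairs` returns, which the post does not show.

```clojure
;; Hypothetical input: [species breed] pairs, as extract-pairs might return them.
(def pairs [["dogs" "poodle"] ["dogs" "great dane"] ["cats" "tom"] ["dogs" "poodle"]])

;; Hypothetical helper: count species-breed pairs in a nested map, no atoms or threads.
(defn count-pairs [pairs]
  (reduce (fn [acc [species breed]]
            ;; (fnil inc 0) treats a missing count as 0 before incrementing
            (update-in acc [species breed] (fnil inc 0)))
          {}
          pairs))

(count-pairs pairs) ; {"dogs" {"poodle" 2, "great dane" 1}, "cats" {"tom" 1}}
```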
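This works well. The problem is that I need to convert the frequency distributions into probability distributions before I use them, e.g.
    {"dogs" {"great dane" 1/3, "poodle" 2/3}, "cats" {"siamese" 1/4, "tom" 3/4}}
Here is the function that does this: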
    (defn freq->prob
      "Converts a frequency distribution into a probability distribution"
      [fdist]
      (let [sum (apply + (vals fdist))]
        (persistent!
          (reduce
            (fn [dist [key val]] (assoc! dist key (/ val sum)))
            (transient fdist)
            (seq fdist)))))
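Doing this conversion on the fly, as the next step in the processing pipeline consumes the distributions, gives reasonable speed. It also does a fair amount of redundant work, because some distributions are used more than once. When I modified my function to perform the conversion in parallel before returning the result, the later processing stages slowed down dramatically.
Here is the modified function: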
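One way to avoid the redundant conversions without converting everything up front is to wrap each distribution in a `delay`, so that each conversion runs at most once, on first use. This is only a sketch: `lazy-probs` is a hypothetical helper, and the `freq->prob` below is a simplified stand-in for the version in the post.

```clojure
;; Simplified stand-in for the post's freq->prob (no transients).
(defn freq->prob [fdist]
  (let [sum (apply + (vals fdist))]
    (into {} (for [[k v] fdist] [k (/ v sum)]))))

;; Hypothetical helper: defer each conversion until it is first dereferenced.
(defn lazy-probs [fdist-map]
  (into {} (for [[species fdist] fdist-map]
             [species (delay (freq->prob fdist))])))

(def probs (lazy-probs {"dogs" {"great dane" 2, "poodle" 4}
                        "cats" {"siamese" 1, "tom" 3}}))

;; Dereferencing forces the conversion; later derefs reuse the cached value.
@(probs "dogs") ; {"great dane" 1/3, "poodle" 2/3}
```

Distributions that are never consumed are never converted, and those used more than once are converted only once.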
    (defn do-the-thing-2 [lines species_list]
      ;; we know the full list of species beforehand so to avoid thread contention
      ;; for a single resource, make an atom for each species
      (let [resultdump (reduce #(assoc %1 %2 (atom {})) {} species_list)
            line-processor (fn [line]
                             (fn [] ; return a function that will do the work when invoked
                               (doseq [[species breed] (extract-pairs line)]
                                 (swap! ; increase the count for this species-breed pair
                                   (resultdump species)
                                   update-in [breed] #(+ 1 (or % 0))))))
            pool (Executors/newFixedThreadPool 4)]
        ;; queue up the tasks
        (doseq [future (.invokeAll pool (map line-processor lines))]
          (.get future))
        ;; this is the only bit that has been added
        (doseq [future (.invokeAll pool (map
                                          (fn [fdist_atom]
                                            #(reset! fdist_atom (freq->prob @fdist_atom)))
                                          (vals resultdump)))]
          (.get future))
        (.shutdown pool)
        (deref-vals resultdump)))
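So yes, although the returned data is the same, this makes everything afterwards about 10 times slower than simply calling freq->prob each time the resulting map is accessed. Can anyone suggest why this might be, or what I can do about it?
If I change the freq->prob function to create floats or doubles instead of fractions, then precomputing the probability distributions does perform better than generating them on the fly. Could it be that fractions created inside an atom run slower than fractions created outside one? I just ran some simple tests that suggest this is not the case, so something strange must be going on.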
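The post does not show the floating-point variant, so the following is only a sketch of what it might look like; the name `freq->prob-double` is hypothetical. Dividing two integers in Clojure yields a `clojure.lang.Ratio`, which is backed by `BigInteger` values, whereas dividing by a double yields a primitive-backed double. This shows the difference in representation; it does not by itself explain the slowdown described above.

```clojure
;; Hypothetical variant of freq->prob that yields doubles instead of Ratios.
(defn freq->prob-double [fdist]
  (let [sum (double (apply + (vals fdist)))]
    (into {} (for [[k v] fdist] [k (/ v sum)]))))

(freq->prob-double {"siamese" 1, "tom" 3}) ; {"siamese" 0.25, "tom" 0.75}

;; For comparison, integer division produces a Ratio:
(class (/ 1 4)) ; clojure.lang.Ratio
```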
Answer 0 (score: 1)
I'm not 100% sure I follow your logic, but your map function here:
    (map
      (fn [fdist_atom]
        #(reset! fdist_atom (freq->prob @fdist_atom)))
      (vals resultdump))
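doesn't look right. If you want to update an atom based on its old value, swap! is a better fit than reset! called with a function applied to the atom's dereferenced value. This seems better: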
    (map
      ;; still return a thunk, so the result can be handed to .invokeAll
      (fn [fdist_atom] #(swap! fdist_atom freq->prob))
      (vals resultdump))
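A minimal sketch of the difference, using a throwaway atom `a` that is not part of the original code: `reset!` on a separately dereferenced value is a read followed by a write, while `swap!` applies the function to the current value atomically and retries if the atom changed in the meantime.

```clojure
(def a (atom {"siamese" 1, "tom" 3}))

;; Read and write are two separate steps here, so an update made by another
;; thread in between would be silently overwritten.
(reset! a (assoc @a "manx" 1))

;; swap! passes the current value to the function and retries on conflict.
(swap! a update-in ["tom"] inc)

@a ; {"siamese" 1, "tom" 4, "manx" 1}
```

In the question's code each atom is converted by exactly one task after all the counting has finished, so the two forms should give the same result there; `swap!` is simply the more idiomatic choice.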
Answer 1 (score: 0)
Regarding the problem of converting to probability distributions:
If you rewrite 'freq-prob' like this:
    (defn cnv-freq [m]
      (let [t (apply + (vals m))]
        (into {} (map (fn [[k v]] [k (/ v t)]) m))))

    (defn freq-prob [m]
      (into {} (pmap (fn [[k v]] [k (cnv-freq v)]) m)))
you can enable or disable parallel execution by changing 'pmap' to 'map'.
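Rather than editing the definition to switch modes, the mapping function could be passed in as an argument. This is a sketch only: `freq-prob-with` and `freqs` are hypothetical names, and `cnv-freq` is repeated so the block runs on its own.

```clojure
(defn cnv-freq [m]
  (let [t (apply + (vals m))]
    (into {} (map (fn [[k v]] [k (/ v t)]) m))))

;; Hypothetical variant: the caller chooses map (sequential) or pmap (parallel).
(defn freq-prob-with [mapper m]
  (into {} (mapper (fn [[k v]] [k (cnv-freq v)]) m)))

(def freqs {"dogs" {"great dane" 2, "poodle" 4}
            "cats" {"siamese" 1, "tom" 3}})

(freq-prob-with map freqs)
;; {"dogs" {"great dane" 1/3, "poodle" 2/3}, "cats" {"siamese" 1/4, "tom" 3/4}}

;; Both modes produce the same map.
(= (freq-prob-with map freqs) (freq-prob-with pmap freqs)) ; true
```

Because pmap runs on Clojure's future thread pool, a standalone script that uses it may need to call `(shutdown-agents)` before it exits.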