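Clojure: simple Markov data transformation

Time: 2013-11-25 21:38:13

Tags: clojure hashmap markov

If I have a vector of words, e.g. ["john" "said" ... "john" "walked" ...], I want to build a hash map from each word to the number of occurrences of each word that follows it, e.g. {"john" {"said" 1 "walked" 1 "kicked" 3}}

The best solution I have come up with is to walk the list recursively by index and use assoc to keep updating the hash map, but that seems really messy. Is there a more idiomatic way to do this?

2 answers:

Answer 0: (score: 6)

Given that you have words: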

(def words ["john" "said" "lara" "chased" "john" "walked" "lara" "chased"])

Use this transformation fn:

(defn transform
  [words]
  (->> words
       (partition 2 1)
       (reduce (fn [acc [w next-w]]
                 ;; could be shortened to #(update-in %1 %2 (fnil inc 0))
                 (update-in acc
                            [w next-w]
                            (fnil inc 0))) 
               {})))

(transform words)
;; {"walked" {"lara" 1}, "chased" {"john" 1}, "lara" {"chased" 2}, "said" {"lara" 1}, "john" {"walked" 1, "said" 1}}
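To see why this works, it may help to look at the two building blocks in isolation. The following is only a small sketch on the same sample input; the names pairs and inc-or-one are made up for illustration and are not part of the answer.

```clojure
(def words ["john" "said" "lara" "chased" "john" "walked" "lara" "chased"])

;; (partition 2 1 coll) yields overlapping pairs: each word with its successor.
(def pairs (partition 2 1 words))
;; (("john" "said") ("said" "lara") ("lara" "chased") ("chased" "john")
;;  ("john" "walked") ("walked" "lara") ("lara" "chased"))

;; (fnil inc 0) calls inc, substituting 0 when the argument is nil.
;; inc-or-one is a hypothetical name used only in this sketch.
(def inc-or-one (fnil inc 0))
(inc-or-one nil) ;; 1
(inc-or-one 4)   ;; 5

;; update-in creates the missing nested map on demand, so a counter
;; that does not exist yet starts from nil and becomes 1.
(update-in {} ["john" "said"] inc-or-one)
;; {"john" {"said" 1}}
```

Reducing over the pairs with update-in, as transform does, simply repeats that last step once per pair.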

EDIT: You can gain performance by using transient hash maps like this:

(defn transform-fast
  [words]
  (->> (map vector words (next words))
       (reduce (fn [acc [w1 w2]]
                 (let [c-map (get acc w1 (transient {}))]
                   (assoc! acc w1 (assoc! c-map w2
                                          (inc (get c-map w2 0))))))
               (transient {}))
       persistent!
       (reduce-kv (fn [acc w1 c-map]
                    (assoc! acc w1 (persistent! c-map)))
                  (transient {}))
       persistent!))

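Obviously, the resulting source code does not look as nice, and this kind of optimization should only be done when performance is critical.

(Criterium says it beats Michał Marczyk's transform*, being about twice as fast on King Lear.)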
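The transient pattern used in transform-fast may be easier to follow on a simpler problem. Below is a minimal sketch that counts single-word frequencies; count-words is a hypothetical helper written for this illustration, not something from the answer.

```clojure
;; count-words is a hypothetical helper for illustration only.
(defn count-words [words]
  (persistent!                       ; freeze the result into a persistent map
   (reduce (fn [acc w]
             ;; Always continue with the value returned by assoc!;
             ;; the transient passed in must not be reused afterwards.
             (assoc! acc w (inc (get acc w 0))))
           (transient {})            ; mutable-in-place accumulator
           words)))

(count-words ["a" "b" "a"])
;; {"a" 2, "b" 1}
```

transform-fast applies the same idea at two levels (an outer transient map whose values are themselves transients), which is why it needs the extra reduce-kv pass to make the inner maps persistent.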

Answer 1: (score: 5)

(Update: see below for a solution that uses java.util.HashMap for the intermediate products, with the end result still fully persistent. It is the fastest so far, about 2.35 times faster than transform-fast in the King Lear benchmark.)

merge-with-based solution

Here is a faster solution, by a factor of roughly 1.7 on words taken from King Lear (see below for the exact methodology) and almost 3x on the sample words:

(defn transform* [words]
  (apply merge-with
         #(merge-with + %1 %2)
         (map (fn [w nw] {w {nw 1}})
              words
              (next words))))

The function passed to map could alternatively be written as

#(array-map %1 (array-map %2 1)),

although the timings with this approach are not quite as good. (I still include this version in the benchmarks below as transform**.)

First, a sanity check:

;; same input
(def words ["john" "said" "lara" "chased" "john" "walked" "lara" "chased"])

(= (transform words) (transform* words) (transform** words))
;= true

Criterium benchmark using the test input (OpenJDK 1.7 with -XX:+UseConcMarkSweepGC):

;; assumes criterium.core is aliased as c
(do (c/bench (transform words))
    (flush)
    (c/bench (transform* words))
    (flush)
    (c/bench (transform** words)))
Evaluation count : 4345080 in 60 samples of 72418 calls.
             Execution time mean : 13.945669 µs
    Execution time std-deviation : 158.808075 ns
   Execution time lower quantile : 13.696874 µs ( 2.5%)
   Execution time upper quantile : 14.295440 µs (97.5%)
                   Overhead used : 1.612143 ns

Found 2 outliers in 60 samples (3.3333 %)
    low-severe   2 (3.3333 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
Evaluation count : 12998220 in 60 samples of 216637 calls.
             Execution time mean : 4.705608 µs
    Execution time std-deviation : 63.133406 ns
   Execution time lower quantile : 4.605234 µs ( 2.5%)
   Execution time upper quantile : 4.830540 µs (97.5%)
                   Overhead used : 1.612143 ns

Found 1 outliers in 60 samples (1.6667 %)
    low-severe   1 (1.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
Evaluation count : 10847220 in 60 samples of 180787 calls.
             Execution time mean : 5.706852 µs
    Execution time std-deviation : 73.589941 ns
   Execution time lower quantile : 5.560404 µs ( 2.5%)
   Execution time upper quantile : 5.828209 µs (97.5%)
                   Overhead used : 1.612143 ns

Finally, a more interesting benchmark using King Lear as found on Project Gutenberg (I did not bother removing the legal notices and such before processing):

;; assumes clojure.java.io is aliased as io and clojure.string as string
(def king-lear (slurp (io/file "/path/to/pg1128.txt")))

(def king-lear-words
  (-> king-lear
      (string/lower-case)
      (string/replace #"[^a-z]" " ")
      (string/trim)
      (string/split #"\s+")))

(do (c/bench (transform king-lear-words))
    (flush)
    (c/bench (transform* king-lear-words))
    (flush)
    (c/bench (transform** king-lear-words)))
Evaluation count : 720 in 60 samples of 12 calls.
             Execution time mean : 87.012898 ms
    Execution time std-deviation : 833.381589 µs
   Execution time lower quantile : 85.772832 ms ( 2.5%)
   Execution time upper quantile : 88.585741 ms (97.5%)
                   Overhead used : 1.612143 ns
Evaluation count : 1200 in 60 samples of 20 calls.
             Execution time mean : 51.786860 ms
    Execution time std-deviation : 587.029829 µs
   Execution time lower quantile : 50.854355 ms ( 2.5%)
   Execution time upper quantile : 52.940274 ms (97.5%)
                   Overhead used : 1.612143 ns
Evaluation count : 1020 in 60 samples of 17 calls.
             Execution time mean : 61.287369 ms
    Execution time std-deviation : 720.816107 µs
   Execution time lower quantile : 60.131219 ms ( 2.5%)
   Execution time upper quantile : 62.960647 ms (97.5%)
                   Overhead used : 1.612143 ns

java.util.HashMap-based solution

Going all out, it is possible to do even better by using mutable hash maps for the intermediate state and loop / recur to avoid consing while looping over pairs of words:

(defn t9 [words]
  (let [m (java.util.HashMap.)]
    (loop [ws  words
           nws (next words)]
      (if nws
        (let [w  (first ws)
              nw (first nws)]
          (if-let [im ^java.util.HashMap (.get m w)]
            (.put im nw (inc (or (.get im nw) 0)))
            (let [im (java.util.HashMap.)]
              (.put im nw 1)
              (.put m w im)))
          (recur (next ws) (next nws)))
        (persistent!
         (reduce (fn [out k]
                   (assoc! out k
                           (clojure.lang.PersistentHashMap/create
                            ^java.util.HashMap (.get m k))))
                 (transient {})
                 (iterator-seq (.iterator (.keySet m)))))))))

clojure.lang.PersistentHashMap/create is a static method in the PHM class and is admittedly an implementation detail. (It is unlikely to change in the near future, though: currently all map creation in Clojure for the built-in map types goes through static methods like this.)

Sanity check:

(= (transform king-lear-words) (t9 king-lear-words))
;= true

Benchmark results: