全外连接Clojure中的两个映射序列

时间:2014-02-06 19:00:16

标签: clojure outer-join

假设:

(def seq1 ({:id 1 :val 10} {:id 2 :val 20}))
(def seq2 ({:id 1 :val 12} {:id 3 :val 30}))

在每个序列中,:id值保证在该序列中是唯一的,但不一定是有序的。

这两个地图序列如何通过:id键连接,以便结果如图所示?

{1 [{:id 1 :val 10} {:id 1 :val 12}],
 2 [{:id 2 :val 20} nil            ],
 3 [nil             {:id 3 :val 30}]}

最终结果是成对的地图。这类似于完全外部连接,其中不仅交叉点,而且两个集合的差异也包含在输出中。

4 个答案:

答案 0 :(得分:3)

以下是我提出的答案,但我确信它可以更优雅,或者可能有更好的表现。

(defn seq-to-map [seq key]
  (into {} (map (fn [{id key :as m}] [id m]) seq)))

(defn outer-join-maps [seq1 seq2 key]
  (let [map1 (seq-to-map seq1 key)
        map2 (seq-to-map seq2 key)
        allkeys (set (clojure.set/union (keys map1) (keys map2)))]
    (into {} (map (fn [k] [k [(get map1 k) (get map2 k)]]) allkeys))))

以下测试通过:

(fact {:midje/description "Sequence to map"}

      (seq-to-map [{:a 1, :b 1} {:a 2, :b 2}] :a)
        => {1 {:a 1, :b 1}, 2 {:a 2, :b 2}}

      (seq-to-map [{:a 1, :b 1} {:a 1, :b 2}] :a)
        => {1 {:a 1, :b 2}} ; takes last value when a key is repeated

      (seq-to-map [] :a)
        => {})

(fact {:midje/description "Sequence merging"}
      (let [seq1 [{:id 1 :val 10} {:id 2 :val 20}]
            seq2 [{:id 1 :val 12} {:id 3 :val 30}]]

        (outer-join-maps seq1 seq2 :id)
          => {1 [{:id 1 :val 10} {:id 1 :val 12}],
              2 [{:id 2 :val 20} nil],
              3 [nil {:id 3 :val 30}]}))

答案 1 :(得分:2)

你的答案和其他任何东西一样好,但是我会把它写成

(defn outer-join [field a b]
  (let [lookup #(get % field)
        indexed (for [coll [a b]]
                  (into {} (map (juxt lookup identity) coll)))]
    (into {} (for [key (distinct (mapcat keys indexed))]
               [key (map #(get % key) indexed)]))))

答案 2 :(得分:2)

这是另一个版本,在我的基准测试中使用输入大小为2 + 2,90 + 90,900 + 900,90000 + 99000和300000 + 300000是迄今为止最快的。

(defn outer-join [k xs ys]
  (let [gxs (group-by #(get % k) xs)
        gys (group-by #(get % k) ys)
        kvs (concat (keys gxs) (keys gys))]
    (persistent!
     (reduce (fn [out k]
               (let [l (first (get gxs k))
                     r (first (get gys k))]
                 (assoc! out k [l r])))
             (transient {})
             kvs))))

(我尝试将关键序列包裹在distinct中,但结果导致基准测试速度减慢,涉及小到中等大的输入。这是有道理的:我们需要走两个关键序列无论如何,我们为每个键做的工作量非常小,以至于可能需要做更多工作来避免它。)

这是一个完整性检查和一些Criterium基准测试(将amalloy的版本重命名为outer-join*):

(let [xs [{:id 1 :val 10} {:id 2 :val 20}]
      ys [{:id 1 :val 12} {:id 3 :val 30}]]
  (assert (= (outer-join :id xs ys)
             (outer-join* :id xs ys)
             (outer-join-maps xs ys :id)))
  (c/bench (outer-join :id xs ys))
  (c/bench (outer-join* :id xs ys))
  (c/bench (outer-join-maps xs ys :id)))
WARNING: Final GC required 3.296446000194027 % of runtime
Evaluation count : 17099160 in 60 samples of 284986 calls.
             Execution time mean : 3.589256 µs
    Execution time std-deviation : 34.976485 ns
   Execution time lower quantile : 3.544196 µs ( 2.5%)
   Execution time upper quantile : 3.666515 µs (97.5%)
                   Overhead used : 2.295807 ns
Evaluation count : 6596160 in 60 samples of 109936 calls.
             Execution time mean : 9.107578 µs
    Execution time std-deviation : 82.176826 ns
   Execution time lower quantile : 8.993900 µs ( 2.5%)
   Execution time upper quantile : 9.295188 µs (97.5%)
                   Overhead used : 2.295807 ns

Found 2 outliers in 60 samples (3.3333 %)
    low-severe   2 (3.3333 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers
Evaluation count : 9298740 in 60 samples of 154979 calls.
             Execution time mean : 6.592289 µs
    Execution time std-deviation : 63.929382 ns
   Execution time lower quantile : 6.506403 µs ( 2.5%)
   Execution time upper quantile : 6.749262 µs (97.5%)
                   Overhead used : 2.295807 ns

Found 4 outliers in 60 samples (6.6667 %)
    low-severe   4 (6.6667 %)
 Variance from outliers : 1.6389 % Variance is slightly inflated by outliers

(let [xs (map (fn [id] {:id id :val (* 10 id)}) (range 90))
      ys (map (fn [id] {:id id :val (* 20 id)}) (range 10 100))]
  (assert (= (outer-join :id xs ys)
             (outer-join* :id xs ys)
             (outer-join-maps xs ys :id)))
  (c/bench (outer-join :id xs ys))
  (c/bench (outer-join* :id xs ys))
  (c/bench (outer-join-maps xs ys :id)))
Evaluation count : 413760 in 60 samples of 6896 calls.
             Execution time mean : 147.182107 µs
    Execution time std-deviation : 1.282179 µs
   Execution time lower quantile : 145.103445 µs ( 2.5%)
   Execution time upper quantile : 149.658348 µs (97.5%)
                   Overhead used : 2.295807 ns
Evaluation count : 256920 in 60 samples of 4282 calls.
             Execution time mean : 238.166905 µs
    Execution time std-deviation : 1.987980 µs
   Execution time lower quantile : 235.211277 µs ( 2.5%)
   Execution time upper quantile : 242.255072 µs (97.5%)
                   Overhead used : 2.295807 ns
Evaluation count : 362760 in 60 samples of 6046 calls.
             Execution time mean : 167.301109 µs
    Execution time std-deviation : 1.616075 µs
   Execution time lower quantile : 164.534670 µs ( 2.5%)
   Execution time upper quantile : 170.757257 µs (97.5%)
                   Overhead used : 2.295807 ns

(let [xs (map (fn [id] {:id id :val (* 10 id)}) (range 900))
      ys (map (fn [id] {:id id :val (* 20 id)}) (range 100 1000))]
  (assert (= (outer-join :id xs ys)
             (outer-join* :id xs ys)
             (outer-join-maps xs ys :id)))
  (c/bench (outer-join :id xs ys))
  (c/bench (outer-join* :id xs ys))
  (c/bench (outer-join-maps xs ys :id)))
Evaluation count : 33840 in 60 samples of 564 calls.
             Execution time mean : 1.754723 ms
    Execution time std-deviation : 29.229644 µs
   Execution time lower quantile : 1.709219 ms ( 2.5%)
   Execution time upper quantile : 1.805009 ms (97.5%)
                   Overhead used : 2.295807 ns
Evaluation count : 22740 in 60 samples of 379 calls.
             Execution time mean : 2.559172 ms
    Execution time std-deviation : 44.520222 µs
   Execution time lower quantile : 2.490201 ms ( 2.5%)
   Execution time upper quantile : 2.657706 ms (97.5%)
                   Overhead used : 2.295807 ns

Found 2 outliers in 60 samples (3.3333 %)
    low-severe   2 (3.3333 %)
 Variance from outliers : 6.2842 % Variance is slightly inflated by outliers
Evaluation count : 30000 in 60 samples of 500 calls.
             Execution time mean : 1.999194 ms
    Execution time std-deviation : 25.723647 µs
   Execution time lower quantile : 1.962350 ms ( 2.5%)
   Execution time upper quantile : 2.045836 ms (97.5%)
                   Overhead used : 2.295807 ns

巨大的输入(不包括outer-join-maps):

(let [xs (map (fn [id] {:id id :val (* 10 id)}) (range 300000))
      ys (map (fn [id] {:id id :val (* 20 id)}) (range 100000 400000))]
  (assert (= (outer-join :id xs ys)
             (outer-join* :id xs ys)
             (outer-join-maps xs ys :id)))
  (c/bench (outer-join :id xs ys))
  (c/bench (outer-join* :id xs ys)))
WARNING: Final GC required 13.371566110062922 % of runtime
Evaluation count : 120 in 60 samples of 2 calls.
             Execution time mean : 772.532296 ms
    Execution time std-deviation : 12.710681 ms
   Execution time lower quantile : 744.832577 ms ( 2.5%)
   Execution time upper quantile : 801.098417 ms (97.5%)
                   Overhead used : 2.295807 ns

Found 6 outliers in 60 samples (10.0000 %)
    low-severe   2 (3.3333 %)
    low-mild     3 (5.0000 %)
    high-mild    1 (1.6667 %)
 Variance from outliers : 5.3156 % Variance is slightly inflated by outliers
WARNING: Final GC required 15.51698960336361 % of runtime
Evaluation count : 120 in 60 samples of 2 calls.
             Execution time mean : 949.508151 ms
    Execution time std-deviation : 32.952708 ms
   Execution time lower quantile : 911.054447 ms ( 2.5%)
   Execution time upper quantile : 1.031623 sec (97.5%)
                   Overhead used : 2.295807 ns

Found 4 outliers in 60 samples (6.6667 %)
    low-severe   4 (6.6667 %)
 Variance from outliers : 20.6517 % Variance is moderately inflated by outliers

答案 3 :(得分:1)

如果您不需要每个没有给定密钥的地图的nil,那么merge-with可以非常轻松地处理这个问题。

user> (def seq1 [{:id 1 :val 10} {:id 2 :val 20}])
#'user/seq1 

user> (def seq2 [{:id 1 :val 12} {:id 3 :val 30}]) 
#'user/seq2 

user> (def data (concat seq1 seq2)) 
#'user/data

user> (reduce (partial merge-with (comp vec concat)) 
              (map #(hash-map (:id %) [%]) data))

{1 [{:val 10, :id 1} {:val 12, :id 1}], 
 2 [{:val 20, :id 2}], 
 3 [{:val 30, :id 3}]}