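I am trying to merge the entries of a map based on the similarity of their key values, producing a new map in which entries with similar key values are merged into one. Here is my code to illustrate the idea:
Given the dataset: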
(def engineer-visits (incanter.core/dataset ["Engineer" "Credit" "Comments"]
[
["Jonah" 1 "OK"]
["Jonah" 2 "Very good"]
["Joneh" 0 "Not very good"]
["Joneh" 3 "Excellent"]
["Esther" 2 "Missing comment"]
["Esther" 4 "Extraordinary"]
]
))
which has the value:
| Engineer | Credit | Comments |
|----------+--------+-----------------|
| Jonah | 1 | OK |
| Jonah | 2 | Very good |
| Joneh | 0 | Not very good |
| Joneh | 3 | Excellent |
| Esther | 2 | Missing comment |
| Esther | 4 | Extraordinary |
The following produces a map from each engineer to his or her records:
(def by-engineers (incanter.core/$group-by "Engineer" engineer-visits ))
which has the value:
{{"Engineer" "Jonah"}
| Engineer | Credit | Comments |
|----------+--------+-----------|
| Jonah | 1 | OK |
| Jonah | 2 | Very good |
, {"Engineer" "Joneh"}
| Engineer | Credit | Comments |
|----------+--------+---------------|
| Joneh | 0 | Not very good |
| Joneh | 3 | Excellent |
, {"Engineer" "Esther"}
| Engineer | Credit | Comments |
|----------+--------+-----------------|
| Esther | 2 | Missing comment |
| Esther | 4 | Extraordinary |
}
With the function below, I would like to get:
(map-merged-by-key-value-similarity by-engineers 0.8)
{{"Engineer" "Jonah"}
| Engineer | Credit | Comments |
|----------+--------+---------------|
| Jonah | 1 | OK |
| Jonah | 2 | Very good |
| Joneh | 0 | Not very good |
| Joneh | 3 | Excellent |
, {"Engineer" "Esther"}
| Engineer | Credit | Comments |
|----------+--------+-----------------|
| Esther | 2 | Missing comment |
| Esther | 4 | Extraordinary |
}
(defn map-merged-by-key-value-similarity
"From a map produced by $group-by on a dataset, produce a map of the same structure, with key column values merged by similarity."
[a-map threshold]
(let [
column-keys (keys a-map)
key-column-name (->> column-keys
first
keys
first)
;; Deconstruct the key column values from the key of the map, i.e. the pair of column name and column value:
key-column-values (flatten (map vals column-keys))
;; Compute string clusters for the values:
value-simularity-cluster (similarity-cluster key-column-values threshold)
;; Reconstruct the key for the updated map from the clustered column values:
reconstructed-column-value-key-cluster-list (map (fn [cluster]
(map (fn [name]
{key-column-name name})
cluster))
value-simularity-cluster)
representative (fn [cluster] (first cluster)) ; out of a cluster
map-from-cluster-combined-fn (fn [cluster]
; the cluster is a list of maps from key-column-name to the string of the column's value
(if (< 1 (count cluster))
;; combine
(apply merge-with conj-rows (map (fn [key]
{(representative cluster) (a-map key)})
cluster))
;; as is
{(first cluster) (a-map (first cluster))}
))
]
(apply merge (map map-from-cluster-combined-fn reconstructed-column-value-key-cluster-list))
)
)
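The function above does work as expected, but I hope there is a more idiomatic way to implement it. The process is very symmetric: deconstruct the keys and values of the map, process the keys, and reconstruct a map of the same shape. I feel this could be expressed more elegantly. I vaguely remember that in Scala, some monadic operators can be useful for accessing and processing information buried deep inside list structures.
Thanks for any comments or help!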
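Note: similarity-cluster converts a list of strings into a list of lists of strings, where similar strings are placed in the same inner list. It is my own implementation, and its details are not relevant to my question.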
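Since similarity-cluster is described but not shown, here is a minimal sketch of a function with the same shape, for readers who want to run the code above. Everything in it is an assumption rather than the asker's implementation: the names positional-similarity and similarity-cluster-sketch are hypothetical, the similarity measure (fraction of positions holding the same character) is a stand-in for whatever the original uses, and the clustering is a simple greedy pass.

```clojure
;; Hypothetical stand-in for the asker's similarity measure:
;; the fraction of positions at which both strings hold the same character.
(defn positional-similarity
  [a b]
  (if (and (empty? a) (empty? b))
    1.0
    (/ (count (filter true? (map = a b)))
       (double (max (count a) (count b))))))

;; Greedy clustering sketch: each string joins the first cluster whose
;; first element is at least `threshold` similar to it; otherwise it
;; starts a new cluster. Returns a vector of vectors of strings.
(defn similarity-cluster-sketch
  [strings threshold]
  (reduce
   (fn [clusters s]
     (let [idx (first (keep-indexed
                       (fn [i cluster]
                         (when (>= (positional-similarity (first cluster) s)
                                   threshold)
                           i))
                       clusters))]
       (if idx
         (update clusters idx conj s)
         (conj clusters [s]))))
   []
   strings))

(similarity-cluster-sketch ["Jonah" "Joneh" "Esther"] 0.8)
;; => [["Jonah" "Joneh"] ["Esther"]]
```

"Jonah" and "Joneh" agree in four of five positions, so they reach the 0.8 threshold used in the question, while "Esther" shares no position with either.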
Answer 0 (score: 2)
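Things get slightly easier when you work with a table (a vector of maps that share the same keys) instead of an Incanter dataset. There are, however, several functions for converting between the two.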
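Also, although you may consider your similarity-cluster implementation irrelevant, posting at least something similar would make it easier for people to answer your question with working code.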
For measuring similarity between strings I used this Purely Functional Levenshtein Distance as the levenshtein-distance function, treating two strings as similar when they are fewer than 3 edits apart:
(def engineer-visits
[{:comments "OK", :engineer "Jonah", :credit 1}
{:comments "Very good", :engineer "Jonah", :credit 2}
{:comments "Not very good", :engineer "Joneh", :credit 0}
{:comments "Excellent", :engineer "Joneh", :credit 3}
{:comments "Missing comment", :engineer "Esther", :credit 2}
{:comments "Extraordinary", :engineer "Esther", :credit 4}])
(defn similarity-matrix
[coll]
(into {} (for [x coll, y coll
:when (< (levenshtein-distance x y) 3)]
[x y])))
(def similarity
(similarity-matrix (distinct (map :engineer engineer-visits))))
=> {"Jonah" "Joneh", "Joneh" "Joneh", "Esther" "Esther"}
(group-by #(get similarity (:engineer %)) engineer-visits)
=>
{"Joneh"
[{:comments "OK", :engineer "Jonah", :credit 1}
{:comments "Very good", :engineer "Jonah", :credit 2}
{:comments "Not very good", :engineer "Joneh", :credit 0}
{:comments "Excellent", :engineer "Joneh", :credit 3}],
"Esther"
[{:comments "Missing comment", :engineer "Esther", :credit 2}
{:comments "Extraordinary", :engineer "Esther", :credit 4}]}
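It is worth noting that, because the elements of the similarity matrix are put into a hash map, the ["Jonah","Jonah"] key-value pair is overwritten by the ["Jonah","Joneh"] pair that follows it. The same holds for ["Joneh","Jonah"] followed by ["Joneh","Joneh"]. Here this works in our favor: both names end up mapped to the same value, "Joneh".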
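The levenshtein-distance function is linked rather than shown in this answer. As a minimal sketch, assuming a standard row-by-row dynamic-programming formulation (not necessarily the linked implementation), something like the following can be defined before similarity-matrix to make the code above runnable:

```clojure
;; Sketch of an edit-distance function; an assumption standing in for
;; the linked "Purely Functional Levenshtein Distance" implementation.
(defn levenshtein-distance
  "Number of single-character insertions, deletions and substitutions
  needed to turn sequence a into sequence b."
  [a b]
  (let [b (vec b)]
    (peek
     (reduce
      (fn [prev-row [i ca]]
        ;; Build the row for the first (inc i) characters of a.
        (reduce
         (fn [row [j cb]]
           (conj row
                 (min (inc (nth prev-row (inc j)))            ; deletion
                      (inc (peek row))                        ; insertion
                      (+ (nth prev-row j)
                         (if (= ca cb) 0 1)))))               ; substitution
         [(inc i)]
         (map-indexed vector b)))
      ;; Row for the empty prefix of a: distance j to the first j chars of b.
      (vec (range (inc (count b))))
      (map-indexed vector a)))))

(levenshtein-distance "Jonah" "Joneh")
;; => 1
```

With this definition, "Jonah" and "Joneh" are 1 edit apart and therefore fall under the cutoff of 3, while "Esther" is much farther from both.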
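Answer 1 (score: 0)
Inspired by Niels' answer, let me restate my question more clearly. Stripped of its irrelevant parts, this is what my question should have been:
Given a table, a way to cluster the values of the :engineer column, and a way to pick a representative value from each cluster, what is the expression that constructs a map from these representatives to the corresponding rows of the table?
Here is the distilled solution. Thanks again to Niels for his answer.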
(def engineer-visits
[{:comments "OK", :engineer "Jonah", :credit 1}
{:comments "Very good", :engineer "Jonah", :credit 2}
{:comments "Not very good", :engineer "Joneh", :credit 0}
{:comments "Excellent", :engineer "Joneh", :credit 3}
{:comments "Missing comment", :engineer "Esther", :credit 2}
{:comments "Extraordinary", :engineer "Esther", :credit 4}])
(defn clusters [names] '(("Jonah" "Joneh") ("Esther")))
(defn representative [cluster] (first cluster))
(def representatives
(->> engineer-visits
(map :engineer)
distinct
clusters
(map (fn [cluster] (apply merge (map (fn [name] {name (representative cluster)}) cluster))))
(apply merge)
))
(group-by #(get representatives (:engineer %)) engineer-visits)
Result =>
{"Jonah"
[{:comments "OK", :engineer "Jonah", :credit 1}
{:comments "Very good", :engineer "Jonah", :credit 2}
{:comments "Not very good", :engineer "Joneh", :credit 0}
{:comments "Excellent", :engineer "Joneh", :credit 3}],
"Esther"
[{:comments "Missing comment", :engineer "Esther", :credit 2}
{:comments "Extraordinary", :engineer "Esther", :credit 4}]}
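The same idea can be packaged as a single function that takes the clustering and representative-picking steps as arguments. This is only a sketch: group-by-representative, cluster-fn, representative-fn and key-fn are hypothetical names introduced for illustration, and the clustering function in the usage example is a hard-coded stub, as in the solution above.

```clojure
;; Sketch: group table rows by the representative of the cluster that
;; each row's key value belongs to. All names here are illustrative.
(defn group-by-representative
  [cluster-fn representative-fn key-fn rows]
  (let [;; Lookup table from every key value to its cluster's representative.
        lookup (into {}
                     (for [cluster (cluster-fn (distinct (map key-fn rows)))
                           value   cluster]
                       [value (representative-fn cluster)]))]
    (group-by (comp lookup key-fn) rows)))

;; Usage with a stubbed clustering function:
(def sample-visits
  [{:engineer "Jonah"  :credit 1}
   {:engineer "Joneh"  :credit 0}
   {:engineer "Esther" :credit 2}])

(group-by-representative
 (fn [_names] [["Jonah" "Joneh"] ["Esther"]]) ; stub, ignores its input
 first
 :engineer
 sample-visits)
;; => {"Jonah"  [{:engineer "Jonah", :credit 1} {:engineer "Joneh", :credit 0}],
;;     "Esther" [{:engineer "Esther", :credit 2}]}
```

Building the lookup with for and into replaces the nested map and merge calls, and passing the steps as functions keeps the deconstruct, process, and reconstruct stages visible in one place.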