How do I find the differences between 2 data sets?

Time: 2009-08-14 08:07:08

Tags: clojure

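For example, suppose I have 2 pipe-delimited files containing bookmark data. How do I read in the data and then determine the differences between the two sets?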

Input set #1: bookmarks.csv

2|www.cnn.com|News|This is CNN
3|www.msnbc.com|Search|
4|news.ycombinator.com|News|Tech News
5|bing.com|Search|Competitor

Input set #2: bookmarks2.csv

1|www.google.com|Search|King of Search
2|www.cnn.com|News|This is CNN
3|www.msnbc.com|Search|New Comment
4|news.ycombinator.com|News|Tech News

Output

Id #1 is missing in set #1
Id #5 is missing in set #2
Id #3 is different:
  -> www.msnbc.com|Search|
  -> www.msnbc.com|Search|New Comment

4 answers:

Answer 0: (score: 5)

(use '[clojure.contrib str-utils duck-streams pprint]
     '[clojure set])

(defn read-bookmarks [filename]
  (apply hash-map
         (mapcat #(re-split #"\|" % 2)
                 (read-lines filename))))

(defn diff-bookmarks [filename1 filename2]
  (let [f1 (read-bookmarks filename1)
        f2 (read-bookmarks filename2)
        k1 (set (keys f1))
        k2 (set (keys f2))
        missing-in-1 (difference k2 k1)
        missing-in-2 (difference k1 k2)
        present-but-different (filter #(not= (f1 %) (f2 %))
                                      (intersection k1 k2))]
    (cl-format nil "~{Id #~a is missing in set #1~%~}~{Id #~a is missing in set #2~%~}~{~{Id #~a is different~%  -> ~a~%  -> ~a~%~}~}"
               missing-in-1
               missing-in-2
               (map #(list % (f1 %) (f2 %))
                    present-but-different))))

(print (diff-bookmarks "bookmarks.csv" "bookmarks2.csv"))

Answer 1: (score: 3)

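Split each file with a regexp using re-seq and build a set from the result, e.g. (set (re-seq ...)), then call (difference set1 set2) to find the things in set 1 that are not in set 2. Reverse the arguments to find the items in set 2 that are not in set 1.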

See http://clojure.org/data_structures for more information on Clojure sets.
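
The set-based idea above can be sketched roughly as follows. This is a minimal illustration, not the answerer's code: it assumes a current Clojure with `clojure.set`, inlines the question's records as strings instead of reading files, and `record-set` is a hypothetical helper name.

```clojure
(require '[clojure.set :as set])

;; Records from the question, inlined so the sketch runs without files.
(def text1
  (str "2|www.cnn.com|News|This is CNN\n"
       "3|www.msnbc.com|Search|\n"
       "4|news.ycombinator.com|News|Tech News\n"
       "5|bing.com|Search|Competitor"))

(def text2
  (str "1|www.google.com|Search|King of Search\n"
       "2|www.cnn.com|News|This is CNN\n"
       "3|www.msnbc.com|Search|New Comment\n"
       "4|news.ycombinator.com|News|Tech News"))

(defn record-set
  "Hypothetical helper: splits text into lines with re-seq and
  returns them as a set of whole records."
  [text]
  (set (re-seq #"[^\n]+" text)))

;; Records present in set 1 but not in set 2, and the reverse.
(def only-in-1 (set/difference (record-set text1) (record-set text2)))
(def only-in-2 (set/difference (record-set text2) (record-set text1)))
```

Because whole lines are compared, a changed record (id 3 here) shows up in both differences. Telling "missing" apart from "changed" would still require comparing the ids, as the other answers do.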

Answer 2: (score: 3)

Here is my attempt at a functional approach to this problem:

  • Create 2 maps, one for each file
  • Find the items missing from either map using dissoc
  • Find the items that are shared by both maps but differ, using intersection and filter

Code

(ns diffset
  (:use [clojure.contrib.duck-streams]
        [clojure.set]))

(def file1 "bookmarks.csv")
(def file2 "bookmarks2.csv")

(defn split-record
  "split line into (id, bookmark)"
  [line]
  (map #(apply str %)
       (split-with #(not (= % \|)) line)))

(defn map-from-file
  "create initial map from file f"
  [f]
  (with-open [r (reader f)]
    (doall (apply hash-map (apply concat (map split-record
                                              (line-seq r)))))))

(defn missing
  "return seq of all ids in x that are not in y"
  [x y]
  (keys (apply dissoc x (keys y))))

(defn different
  "return seq of all ids that match but have different bookmark string"
  [x y]
  (let [match-keys (intersection (set (keys x)) (set (keys y)))]
    (filter #(not (= (get x %)
                     (get y %)))
            match-keys)))

(defn diff
  "print out differences between two bookmark files"
  [file1 file2]
  (let [[s1 s2] (map map-from-file [file1 file2])]
    (dorun (map #(println (format "Id #%s is missing in set #1" %))
                (missing s2 s1)))
    (dorun (map #(println (format "Id #%s is missing in set #2" %))
                (missing s1 s2)))
    (dorun (map #(println (format "Id #%s is different:" %) "\n"
                          " ->" (get s1 %) "\n"
                          " ->" (get s2 %)) (different s1 s2)))))

Result

user> (use 'diffset)
nil
user> (diff file1 file2)
Id #1 is missing in set #1
Id #5 is missing in set #2
Id #3 is different: 
  -> |www.msnbc.com|Search| 
  -> |www.msnbc.com|Search|New Comment
nil

Answer 3: (score: 1)

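Put the first data set into a dictionary (hash table), with the id as the key.

Read the second data set line by line, looking up each id in the hash.

  • If the id is not in the hash, output: id is missing in set 1
  • If the value in the hash is different, output: id is different
  • Store the id in a second hash table

Then walk through the keys of the first hash table.

  • Check whether they are also in the second hash table. If not, output: id is missing in set 2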
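
The steps above can be sketched in Clojure roughly like this. It is only an illustration under stated assumptions: it targets a current Clojure with `clojure.string`, takes sequences of lines rather than file names, assumes every line contains at least one pipe, and `parse-line` and `diff-report` are hypothetical names.

```clojure
(require '[clojure.string :as str])

(defn parse-line
  "Hypothetical helper: splits \"id|rest\" into [id rest] at the first pipe.
  Assumes the line contains at least one pipe."
  [line]
  (str/split line #"\|" 2))

(defn diff-report
  "Returns a vector of report strings for two sequences of lines."
  [lines1 lines2]
  (let [table1   (into {} (map parse-line lines1)) ; first hash table: id -> rest of record
        records2 (map parse-line lines2)
        seen     (set (map first records2))        ; ids seen in the second data set
        ;; Pass over the second data set: report missing or changed ids.
        pass-2   (keep (fn [[id value]]
                         (cond
                           (not (contains? table1 id))
                           (str "Id #" id " is missing in set #1")

                           (not= value (table1 id))
                           (str "Id #" id " is different")))
                       records2)
        ;; Pass over the keys of the first table: report ids never seen.
        pass-1   (for [id (keys table1)
                       :when (not (contains? seen id))]
                   (str "Id #" id " is missing in set #2"))]
    (vec (concat pass-2 pass-1))))

(def report
  (diff-report ["2|www.cnn.com|News|This is CNN"
                "3|www.msnbc.com|Search|"
                "4|news.ycombinator.com|News|Tech News"
                "5|bing.com|Search|Competitor"]
               ["1|www.google.com|Search|King of Search"
                "2|www.cnn.com|News|This is CNN"
                "3|www.msnbc.com|Search|New Comment"
                "4|news.ycombinator.com|News|Tech News"]))
```

Returning the messages as data rather than printing them keeps the function easy to test; a caller could print them with `(run! println report)`.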