Input set #1: bookmarks.csv

2|www.cnn.com|News|This is CNN
3|www.msnbc.com|Search|
4|news.ycombinator.com|News|Tech News
5|bing.com|Search|Competitor

Input set #2: bookmarks2.csv

1|www.google.com|Search|King of search
2|www.cnn.com|News|This is CNN
3|www.msnbc.com|Search|New Comment
4|news.ycombinator.com|News|Tech News

Output

Id #1 is missing in set #1
Id #5 is missing in set #2
Id #3 is different:
  -> www.msnbc.com|Search|
  -> www.msnbc.com|Search|New Comment

Answer 1

(use '[clojure.contrib str-utils duck-streams pprint]
     '[clojure set])

(defn read-bookmarks [filename]
  (apply hash-map
         (mapcat #(re-split #"\|" % 2)
                 (read-lines filename))))

(defn diff-bookmarks [filename1 filename2]
  (let [f1 (read-bookmarks filename1)
        f2 (read-bookmarks filename2)
        k1 (set (keys f1))
        k2 (set (keys f2))
        missing-in-1 (difference k2 k1)
        missing-in-2 (difference k1 k2)
        present-but-different (filter #(not= (f1 %) (f2 %))
                                      (intersection k1 k2))]
    (cl-format nil "~{Id #~a is missing in set #1~%~}~{Id #~a is missing in set #2~%~}~{~{Id #~a is different~%  -> ~a~%  -> ~a~%~}~}"
               missing-in-1
               missing-in-2
               (map #(list % (f1 %) (f2 %))
                    present-but-different))))

(print (diff-bookmarks "bookmarks.csv" "bookmarks2.csv"))

Answer 2

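Split each file up with a regexp and make a set out of the pieces with (set (re-seq ...)), then call (difference set1 set2) to find the things in set 1 that are not in set 2. Reverse the arguments to find the items in set 2 that are not in set 1.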

See http://clojure.org/data_structures for more information on Clojure sets.
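
The idea above can be sketched roughly as follows. This is only an illustration, not the answerer's code: `line-set` is a hypothetical helper name, each whole line is treated as one set element, and the two strings are sample data shortened from the question rather than the real files (with files you could pass in the result of `slurp`).

```clojure
(require '[clojure.set :as set])

(defn line-set
  "Set of the non-empty lines found in text."
  [text]
  ;; re-seq returns every run of characters that contains no line break
  (set (re-seq #"[^\r\n]+" text)))

;; Sample data (assumption: shortened copies of the two input files)
(def set1 (line-set "2|www.cnn.com|News|This is CNN\n3|www.msnbc.com|Search|\n5|bing.com|Search|Competitor\n"))
(def set2 (line-set "1|www.google.com|Search|King of search\n2|www.cnn.com|News|This is CNN\n3|www.msnbc.com|Search|New Comment\n"))

(def only-in-1 (set/difference set1 set2)) ; lines present only in set 1
(def only-in-2 (set/difference set2 set1)) ; lines present only in set 2
```

Because whole lines are compared, a record whose comment changed (id 3 here) lands in both differences. Telling "missing" apart from "different" still requires comparing the ids, as the other answers do.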

Answer 3

Here is my attempt at a functional approach to this problem:

Create 2 maps, one for each file
Find the missing ids using dissoc
Find the differing ids using intersection and filter

Code

(ns diffset
  (:use [clojure.contrib.duck-streams]
        [clojure.set]))

(def file1 "bookmarks.csv")
(def file2 "bookmarks2.csv")

(defn split-record
  "split line into (id, bookmark)"
  [line]
  (map #(apply str %)
       (split-with #(not (= % \|)) line)))

(defn map-from-file
  "create initial map from file f"
  [f]
  (with-open [r (reader f)]
    (doall (apply hash-map (apply concat (map split-record
                                              (line-seq r)))))))

(defn missing
  "return seq of all ids in x that are not in y"
  [x y]
  (keys (apply dissoc x (keys y))))

(defn different
  "return seq of all ids that match but have different bookmark string"
  [x y]
  (let [match-keys (intersection (set (keys x)) (set (keys y)))]
    (filter #(not (= (get x %)
                     (get y %)))
            match-keys)))

(defn diff
  "print out differences between two bookmark files"
  [file1 file2]
  (let [[s1 s2] (map map-from-file [file1 file2])]
    (dorun (map #(println (format "Id #%s is missing in set #1" %))
                (missing s2 s1)))
    (dorun (map #(println (format "Id #%s is missing in set #2" %))
                (missing s1 s2)))
    (dorun (map #(println (format "Id #%s is different:" %) "\n"
                          " ->" (get s1 %) "\n"
                          " ->" (get s2 %)) (different s1 s2)))))

Results

user> (use 'diffset)
nil
user> (diff file1 file2)
Id #1 is missing in set #1
Id #5 is missing in set #2
Id #3 is different: 
  -> |www.msnbc.com|Search| 
  -> |www.msnbc.com|Search|New Comment
nil

Answer 4

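Put the first data set into a dictionary (hash table), with the id as the key.

Read the next data set line by line and look up each id in the hash.

If the id is not in the hash, output: id is missing in set 1
If the value in the hash is different, output: id is different
Store the id in a second hash table

Then run through the keys of the first hash table.

Check whether they are also in the second hash table. If not, output: id is missing in set 2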
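
The steps above could be sketched in Clojure roughly like this. It is one possible reading of the description, not a definitive implementation: `records` and `diff-report` are hypothetical names, the inputs are the file contents as strings (for example from `slurp`), and the report is returned as a vector of messages instead of being printed.

```clojure
(require '[clojure.string :as str])

(defn records
  "Seq of [id rest-of-line] pairs, one per non-blank line of text."
  [text]
  (->> (str/split-lines text)
       (remove str/blank?)
       ;; limit 2: split only at the first |, so the id is the key
       (map #(str/split % #"\|" 2))))

(defn diff-report
  "Vector of messages describing how text2 differs from text1."
  [text1 text2]
  (let [hash1 (into {} (records text1))
        ;; Pass over set 2 in file order: report missing or changed ids
        ;; and remember every id seen (the 'second hash table').
        [seen report]
        (reduce (fn [[seen report] [id value]]
                  [(conj seen id)
                   (cond
                     (not (contains? hash1 id))
                     (conj report (str "Id #" id " is missing in set #1"))

                     (not= (hash1 id) value)
                     (conj report (str "Id #" id " is different"))

                     :else report)])
                [#{} []]
                (records text2))]
    ;; Pass over the keys of set 1: any id never seen is missing in set 2.
    ;; Keys are sorted (as strings) only to make the order deterministic.
    (into report
          (for [id (sort (keys hash1))
                :when (not (seen id))]
            (str "Id #" id " is missing in set #2")))))
```

Calling `(diff-report (slurp "bookmarks.csv") (slurp "bookmarks2.csv"))` would then give the messages for the two files in the question. Only the first data set has to be held in a map; the second is consumed one record at a time.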

How do I find the differences between 2 data sets?
