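For example, suppose I have two pipe-delimited files containing bookmark data. How can I read in the data and then determine the differences between the two sets?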
Set #1:
2|www.cnn.com|News|This is CNN
3|www.msnbc.com|Search|
4|news.ycombinator.com|News|Tech News
5|bing.com|Search|Competitor
Set #2:
1|www.google.com|Search|King of Search
2|www.cnn.com|News|This is CNN
3|www.msnbc.com|Search|New Comment
4|news.ycombinator.com|News|Tech News
Desired output:
Id #1 is missing in set #1
Id #5 is missing in set #2
Id #3 is different:
  -> www.msnbc.com|Search|
  -> www.msnbc.com|Search|New Comment
Answer 0 (score: 5)
(use '[clojure.contrib str-utils duck-streams pprint]
     '[clojure set])

(defn read-bookmarks [filename]
  (apply hash-map
         (mapcat #(re-split #"\|" % 2)
                 (read-lines filename))))

(defn diff-bookmarks [filename1 filename2]
  (let [f1 (read-bookmarks filename1)
        f2 (read-bookmarks filename2)
        k1 (set (keys f1))
        k2 (set (keys f2))
        missing-in-1 (difference k2 k1)
        missing-in-2 (difference k1 k2)
        present-but-different (filter #(not= (f1 %) (f2 %))
                                      (intersection k1 k2))]
    (cl-format nil
               "~{Id #~a is missing in set #1~%~}~{Id #~a is missing in set #2~%~}~{~{Id #~a is different~% -> ~a~% -> ~a~%~}~}"
               missing-in-1
               missing-in-2
               (map #(list % (f1 %) (f2 %)) present-but-different))))

(print (diff-bookmarks "bookmarks.csv" "bookmarks2.csv"))
Answer 1 (score: 3)
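Split the lines with a regexp and build a set from the results with (set (re-seq ...)), then call (difference set1 set2) to find the items that are in set 1 but not in set 2. Swap the arguments to find the items in set 2 that are not in set 1.
See http://clojure.org/data_structures for more information on Clojure sets.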
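The set-based idea above can be sketched roughly as follows. This is only an illustration, not the answerer's code: it assumes a current Clojure with `clojure.set`, uses inline vectors (`set1-lines`, `set2-lines`) in place of the two files, and introduces a hypothetical helper `line-id` that treats everything before the first pipe as the id.

```clojure
(require '[clojure.set :as set])

;; Hypothetical inline data standing in for the two files in the question.
(def set1-lines ["2|www.cnn.com|News|This is CNN"
                 "3|www.msnbc.com|Search|"
                 "4|news.ycombinator.com|News|Tech News"
                 "5|bing.com|Search|Competitor"])

(def set2-lines ["1|www.google.com|Search|King of Search"
                 "2|www.cnn.com|News|This is CNN"
                 "3|www.msnbc.com|Search|New Comment"
                 "4|news.ycombinator.com|News|Tech News"])

(defn line-id
  "Return the text before the first pipe (assumed to be the record id)."
  [line]
  (re-find #"^[^|]*" line))

(def ids1 (set (map line-id set1-lines)))
(def ids2 (set (map line-id set2-lines)))

(println "In set 1 only:" (set/difference ids1 ids2)) ; prints In set 1 only: #{5}
(println "In set 2 only:" (set/difference ids2 ids1)) ; prints In set 2 only: #{1}
```

Comparing ids alone finds missing records but not changed ones. Building the sets from whole lines instead would also surface record 3, at the cost of not distinguishing "changed" from "missing".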
Answer 2 (score: 3)
Here is my attempt at a functional approach to this problem, using dissoc, intersection, and filter:
(ns diffset
  (:use [clojure.contrib.duck-streams]
        [clojure.set]))

(def file1 "bookmarks.csv")
(def file2 "bookmarks2.csv")

(defn split-record
  "split line into (id, bookmark)"
  [line]
  (map #(apply str %)
       (split-with #(not (= % \|)) line)))

(defn map-from-file
  "create initial map from file f"
  [f]
  (with-open [r (reader f)]
    (doall (apply hash-map (apply concat (map split-record
                                              (line-seq r)))))))

(defn missing
  "return seq of all ids in x that are not in y"
  [x y]
  (keys (apply dissoc x (keys y))))

(defn different
  "return seq of all ids that match but have different bookmark string"
  [x y]
  (let [match-keys (intersection (set (keys x)) (set (keys y)))]
    (filter #(not (= (get x %)
                     (get y %)))
            match-keys)))

(defn diff
  "print out differences between two bookmark files"
  [file1 file2]
  (let [[s1 s2] (map map-from-file [file1 file2])]
    (dorun (map #(println (format "Id #%s is missing in set #1" %))
                (missing s2 s1)))
    (dorun (map #(println (format "Id #%s is missing in set #2" %))
                (missing s1 s2)))
    (dorun (map #(println (format "Id #%s is different:" %) "\n"
                          " ->" (get s1 %) "\n"
                          " ->" (get s2 %))
                (different s1 s2)))))
user> (use 'diffset)
nil
user> (diff file1 file2)
Id #1 is missing in set #1
Id #5 is missing in set #2
Id #3 is different:
-> |www.msnbc.com|Search|
-> |www.msnbc.com|Search|New Comment
nil
Answer 3 (score: 1)
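Put the first data set into a dictionary (hash table), with the id as the key.
Read the next data set line by line, looking up each id in the hash.
Then run through the keys of the first hash table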
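The steps above might look something like this in Clojure. This is a sketch under stated assumptions rather than the answerer's implementation: `parse-line` and `hash-diff` are hypothetical names, every line is assumed to contain at least one pipe, and since the answer does not say what the final pass over the first table's keys is for, the sketch assumes it reports ids that never appeared in the second data set.

```clojure
(require '[clojure.string :as str])

(defn parse-line
  "Split a line into [id rest] at the first pipe (assumed record layout)."
  [line]
  (str/split line #"\|" 2))

(defn hash-diff
  "Hypothetical helper: returns a vector of report lines for two seqs of lines."
  [lines1 lines2]
  (let [;; Step 1: first data set into a hash map, id -> rest of the record.
        table (into {} (map parse-line lines1))
        ;; Step 2: walk the second data set, looking each id up in the table.
        step2 (for [[id data] (map parse-line lines2)
                    :let [old (get table id ::none)]
                    :when (not= old data)]
                (if (= old ::none)
                  (str "Id #" id " is missing in set #1")
                  (str "Id #" id " is different: " old " -> " data)))
        seen  (set (map (comp first parse-line) lines2))
        ;; Step 3 (assumed meaning): keys of the first table never seen in set 2.
        step3 (for [id (keys table) :when (not (seen id))]
                (str "Id #" id " is missing in set #2"))]
    (vec (concat step2 step3))))

;; Usage with the data from the question, inlined instead of read from files.
(doseq [line (hash-diff ["2|www.cnn.com|News|This is CNN"
                         "3|www.msnbc.com|Search|"
                         "4|news.ycombinator.com|News|Tech News"
                         "5|bing.com|Search|Competitor"]
                        ["1|www.google.com|Search|King of Search"
                         "2|www.cnn.com|News|This is CNN"
                         "3|www.msnbc.com|Search|New Comment"
                         "4|news.ycombinator.com|News|Tech News"])]
  (println line))
```

Only the first data set has to be held in a map; the second can be streamed line by line, which may matter if the files are large.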