Efficient duplicate checker in Clojure

Date: 2014-07-15 02:31:21

Tags: clojure, performance

I'm looking for a way to efficiently find duplicate files (by md5) in a directory and build a collection of maps, each with a ":unique" file and a vector of ":other" files. My code below does this for 2919 files in about 46 seconds (45948 ms).

This code works, but there must be a faster way. How can I change the code to get better performance?

;; `d` and `f` below are namespace aliases not shown in the original post,
;; presumably along the lines of (:require [digest :as d] [clojure.java.io :as f]).
(def extensions [".mp3" ".wav" ".mp4" ".flac" ".aac"])

(defn valid?
  "Returns true when the file is not a directory and its name ends in one of the specified extensions."
  [file]
  (and (not (.isDirectory file))
       (some true? (map #(.endsWith (.getName file) %) extensions))))

(defn file->file+hash
  "Returns a map of the file's absolute path and its md5 hash."
  [file]
  {:hash (d/md5 file) :path (.getAbsolutePath file)})

(defn split
  "Labels the first entry of an equal-hash group as :unique and the rest as :other."
  [[x & more]]
  {:unique (:path x) :other (vec (map :path more))})

(defn get-dictionary
  "Returns a seq of maps, each of which contains a ':unique' file and a vector of ':other' files."
  [file-directory]
  (let [files (filter valid? (file-seq (f/file file-directory)))]
    (map split (vals (group-by :hash (pmap file->file+hash files))))))

(def location "/home/matt/Music/Playlists")
(prn (str "Files: " (count (file-seq (f/file location)))))
(time (get-dictionary location))

"Files: 2919"
"Elapsed time: 45948.444212 msecs"

1 answer:

Answer 0 (score: 1)

You could try an initial comparison on something you already know, such as the file size, rather than the hash. If you expect most of the files to be unique, this can save a great deal of time otherwise spent computing hashes for every file.
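
For illustration, here is a minimal, untested sketch of that idea. It reuses `valid?`, `file->file+hash`, `split`, and the `f/file` alias from the question; the name `get-dictionary-fast` is made up. Files are first grouped by their size (`.length`), and only the groups that actually collide on size get hashed:

;; Sketch only: `get-dictionary-fast` is a hypothetical name; `valid?`,
;; `file->file+hash`, `split`, and the `f/file` alias come from the question.
(defn get-dictionary-fast [file-directory]
  (let [files       (filter valid? (file-seq (f/file file-directory)))
        size-groups (vals (group-by (fn [^java.io.File file] (.length file)) files))
        ;; a file whose size is unique cannot have a duplicate, so it is never hashed
        singles     (map (fn [[file]] {:unique (.getAbsolutePath file) :other []})
                         (filter #(= 1 (count %)) size-groups))
        ;; only files that collide on size are hashed, then grouped as before
        collisions  (mapcat #(vals (group-by :hash (pmap file->file+hash %)))
                            (remove #(= 1 (count %)) size-groups))]
    (concat singles (map split collisions))))

If most files really are unique, most of the work should then be a single directory walk plus a cheap `.length` call per file, with only the occasional size collision paying the md5 cost.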