Question

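I'm looking for a way to efficiently identify duplicate files (by MD5) in a directory and build a collection of maps, each with a ":unique" file and a vector of ":other" files. My code below does this for 2919 files in about 46 seconds (45948 ms).

This code works, but there must be a faster way. How can I change the code to get better performance?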

(def extensions [".mp3" ".wav" ".mp4" ".flac" ".aac"])

(defn valid?
  "returns true when the file is not a directory and ends in one of the specified extensions"
  [file]
  (and (not (.isDirectory file))
       (some #(.endsWith (.getName file) %) extensions)))

;; d/md5 and f/file come from namespaces aliased as `d` and `f`;
;; their require forms are not shown in the question.
(defn file->file+hash
  "returns a map of the filepath and the file's md5"
  [file]
  {:hash (d/md5 file) :path (.getAbsolutePath file)})

(defn split [[x & more]]
  {:unique (:path x) :other (mapv :path more)})

(defn get-dictionary
  "returns a sequence of maps, each of which contains a ':unique' file and a vector of ':other' files"
  [file-directory]
  (let [files (filter valid? (file-seq (f/file file-directory)))]
    ;; group-by is eager, so every file is hashed before this returns
    (map split (vals (group-by :hash (pmap file->file+hash files))))))

(def location "/home/matt/Music/Playlists")
(prn (str "Files: " (count (file-seq (f/file location)))))
(time (get-dictionary location))

"Files: 2919"
"Elapsed time: 45948.444212 msecs"

Answer 1

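You could start with a comparison on something that is already known and cheap to read, such as file size, instead of the hash. If you expect most files to be unique, this can save a lot of the time otherwise spent computing hashes for every file.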
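
A minimal sketch of that idea, not a tuned implementation: files are first grouped by size, and only groups containing more than one file are hashed. The namespace name `dedupe.sketch` and the functions `md5-hex` and `duplicate-groups` are hypothetical names made up for this example. `md5-hex` is a plain `java.security.MessageDigest` stand-in for the question's `d/md5`, whose library is not shown.

```clojure
(ns dedupe.sketch ; hypothetical namespace, for illustration only
  (:require [clojure.java.io :as io])
  (:import (java.io File InputStream)
           (java.security MessageDigest)))

(defn md5-hex
  "Returns the MD5 of a file's contents as a 32-character hex string.
  Assumed stand-in for the question's d/md5."
  [^File file]
  (let [digest (MessageDigest/getInstance "MD5")
        buffer (byte-array 65536)]
    (with-open [^InputStream in (io/input-stream file)]
      (loop []
        (let [n (.read in buffer)]
          (when (pos? n)            ; read returns -1 at end of stream
            (.update digest buffer 0 n)
            (recur)))))
    (format "%032x" (BigInteger. 1 (.digest digest)))))

(defn duplicate-groups
  "Returns a seq of {:unique path :other [paths]} maps for `files`.
  Files whose size is unique are never hashed."
  [files]
  (let [by-hash (fn [same-size]
                  (if (next same-size)
                    ;; two or more files share this size: hash to confirm
                    (vals (group-by md5-hex same-size))
                    ;; the size is unique, so the file cannot have a duplicate
                    [same-size]))
        groups  (->> files
                     (group-by #(.length ^File %))
                     vals
                     (mapcat by-hash))]
    (for [[x & more] groups]
      {:unique (.getAbsolutePath ^File x)
       :other  (mapv #(.getAbsolutePath ^File %) more)})))
```

The result has the same shape as `get-dictionary` in the question, so it could be called on the output of `(filter valid? (file-seq ...))`. Equal size is only a pre-filter: files of the same size are still compared by hash, so the grouping itself does not change. How much time this saves depends on how many files share a size, and `mapcat` could be swapped for a `pmap`-based version if hashing the remaining candidates is still the bottleneck.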

Efficient duplicate checker in Clojure
