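I'm looking for a way to efficiently identify duplicate files (by md5) in a directory and build a collection of maps, each holding a ":unique" file and a vector of ":other" files. The code below does this for 2919 files in about 46 seconds (45948 ms).
This code works, but there must be a faster way. How can I change the code to get better performance?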
;; `d` and `f` are namespace aliases whose require form is not shown here;
;; `f/file` is presumably clojure.java.io/file and `d/md5` a file-hashing helper.
(def extensions [".mp3" ".wav" ".mp4" ".flac" ".aac"])

(defn valid?
  "Returns true when the file is not a directory and ends in one of the specified extensions."
  [file]
  (and (not (.isDirectory file))
       (some #(.endsWith (.getName file) %) extensions)))

(defn file->file+hash
  "Returns a map of the file's absolute path and its md5."
  [file]
  {:hash (d/md5 file) :path (.getAbsolutePath file)})

(defn split
  "Treats the first file of a same-hash group as :unique and the rest as :other."
  [[x & more]]
  {:unique (:path x) :other (mapv :path more)})

(defn get-dictionary
  "Returns a sequence of maps, each containing a :unique file and a vector of :other files."
  [file-directory]
  (let [files (filter valid? (file-seq (f/file file-directory)))]
    (map split (vals (group-by :hash (pmap file->file+hash files))))))

(def location "/home/matt/Music/Playlists")
;; Note: this counts every entry under location, directories included.
(prn (str "Files: " (count (file-seq (f/file location)))))
(time (get-dictionary location))
"Files: 2919"
"Elapsed time: 45948.444212 msecs"
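Answer 0 (score: 1)
You could do an initial comparison on something that is already known, such as the file size, instead of the hash. Two files with different sizes cannot be duplicates, so only files that share a size need to be hashed. If you expect most files to be unique, this saves most of the time otherwise spent computing a hash for every file.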
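A minimal sketch of that idea, assuming the input is a sequence of java.io.File objects that has already been filtered (for example by the question's valid?). The names md5-hex and get-dictionary-by-size are hypothetical, and the JDK's MessageDigest stands in for the question's d/md5 so that the snippet runs without extra dependencies. It should return the same shape as get-dictionary, but it only reads the contents of files whose size collides with another file's.

```clojure
(require '[clojure.java.io :as io])
(import '(java.io File InputStream)
        '(java.security MessageDigest))

(defn md5-hex
  "MD5 of a file's contents as a 32-character lowercase hex string (JDK only)."
  [^File file]
  (let [digest (MessageDigest/getInstance "MD5")
        buffer (byte-array 65536)]
    (with-open [^InputStream in (io/input-stream file)]
      (loop []
        (let [n (.read in buffer)]
          ;; read returns -1 at end of stream
          (when (pos? n)
            (.update digest buffer 0 n)
            (recur)))))
    (format "%032x" (BigInteger. 1 (.digest digest)))))

(defn get-dictionary-by-size
  "Groups files by size first and hashes only the files whose size is shared.
  Returns a sequence of {:unique path :other [paths]} maps."
  [files]
  (let [by-size    (vals (group-by #(.length ^File %) files))
        ;; a file with a size nobody else has cannot have a duplicate
        singles    (map first (remove next by-size))
        ;; only these files are actually read and hashed
        candidates (mapcat identity (filter next by-size))
        hashed     (pmap (fn [^File f]
                           {:hash (md5-hex f) :path (.getAbsolutePath f)})
                         candidates)]
    (concat
     (map (fn [^File f] {:unique (.getAbsolutePath f) :other []}) singles)
     (map (fn [[x & more]] {:unique (:path x) :other (mapv :path more)})
          (vals (group-by :hash hashed))))))
```

It could be called as (get-dictionary-by-size (filter valid? (file-seq (f/file location)))). How much time this saves depends on the data: reading file lengths is cheap compared with hashing, but a collection in which many files share a size will still hash most of them.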