Question

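I just finished reading "Programming Concurrency on the JVM" by Venkat Subramaniam. One of the author's examples computes the total size of the files in a directory tree. He shows implementations that use no concurrency, queues, a latch, and Scala actors. On my system, all of the concurrent implementations (queue, latch, and Scala actors) run in under 9 seconds when traversing my /usr directory (OS X 10.6.8, Core Duo 2 GHz, Intel G1 SSD 160GB).

I am learning Clojure and decided to port the Scala actor version to Clojure using agents. Unfortunately, I was averaging 11-12 seconds, which is noticeably slower than the others. After spending DAYS pulling my hair out, I found that the following code was the culprit (processFile is the function I send to the file-processing agent):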

(defn processFile
  [fileProcessor collectorAgent ^String fileName]
  (let [^File file-obj (File. ^String fileName)
        fileTotals (transient {:files 0, :bytes 0})]
    (cond
      (.isDirectory file-obj)
        (do
          (doseq [^File dir (.listFiles file-obj) :when (.isDirectory dir)]
            (send collectorAgent addFileToProcess (.getPath dir)))
          (send collectorAgent tallyResult *agent*)
          (reduce (fn [currentTotal newItem] (assoc! currentTotal :files (inc (:files currentTotal))
                                                                  :bytes (+ (:bytes currentTotal) newItem)))
                  fileTotals
                  (map #(.length ^File %) (filter #(.isFile ^File %) (.listFiles file-obj))))
          (persistent! fileTotals))

      (.isFile file-obj) (do (send collectorAgent tallyResult *agent*) {:files 1, :bytes (.length file-obj)}))))

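You'll notice that I tried using type hints and a transient to improve performance, all to no avail. I replaced the code above with the following: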

(defn processChildren
  [children]
  (loop [entries children files 0 bytes 0 dirs '()]
    (let [^File child (first entries)]
      (cond
        (not (seq entries)) {:files files, :bytes bytes, :dirs dirs}
        (.isFile child) (recur (rest entries) (inc files) (+ bytes (.length child)) dirs)
        (.isDirectory child) (recur (rest entries) files bytes (conj dirs child))
        :else (recur (rest entries) files bytes dirs)))))

(defn processFile
  [fileProcessor collectorAgent ^String fileName]
  (let [{files :files, bytes :bytes, dirs :dirs} (processChildren (.listFiles (File. fileName)))]
    (doseq [^File dir dirs]
      (send collectorAgent addFileToProcess (.getPath dir)))
    (send collectorAgent tallyResult *agent*)
    {:files files, :bytes bytes}))

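This version performs on par with the Scala version, if not faster, and uses almost the same algorithm as the Scala version. I had simply assumed that a functional approach to the algorithm would work just as well.

So... this long-winded question boils down to the following: why is the second version faster?

My hypothesis is that the first version, which uses map/filter/reduce on the directory contents, is more "functional" than the second version's fairly imperative processing of the directory, but much less efficient because the directory contents are iterated over multiple times. Since filesystem I/O is slow, the whole program suffers.

Assuming I am right, is it safe to say that any recursive filesystem algorithm should prefer an imperative approach for performance?

I am a beginner at Clojure, so feel free to rip my code to shreds if I am doing something stupid or non-idiomatic.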

Answer 1

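I edited the first version to make it more readable. I have a few comments, but no conclusively helpful statements:

You added transients and type hints with no real evidence of what was slowing things down. It is entirely possible to slow things down by applying these operations carelessly, so it is better to profile and find out what is actually slow. Your choices seem reasonable, but I removed the type hints that obviously had no effect (for example, the compiler does not need a hint to know that (File. ...) yields a File object).
Clojure (indeed, any Lisp) strongly prefers some-agent to someAgent. The prefix syntax means there is no worry that - could be parsed as subtraction by a clueless compiler, so we can afford more readable, well-spaced names.
You include calls to a number of functions that you do not define here, such as tallyResult and addFileToProcess. Presumably they perform fine, since you also use them in the fast version, but by leaving them out you have made it hard for anyone else to poke around and see what speeds things up.
Consider send-off instead of send for I/O-bound operations: send uses a bounded thread pool to avoid swamping your processor. It probably does not matter here, since you are using only one agent and it serializes its actions, but in the future you will run into cases where it does matter.

Anyway, as promised, a more legible rewrite of your first version: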
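To make the send-off suggestion concrete, here is a minimal sketch. The function slow-io and the paths are hypothetical stand-ins, not part of the original code; the Thread/sleep call only simulates a blocking call such as .listFiles.

```clojure
;; Hypothetical action: pretends to do blocking I/O, then records the path.
(defn slow-io [state path]
  (Thread/sleep 50)          ; stand-in for a blocking filesystem call
  (conj state path))

(def io-agent (agent []))

;; send-off dispatches the action to the expandable thread pool meant for
;; blocking work; send would use the fixed-size pool meant for CPU-bound work.
(send-off io-agent slow-io "/usr/bin")
(send-off io-agent slow-io "/usr/lib")

;; await blocks until the actions dispatched so far from this thread have run.
(await io-agent)
@io-agent ;; => ["/usr/bin" "/usr/lib"]
```

Actions sent to one agent from the same thread run in the order they were sent, so the result is deterministic here. In a standalone script, calling (shutdown-agents) at the end lets the JVM exit promptly.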

(defn process-file
  [_ collector-agent ^String file-name]
  (let [file-obj (File. file-name)
        file-totals (transient {:files 0, :bytes 0})]
    (cond (.isDirectory file-obj)
          (do
            (doseq [^File dir (.listFiles file-obj)
                    :when (.isDirectory dir)]
              (send collector-agent addFileToProcess
                    (.getPath dir)))
            (send collector-agent tallyResult *agent*)
            (reduce (fn [current-total new-item]
                      (assoc! current-total
                              :files (inc (:files current-total))
                              :bytes (+ (:bytes current-total) new-item)))
                    file-totals
                    (map #(.length ^File %)
                         (filter #(.isFile ^File %)
                                 (.listFiles file-obj))))
            (persistent! file-totals))

          (.isFile file-obj)
          (do (send collector-agent tallyResult *agent*)
              {:files 1, :bytes (.length file-obj)}))))

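Edit: You are using transients incorrectly, because you throw away the result of your reduce. (assoc! m k v) is allowed to modify and return the m object, but it may return a different object if that is more convenient or efficient. So you need something more like (persistent! (reduce ...)).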
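As a minimal sketch of that pattern (tally-sizes is a hypothetical helper, not a function from the question), the value returned by each assoc! is threaded through reduce, and persistent! is called on the result of the reduce rather than on the original transient:

```clojure
(defn tally-sizes
  "Sums a seq of file sizes into {:files n, :bytes total} using a transient map."
  [sizes]
  (persistent!
    (reduce (fn [totals size]
              ;; Always use the value returned by assoc!, never the old binding.
              (assoc! totals
                      :files (inc (:files totals))
                      :bytes (+ (:bytes totals) size)))
            (transient {:files 0, :bytes 0})
            sizes)))

(tally-sizes [10 20 30]) ;; => {:files 3, :bytes 60}
```

In process-file, the sizes argument would correspond to the (map #(.length ^File %) ...) sequence, and the map returned by tally-sizes would be the function's return value.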

Should recursive filesystem algorithms be handled in an imperative style?
