Question

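Suppose I have 4 folders, each containing 25 folders. Each of those 25 folders contains 20 folders, and each of those holds 1 very long text file. The approach I'm using now seems to have room for improvement, and in every scenario where I implemented Ruby threads the result was slower than before. I have an array of 54 folders. I iterate over each one and use the foreach method to get the deeply nested files. Inside the foreach loop I do 3 things: I get the contents of today's file, I get the contents of yesterday's file, and I use my diff algorithm to find what changed from yesterday to today. How would you make this faster with threads?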

def backup_differ_loop device_name

  device_name.strip!
  Dir.foreach("X:/Backups/#{device_name}/#{@today}") do |backup|

    if backup != "." and backup != ".."
      @today_filename = "X:/Backups/#{device_name}/#{@today}/#{backup}"
      @yesterday_filename = "X:/Backups/#{device_name}/#{@yesterday}/#{backup.gsub(@today, @yesterday)}"

      if File.exist?(@yesterday_filename)
        today_backup_content = File.read(@today_filename)
        yesterday_backup_content = File.read(@yesterday_filename)

        begin
          Diffy::Diff.new(yesterday_backup_content, today_backup_content, :include_plus_and_minus_in_html => true, :context => 1).to_s(:html)
        rescue
          # do nothing, just continue
        end

      else
        # file not found
      end

    end

  end

end

Answer 1

The first part of your logic finds all the files in a particular folder. Instead of calling Dir.foreach and then checking for "." and "..", you can do it in one line:

files = Dir.glob("X:/Backups/#{device_name}/#{@today}/*").select { |item| File.file?(item)}

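Notice the /* at the end? That searches one level deep (inside the @today folder). If you also want to search subfolders, replace it with /**/*, which gives you an array of all the files in all subfolders of @today.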
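To make the difference between the two patterns concrete, here is a small sketch that builds a throwaway directory tree and runs both globs against it. The folder and file names, and the files_matching helper, are made up for illustration only:

```ruby
require "tmpdir"
require "fileutils"

# Returns only regular files matching the glob pattern, skipping directories.
def files_matching(pattern)
  Dir.glob(pattern).select { |item| File.file?(item) }.sort
end

Dir.mktmpdir do |root|
  # Hypothetical layout: one file at the top level, one inside a subfolder.
  FileUtils.mkdir_p(File.join(root, "sub"))
  File.write(File.join(root, "top.txt"), "a")
  File.write(File.join(root, "sub", "nested.txt"), "b")

  one_level = files_matching("#{root}/*")    # top level only
  recursive = files_matching("#{root}/**/*") # top level plus all subfolders

  puts one_level.map { |f| File.basename(f) }.inspect
  puts recursive.map { |f| File.basename(f) }.inspect
end
```

The first pattern should list only the top-level file, while the second should also pick up the file in the subfolder.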

So I would start with a method that returns a two-dimensional array, where each element is a pair of matching files:

def get_matching_files(device_name)
  matching_files = []

  Dir.glob("X:/Backups/#{device_name}/#{@today}/*").select { |item| File.file?(item) }.each do |backup|
    today_filename = File.absolute_path(backup) # converts to an absolute path, should give you X:/Backups/...
    # glob returns full paths, so build yesterday's name from the base name only
    yesterday_filename = "X:/Backups/#{device_name}/#{@yesterday}/#{File.basename(backup).gsub(@today, @yesterday)}"

    if File.exist?(yesterday_filename)
      matching_files << [today_filename, yesterday_filename]
    end
  end

  return matching_files
end

And call it:

matching_files = get_matching_files(device_name)

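Now we can start multithreading, which is probably where things may slow down. I would first put all the file pairs from the matching_files array into a queue, then start 5 threads that keep working until the queue is empty: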

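queue = Queue.new
matching_files.each { |pair| queue << pair }

# 5 being the number of threads
5.times.map do
  Thread.new do
    loop do
      begin
        # Non-blocking pop: raises ThreadError once the queue is empty,
        # so no thread gets stuck waiting on a queue another thread drained
        today_filename, yesterday_filename = queue.pop(true)
      rescue ThreadError
        break
      end

      begin
        today_backup_content = File.read(today_filename)
        yesterday_backup_content = File.read(yesterday_filename)
        Diffy::Diff.new(yesterday_backup_content, today_backup_content, :include_plus_and_minus_in_html => true, :context => 1).to_s(:html)
      rescue
        # do nothing, just continue
      end
    end
  end
end.each(&:join)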

I can't guarantee that my code works, because I don't have the full context of your program. I hope it gives you some ideas.

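And the MOST important thing: the standard implementation of Ruby (MRI) runs only 1 thread of Ruby code at a time because of its global interpreter lock. That means that even if you implement the code above, you won't see a significant performance difference. So use Rubinius or JRuby, which allow more than 1 thread to run at a time. Or, if you prefer to stay with standard MRI Ruby, you'll need to restructure your code (you can keep your original version) and start multiple processes. You would just need something like a shared database where you store matching_files (one pair per row, for example); each time a process 'takes' something from that database, it marks that row as 'used'. I think SQLite is a good database for this because it is thread-safe by default.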
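As a rough illustration of the multi-process idea, here is a sketch of a simpler alternative to the shared database: split the list of pairs up front and hand each process its own slice. This is not the database approach described above, just a static partition. The names partition_work, run_workers and the worker script are hypothetical; the worker script is assumed to read the JSON file named in its first argument and diff every pair listed there.

```ruby
require "rbconfig"
require "json"
require "tempfile"

# Splits the list of [today, yesterday] pairs into at most worker_count
# slices of roughly equal size, one slice per process.
def partition_work(pairs, worker_count)
  return [] if pairs.empty?
  slice_size = (pairs.size / worker_count.to_f).ceil
  pairs.each_slice(slice_size).to_a
end

# Starts one Ruby process per slice and waits for all of them to finish.
# worker_script is a hypothetical script path (an assumption, not part of
# the original answer).
def run_workers(slices, worker_script)
  # Write each slice to its own temporary JSON job file.
  job_files = slices.map do |slice|
    file = Tempfile.new(["pairs", ".json"])
    file.write(JSON.generate(slice))
    file.close
    file
  end
  pids = job_files.map { |file| Process.spawn(RbConfig.ruby, worker_script, file.path) }
  pids.each { |pid| Process.wait(pid) }
ensure
  # Remove the job files once every worker has exited.
  job_files&.each(&:unlink)
end
```

Something like run_workers(partition_work(matching_files, 4), "diff_worker.rb") would then start up to 4 processes. A static split is easier to set up than a shared database, but it balances load less well if some files take much longer to diff than others.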

Answer 2

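Most Ruby implementations don't have "true" multi-core threading, i.e. threads won't give you any performance boost because the interpreter can only run one thread at a time. This is especially true for an application like yours with lots of disk IO. In fact, even with real multithreading your application might be IO-bound and still not see much improvement.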

You're more likely to get results by finding some inefficient algorithm in your code and improving it.
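One way to find out where the time goes is to time the file reads and the diff separately. The sketch below uses Ruby's standard Benchmark module; the time_read_and_diff helper is a hypothetical name, and the block stands in for whatever diff call you actually use.

```ruby
require "benchmark"

# Times reading both files and running the diff as two separate steps.
# The diff itself is supplied as a block so any diff library can be plugged in.
def time_read_and_diff(today_filename, yesterday_filename)
  today_content = yesterday_content = nil

  read_seconds = Benchmark.realtime do
    today_content = File.read(today_filename)
    yesterday_content = File.read(yesterday_filename)
  end

  diff_seconds = Benchmark.realtime do
    yield yesterday_content, today_content
  end

  { read: read_seconds, diff: diff_seconds }
end
```

Called with a block such as { |old, new| Diffy::Diff.new(old, new, :context => 1).to_s(:html) }, this returns the seconds spent on each step. If the diff dominates, the algorithm is the place to look; if the reads dominate, the job is probably IO-bound.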

Ruby threads vs. normal
