Question

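I need to upload a bunch of files from a directory to S3. Since more than 90% of the upload time is spent waiting for the HTTP request to finish, I want to run several uploads at once somehow.

Can Fibers help me with this? They are described as a way to solve this sort of problem, but I can't think of any way to do other work while an HTTP call is blocking.

Is there any way I can solve this problem without threads?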

Answer 1

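I haven't used Fibers in 1.9, but regular Threads from 1.8.6 can solve this problem. Try using a Queue: http://ruby-doc.org/stdlib/libdoc/thread/rdoc/classes/Queue.html

Looking at the example in the documentation, your consumer is the part that does the upload. It "consumes" a URL and a file, and uploads the data. The producer is the part of your program that keeps working and finds new files to upload.

If you want to upload multiple files at once, simply launch a new Thread for each file: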

t = Thread.new do
  upload_file(param1, param2)
end
@all_threads << t

Then, later on in your "producer" code (which, remember, doesn't have to be in its own Thread; it could be the main program):

@all_threads.each do |t|
  t.join if t.alive?
end

The Queue can be a @member_variable or a $global.
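For what it's worth, here is a rough sketch of how the Queue-based producer/consumer idea might look with a fixed number of worker threads instead of one thread per file. The names upload_file, upload_all and the :done marker are made up for illustration, and upload_file only sleeps in place of a real S3 call.

```ruby
require 'thread' # needed for Queue on old Rubies such as 1.8; harmless on newer ones

# Hypothetical stand-in for the real S3 upload; replace with your own code.
def upload_file(path)
  sleep(0.05) # simulate time spent waiting on the HTTP request
  "uploaded #{path}"
end

# Upload +paths+ using +worker_count+ consumer threads fed by a Queue.
def upload_all(paths, worker_count = 4)
  queue   = Queue.new
  results = Queue.new # Queue is thread-safe, so workers can push results to it

  # Producer: here simply the calling thread. It enqueues every path,
  # then one :done marker per worker so each worker knows when to stop.
  paths.each { |path| queue << path }
  worker_count.times { queue << :done }

  # Consumers: each thread pops paths until it sees a :done marker.
  workers = Array.new(worker_count) do
    Thread.new do
      while (path = queue.pop) != :done
        results << upload_file(path)
      end
    end
  end

  workers.each { |t| t.join }

  # Drain the results queue into a plain array.
  uploaded = []
  uploaded << results.pop until results.empty?
  uploaded
end
```

Calling upload_all(Dir["some_dir/*"], 5) would then keep at most five uploads in flight at a time. The order of the returned results is not guaranteed.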

Answer 2

You can use separate processes for this instead of threads:

#!/usr/bin/env ruby

$stderr.sync = true

# Number of children to use for uploading
MAX_CHILDREN = 5

# Hash of PIDs for children that are working along with which file
# they're working on.
@child_pids = {}

# Keep track of uploads that failed
@failed_files = []

# Get the list of files to upload as arguments to the program
@files = ARGV


### Wait for a child to finish, adding the file to the list of those
### that failed if the child indicates there was a problem.
def wait_for_child
    $stderr.puts "    waiting for a child to finish..."
    pid, status = Process.waitpid2( 0 )
    file = @child_pids.delete( pid )
    @failed_files << file unless status.success?
end


### Here's where you'd put the particulars of what gets uploaded and
### how. I'm just sleeping for 10 microseconds per byte of file size
### to simulate the upload, then returning either +true+ or +false+
### based on a random factor.
def upload( file )
    bytes = File.size( file )
    sleep( bytes * 0.00001 )
    return rand( 100 ) > 5
end


### Start a child uploading the specified +file+.
def start_child( file )
    if pid = Process.fork
        $stderr.puts "%s: upload started by child %d" % [ file, pid ]
        @child_pids[ pid ] = file
    else
        if upload( file )
            $stderr.puts "%s: done." % [ file ]
            exit 0 # success
        else
            $stderr.puts "%s: failed." % [ file ]
            exit 255
        end
    end
end


until @files.empty?

    # If there are already the maximum number of children running, wait 
    # for one to finish
    wait_for_child() if @child_pids.length >= MAX_CHILDREN

    # Start a new child working on the next file
    start_child( @files.shift )

end


# Now we're just waiting on the final few uploads to finish
wait_for_child() until @child_pids.empty?

if @failed_files.empty?
    exit 0
else
    $stderr.puts "Some files failed to upload:",
        @failed_files.collect {|file| "  #{file}" }
    exit 255
end

Answer 3

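To answer your actual questions:

Can Fibers help me with this?

No, they can't. Jörg W Mittag explains why best.

No, you cannot do concurrency with Fibers. Fibers simply aren't a concurrency construct; they are a control-flow construct, like Exceptions. That's the whole point of Fibers: they never run in parallel, they are cooperative, and they are deterministic. Fibers are coroutines. (In fact, I never understood why they aren't simply called Coroutines.)

The only concurrency construct in Ruby is Thread.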
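To make the "cooperative and deterministic" point concrete, here is a small sketch (the method name run_fibers and the log labels are invented for illustration). Each fiber only runs when something resumes it and only stops when it calls Fiber.yield or finishes, so the interleaving is decided entirely by the order of the resume calls.

```ruby
# Each fiber records its steps in +log+, handing control back with Fiber.yield.
def run_fibers
  log = []

  a = Fiber.new do
    log << "a1"
    Fiber.yield # pause here until someone calls a.resume again
    log << "a2"
  end

  b = Fiber.new do
    log << "b1"
    Fiber.yield
    log << "b2"
  end

  # Nothing runs until it is resumed, and only one fiber runs at a time.
  a.resume
  b.resume
  a.resume
  b.resume
  log
end

run_fibers # => ["a1", "b1", "a2", "b2"]
```

If one of those blocks made a blocking HTTP call instead of calling Fiber.yield, and nothing such as an event loop were driving the fibers, the whole thread would simply sit in that call and no other fiber would get a turn.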

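When he says that the only concurrency construct in Ruby is Thread, remember that there are many different implementations of Ruby and that their threading implementations vary. Jörg once again provides a great answer covering these differences, and correctly concludes that only something like JRuby (which uses JVM threads mapped to native threads) or forking your process lets you achieve true parallelism.

Is there any way I can solve this problem without threads?

Other than forking your process, I would also suggest that you look at EventMachine and something like em-http-request. It's an event-driven, non-blocking HTTP client based on the reactor pattern that is asynchronous and does not incur the overhead of threads.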
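To give a feel for the reactor idea without pulling in any gems, here is a bare-bones sketch built only on IO.select from the standard library. This is not EventMachine's or em-http-request's API; the react method and its label-to-IO hash are invented for illustration, and a real HTTP client would also need to handle writes, timeouts and errors.

```ruby
# Read from several IO objects in one thread, reacting to whichever is ready.
# +sources+ maps a label to a readable IO; returns label => data received.
def react(sources)
  buffers = {}
  sources.each_key { |label| buffers[label] = String.new }
  pending = sources.invert # IO => label, for the IOs that are still open

  until pending.empty?
    # Block until at least one IO has data or has reached end of file.
    ready, = IO.select(pending.keys)
    ready.each do |io|
      begin
        buffers[pending[io]] << io.read_nonblock(4096)
      rescue IO::WaitReadable
        # Nothing to read after all; try again on the next pass.
      rescue EOFError
        io.close
        pending.delete(io)
      end
    end
  end
  buffers
end

# Usage with two pipes standing in for two network connections:
r1, w1 = IO.pipe
r2, w2 = IO.pipe
w1.write("first")
w1.close
w2.write("second")
w2.close
react("one" => r1, "two" => r2)
```

The single loop waits on all connections at once, which is roughly what a reactor does for you behind a callback-based interface.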

Answer 4

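Aaron Patterson (@tenderlove) uses an example almost exactly like yours to describe why threads work well for this kind of problem.

Most I/O libraries are now smart enough to release the GVL (Global VM Lock, or what most people think of as the GIL or Global Interpreter Lock) when doing IO. There is a simple function call in C to do this. You don't need to worry about the C code, but for you this means that most IO libraries worth their salt will release the GVL and allow other threads to execute while the thread doing the IO waits for the data to return.

If what I just said was confusing, you don't need to worry about it too much. The main thing you need to know is that if you are using a decent library to do your HTTP requests (or any other I/O operation: database, inter-process communication, whatever), the Ruby interpreter (MRI) is smart enough to release the lock on the interpreter and allow other threads to execute while one thread waits for IO to return. If the next thread has its own IO to grab, the Ruby interpreter will do the same thing for it (assuming the IO library is built to take advantage of this feature of Ruby, which I believe most are these days).

So, to sum up what I'm saying: use threads! You should see a performance benefit. If not, check whether your HTTP library uses the rb_thread_blocking_region() function in C, and if it doesn't, find out why not. Maybe there is a good reason, or maybe you need to consider using a better library.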
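If you want to see the effect for yourself, a quick experiment along these lines may help. It uses sleep as a stand-in for a blocking HTTP request, on the assumption that your HTTP library releases the GVL while waiting the same way sleep does. The names fake_request and timed_threaded_run are made up for this sketch.

```ruby
# Stand-in for a blocking HTTP request (hypothetical helper, not a real upload).
# sleep lets other threads run while this one waits.
def fake_request(seconds)
  sleep(seconds)
end

# Run +count+ fake requests, one thread each, and return the elapsed seconds.
def timed_threaded_run(count, seconds)
  started = Time.now
  threads = Array.new(count) { Thread.new { fake_request(seconds) } }
  threads.each { |t| t.join }
  Time.now - started
end

elapsed = timed_threaded_run(5, 0.2)
puts "5 requests of 0.2s each took %.2fs with threads" % elapsed
```

On MRI the total should come out close to the duration of a single request rather than the sum of all five, because the waits overlap. Swapping fake_request for your real upload call shows whether your library behaves the same way.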

The link to the Aaron Patterson video is here: http://www.youtube.com/watch?v=kufXhNkm5WU

It's worth a watch, even if just for the laughs, since Aaron Patterson is one of the funniest people on the internet.

How to use Ruby fibers to avoid blocking IO
