Question

我正在为python hadoop流寻找一个很好的模式，它涉及加载一个昂贵的资源，例如服务器上的pickle python对象。这就是我想出来的;我已经通过管道输入文件和慢速运行的程序直接测试bash中的脚本进行测试，但还没有在hadoop集群上运行它。对于你hadoop向导---我处理io这样可以作为python流媒体工作吗？我想我会在亚马逊上调试一些东西进行测试，但如果有人知道这一点会很好。

您可以通过cat file.txt | the_script或./a_streaming_program | the_script

进行测试

#!/usr/bin/env python

import sys
import time

def resources_for_many_lines():
  # load slow, shared resources here
  # for example, a shared pickle file

  # in this example we use a 1 second sleep to simulate
  # a long data load
  time.sleep(1)

  # we will pretend the value zero is the product
  # of our long slow running import
  resource = 0

  return resource

def score_a_line(line, resources):
  # put fast code to score a single example line here
  # in this example we will return the value of resource + 1
  return resources + 1

def run():
  # here is the code that reads stdin and scores the model over a streaming data set
  resources = resources_for_many_lines()
  while 1:
    # reads a line of input
    line = sys.stdin.readline()

    # ends if pipe closes
    if line == '':
      break

    # scores a line
    print score_a_line(line, resources)
    # prints right away instead of waiting
    sys.stdout.flush()

if __name__ == "__main__":
  run();

Answer 1

这对我来说很好看。我经常在我的映射器中加载yaml或sqlite资源。

你通常不会在你的工作中运行许多映射器，所以即使你花了几秒钟从磁盘加载东西，它通常也不是一个大问题。

Hadoop流媒体---昂贵的共享资源（COOL）

1 个答案: