Question

I am following the tutorial on how to define my own op for TensorFlow in C++.

I want to call sgemm inside my custom TensorFlow C++ op. I am writing two kernels, one for CUDA and one for CPU. How do I call sgemm in each case? Or is there a generic way that works for both?

I tried this snippet but could not get it to work because of missing include files (see here):

auto dev_ctx = context->op_device_context();
auto* dev_stream = dev_ctx->stream();
OP_REQUIRES(context, dev_stream, errors::Internal("No stream available."));

bool blas_launch_status =
    dev_stream
         ->ThenBlasGemm(...

Also, I am not sure whether this is generic or CUDA-only.

Is this documented anywhere?

How do I call cublasSgemm in the GPU/CUDA implementation? Or rather, how do I get a cublasHandle_t?

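I searched the TF code a bit, and class CUDABlas seems to provide wrappers around the cuBLAS functions. Do I need to use it, or can I call cublasSgemm directly? I suppose I need the wrapper, since it keeps the CUDA stream executor in a sane state? How do I use the wrapper?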

I also found that contrib/rnn/kernels/blas_gemm.cc and core/kernels/matmul_op.cc seem to do what I want. The code looks like this:

#define EIGEN_USE_THREADS

#if GOOGLE_CUDA
#include "tensorflow/core/platform/stream_executor.h"
#endif  // GOOGLE_CUDA

#include "tensorflow/contrib/rnn/kernels/blas_gemm.h"
#include "tensorflow/core/framework/op_kernel.h"
namespace tensorflow {

#if GOOGLE_CUDA
namespace {
template <typename T>
perftools::gputools::DeviceMemory<T> AsDeviceMemory(const T* cuda_memory) {
  perftools::gputools::DeviceMemoryBase wrapped(const_cast<T*>(cuda_memory));
  perftools::gputools::DeviceMemory<T> typed(wrapped);
  return typed;
}
}  // namespace
#endif  // GOOGLE_CUDA

namespace functor {
template <typename T>
void TensorCuBlasGemm<T>::operator()(OpKernelContext* ctx,
                                     bool transa, bool transb, uint64 m,
                                     uint64 n, uint64 k, T alpha, const T* a,
                                     int lda, const T* b, int ldb, T beta, T* c,
                                     int ldc) {
#if GOOGLE_CUDA
  perftools::gputools::blas::Transpose trans[] = {
      perftools::gputools::blas::Transpose::kNoTranspose,
      perftools::gputools::blas::Transpose::kTranspose};

  auto a_ptr = AsDeviceMemory(a);
  auto b_ptr = AsDeviceMemory(b);
  auto c_ptr = AsDeviceMemory(c);

  bool blas_launch_status =
      ctx->op_device_context()
          ->stream()
          ->ThenBlasGemm(trans[transa], trans[transb], m, n, k, alpha, a_ptr,
                         lda, b_ptr, ldb, beta, &c_ptr, ldc)
          .ok();
  OP_REQUIRES(ctx, blas_launch_status, errors::Aborted("CuBlasGemm failed!"));
#else
  ctx->SetStatus(errors::InvalidArgument("CuBlasGemm needs CUDA."));
#endif
}

I.e., in my Compute(OpKernelContext* ctx), I would call

ctx->op_device_context()
      ->stream()
      ->ThenBlasGemm(...)

I tried that, but some include headers seem to be missing (TensorFlow 0.12.0 with GPU for Linux). I get fatal error: tensorflow/stream_executor/lib/status.h: No such file or directory. I reported this upstream here.

Is there any documentation about all of this, i.e. how to deal with cuBLAS, this DeviceStream interface, the stream executor logic, etc.?

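My current solution is somewhat of a hack. For CPU, I try to link against whatever BLAS library is available on the system and use sgemm from there. For CUDA, I link against tensorflow/contrib/rnn/python/ops/_lstm_ops.so, because that is where I found TensorCuBlasGemm, which I can use. See here. Basically, that contrib op faced the same problem and came up with this. But it partly depends on include files that are not generally available; see the issue above.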
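For reference, the contract that a BLAS sgemm and the TensorCuBlasGemm signature above share (transa, transb, m, n, k, alpha, lda, ldb, beta, ldc) can be sketched in plain C++. This is only an unoptimized illustration of the parameter semantics, assuming the usual column-major BLAS convention; the name reference_sgemm is hypothetical, and a real CPU kernel would call the sgemm of whichever BLAS library it links against.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference (unoptimized) single-precision GEMM with BLAS-style semantics:
//   C = alpha * op(A) * op(B) + beta * C
// All matrices are column-major; op(X) is X or X^T depending on the flag.
// op(A) is m x k, op(B) is k x n, C is m x n.
// Hypothetical helper for illustration, not part of TensorFlow or any BLAS.
void reference_sgemm(bool transa, bool transb, int m, int n, int k,
                     float alpha, const float* a, int lda,
                     const float* b, int ldb,
                     float beta, float* c, int ldc) {
  assert(m >= 0 && n >= 0 && k >= 0);
  assert(ldc >= (m > 0 ? m : 1));
  for (int j = 0; j < n; ++j) {
    for (int i = 0; i < m; ++i) {
      float acc = 0.0f;
      for (int p = 0; p < k; ++p) {
        // Element (i, p) of op(A) and (p, j) of op(B), column-major storage.
        const float a_ip = transa ? a[p + static_cast<std::size_t>(i) * lda]
                                  : a[i + static_cast<std::size_t>(p) * lda];
        const float b_pj = transb ? b[j + static_cast<std::size_t>(p) * ldb]
                                  : b[p + static_cast<std::size_t>(j) * ldb];
        acc += a_ip * b_pj;
      }
      float& out = c[i + static_cast<std::size_t>(j) * ldc];
      // As in BLAS, beta == 0 means C is overwritten without being read.
      out = alpha * acc + (beta == 0.0f ? 0.0f : beta * out);
    }
  }
}
```

Having such a reference around is mostly useful for checking the output of the CPU and CUDA kernels against each other on small inputs.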

Answer 1

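You can try the following, which worked for me today. At the beginning of your *.cu.cc file:

#include <cassert>
#include <cublas_v2.h>
cublasHandle_t cublas_handle = NULL;

In the same *.cu.cc file, in the functor implementation:

if (cublas_handle == NULL) {
  // Keep the calls outside assert() so they still run when NDEBUG is defined.
  cublasStatus_t status = cublasCreate(&cublas_handle);
  assert(status == CUBLAS_STATUS_SUCCESS);
  status = cublasSetStream(cublas_handle, d.stream());
  assert(status == CUBLAS_STATUS_SUCCESS);
}

where d is passed to the functor from the *.cc file as an argument.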

Hope this helps, cheers!
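Once a handle exists, one detail matters when calling cublasSgemm on TensorFlow buffers: cuBLAS assumes column-major storage, while TensorFlow tensors are row-major. The usual trick is to swap the operands, since a row-major matrix X has the same memory layout as a column-major X^T, so C = A * B becomes C^T = B^T * A^T. The sketch below shows only that index bookkeeping on the CPU; colmajor_gemm_nn is a hypothetical stand-in for a column-major, no-transpose gemm, and rowmajor_matmul is likewise an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU stand-in for a column-major, no-transpose BLAS gemm:
//   C (m x n) = A (m x k) * B (k x n), all column-major.
void colmajor_gemm_nn(int m, int n, int k, const float* a, int lda,
                      const float* b, int ldb, float* c, int ldc) {
  for (int j = 0; j < n; ++j) {
    for (int i = 0; i < m; ++i) {
      float acc = 0.0f;
      for (int p = 0; p < k; ++p) acc += a[i + p * lda] * b[p + j * ldb];
      c[i + j * ldc] = acc;
    }
  }
}

// Row-major C (m x n) = A (m x k) * B (k x n) using only the column-major
// routine. We compute C^T = B^T * A^T: pass B first, A second, and swap
// m and n. The leading dimensions are the row lengths of the row-major data.
std::vector<float> rowmajor_matmul(const std::vector<float>& a,
                                   const std::vector<float>& b,
                                   int m, int n, int k) {
  assert(a.size() == static_cast<std::size_t>(m) * k);
  assert(b.size() == static_cast<std::size_t>(k) * n);
  std::vector<float> c(static_cast<std::size_t>(m) * n, 0.0f);
  colmajor_gemm_nn(/*m=*/n, /*n=*/m, /*k=*/k,
                   b.data(), /*lda=*/n,
                   a.data(), /*ldb=*/k,
                   c.data(), /*ldc=*/n);
  return c;
}
```

With the handle from the answer, the corresponding GPU call would presumably look like cublasSgemm(cublas_handle, CUBLAS_OP_N, CUBLAS_OP_N, n, m, k, &alpha, B, n, A, k, &beta, C, n), where A, B and C are device pointers to row-major data; treat this as a sketch to verify against the cuBLAS documentation.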

How to call sgemm inside a custom TensorFlow C++ op
