Question

I have the following line of code, which produces a CPU variable that I need to copy to the GPU afterwards. gamma and gamma_x are also stored on the CPU. Is there any way to execute the following line and store its result directly on the GPU? So basically, host delta, gamma and gamma_x on the GPU and get the output of the following line on the GPU. That would speed up the lines that come after it considerably. I tried using magma_dcopy, but so far I could not find a way to make it work, because the output of magma_ddot is a CPU double.

magma_ddot

Answer 1

The short answer is no, you can't do that, or at least not if you use magma_ddot.

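However, magma_ddot is itself only a very thin wrapper around cublasDdot, and the cublas function fully supports storing the result of the operation in GPU memory rather than returning it to the host.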

In theory, you could do something like this:

// before the apparent loop you have not shown us:
double* dotresult;
cudaMalloc(&dotresult, sizeof(double));

for (int i=....) { 
    // ...

    // magma_ddot(i,&d_gamma_x[1],1,&(d_l2)[1],1, queue);
    cublasSetPointerMode( queue->cublas_handle(), CUBLAS_POINTER_MODE_DEVICE);
    // dotresult is already a device pointer, so pass it directly (not its address)
    cublasDdot(queue->cublas_handle(), i, &d_gamma_x[1], 1, &(d_l2)[1], 1, dotresult);
    cudaDeviceSynchronize();
    cublasSetPointerMode( queue->cublas_handle(), CUBLAS_POINTER_MODE_HOST);

    // Now dotresult holds the magma_ddot result in device memory

    // ...

}

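Note that this might make Magma blow up, depending on how you are using it, because Magma uses CUBLAS internally, and how Magma handles CUBLAS state and asynchronous operations internally is completely undocumented. Having said that, if you are careful, it should be OK.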
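One way to be careful about the pointer mode is to save whatever mode the handle had and restore it automatically. The following is only a sketch: the class name ScopedDevicePointerMode is hypothetical, and it assumes the handle is not being used concurrently by other code (including Magma itself) while the guard is alive. It uses only cublasGetPointerMode and cublasSetPointerMode from the cuBLAS API.

```cuda
#include <cublas_v2.h>

// Hypothetical RAII guard (not part of cuBLAS or Magma): switches a cuBLAS
// handle to device pointer mode and restores the previous mode on scope exit.
class ScopedDevicePointerMode {
public:
    explicit ScopedDevicePointerMode(cublasHandle_t handle)
        : handle_(handle), saved_(CUBLAS_POINTER_MODE_HOST)
    {
        // Remember the mode the handle was in before we touched it.
        cublasGetPointerMode(handle_, &saved_);
        cublasSetPointerMode(handle_, CUBLAS_POINTER_MODE_DEVICE);
    }

    ~ScopedDevicePointerMode()
    {
        // Put the handle back the way we found it.
        cublasSetPointerMode(handle_, saved_);
    }

    // Non-copyable: exactly one guard owns the restore.
    ScopedDevicePointerMode(const ScopedDevicePointerMode&) = delete;
    ScopedDevicePointerMode& operator=(const ScopedDevicePointerMode&) = delete;

private:
    cublasHandle_t handle_;
    cublasPointerMode_t saved_;
};
```

Inside the loop, the guard would wrap just the cublasDdot call in its own block scope, so the handle is back in its original mode before any Magma routine runs again.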

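To then perform your calculation, either write a very simple kernel and launch it with a single thread, or use a simple thrust call with a lambda expression, depending on your preference. I leave that as an exercise for the reader.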
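For the single-thread kernel option, a minimal sketch might look like the following. The names combine_scalars and combine_on_device are hypothetical, and the arithmetic is only a placeholder for whatever scalar expression is actually needed. It assumes every operand already lives in device memory as a single double.

```cuda
#include <cuda_runtime.h>

// Hypothetical single-thread kernel: all four pointers refer to device memory.
// The expression is a placeholder; substitute the scalar arithmetic you need.
__global__ void combine_scalars(const double* dot, const double* a,
                                const double* b, double* out)
{
    *out = (*a + *dot) / *b;
}

// Host-side wrapper used only to exercise the kernel in isolation.
// It copies three scalars to the device, runs the kernel, and reads back
// the result. In real code the operands would already be on the GPU.
double combine_on_device(double dot, double a, double b)
{
    double* d = nullptr;  // d[0] = dot, d[1] = a, d[2] = b, d[3] = out
    cudaMalloc(&d, 4 * sizeof(double));

    const double h[3] = {dot, a, b};
    cudaMemcpy(d, h, 3 * sizeof(double), cudaMemcpyHostToDevice);

    // One block, one thread: this is scalar work, not a parallel reduction.
    combine_scalars<<<1, 1>>>(d, d + 1, d + 2, d + 3);

    double result = 0.0;
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&result, d + 3, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return result;
}
```

In the loop from the answer, the kernel would be launched right after cublasDdot with dotresult as its first argument, ideally on the same stream the cuBLAS handle uses so that the two operations stay ordered without a full device synchronization.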

How to perform basic operations (+ - * /) on the GPU and store the result on it
