Question

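My question is: "Can the tex1Dfetch function be used with pitched linear memory?"

Section B.8.1.1, tex1Dfetch(), says "from the region of linear memory....".

I am optimizing a kernel that is bound by global memory reads. It uses a grid-stride loop to load float4 values: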

float4 x = XYZW[i];                      // float4 const * const XYZW
float4 x = tex1Dfetch<float4>(XYZW, i);  // cudaTextureObject_t XYZW
float4 x = tex1D<float4>(XYZW, i);       // cudaTextureObject_t XYZW

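The first form, using a pointer argument, works fine. The tex1Dfetch form returns all zeros. The tex1D call returns aliased/psychedelic garbage.

In all cases the memory is allocated with cudaMallocPitch. The texture loads profile as faster, with tex1Dfetch the fastest, so I am keen to get it working correctly.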

Answer 1

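My question is "Can the tex1Dfetch function be used with pitched linear memory?"

In C/C++, a pointer is just a bare number. Once a region has been allocated, the returned pointer itself is conceptually no different whether it came from cudaMalloc or cudaMallocPitch. It is the address of a location in device memory.

tex1Dfetch can work with either kind of allocation (cudaMalloc or cudaMallocPitch). The actual values you get back depend on how you handle the pitch region in the cudaMallocPitch case, and tex1Dfetch will not do anything "automatic" for you. So using tex1Dfetch with a pitched allocation does not seem very sensible to me, but it can be made to "work", i.e. to fetch the correct values.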
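Since tex1Dfetch treats the allocation as one flat array, the row padding has to be accounted for in the index you pass. A minimal sketch of that index arithmetic follows; the helper names pitched_index and logical_to_pitched are hypothetical (they are not part of the CUDA API), and the sketch assumes the pitch is a whole multiple of the element size.

```cpp
#include <cassert>
#include <cstddef>

// Lets the same helpers compile as plain C++ or as CUDA host/device code.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: converts a (row, col) position in a pitched
// allocation into the flat element index that a 1D fetch would need.
// pitch_bytes is the value reported by cudaMallocPitch.
HD inline int pitched_index(int row, int col, std::size_t pitch_bytes,
                            std::size_t elem_bytes) {
  // Assumption: each padded row holds a whole number of elements.
  assert(pitch_bytes % elem_bytes == 0);
  int elems_per_row = static_cast<int>(pitch_bytes / elem_bytes);
  return row * elems_per_row + col;
}

// Hypothetical helper: maps a logical index i over an unpadded view that is
// `width` elements wide onto the padded layout.
HD inline int logical_to_pitched(int i, int width, std::size_t pitch_bytes,
                                 std::size_t elem_bytes) {
  return pitched_index(i / width, i % width, pitch_bytes, elem_bytes);
}
```

In a kernel this would be used along the lines of tex1Dfetch<float>(tex, logical_to_pitched(i, width, pitch, sizeof(float))), with width and pitch passed in as kernel arguments. The extra division and modulo per fetch are part of why a plain cudaMalloc allocation is usually the simpler fit for tex1Dfetch.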

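Here is a simple demonstration of tex1Dfetch reading from a pitched allocation, based on a slight modification of the code here: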

$ cat t434.cu
#include <stdio.h>
#include <stdlib.h>  // malloc, free
#include <string.h>  // memset
#define N 32
#define M 128

// texture object is a kernel argument
__global__ void kernel(cudaTextureObject_t tex) {
  int i = blockIdx.x *blockDim.x + threadIdx.x;
  float x = tex1Dfetch<float>(tex, i);
  if (i < 256) printf("%d %f\n", i, x);
}

void call_kernel(cudaTextureObject_t tex) {
  dim3 block(128,1,1);
  dim3 grid((N*M)/block.x,1,1);
  kernel <<<grid, block>>>(tex);
}

int main() {
  // declare and allocate memory
  float *buffer, *h_buffer;
  size_t pitch;
  cudaMallocPitch(&buffer, &pitch, N*sizeof(float),M);
  printf("pitch = %zu\n", pitch);  // pitch is a size_t, reported in bytes
  cudaMemset(buffer, 0, M*pitch);
  h_buffer=(float *)malloc(N*M*sizeof(float));
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++) h_buffer[i*N+j] = i+1;
  cudaMemcpy2D(buffer, pitch, h_buffer, N*sizeof(float), N*sizeof(float), M,  cudaMemcpyHostToDevice);
  // create texture object
  cudaResourceDesc resDesc;
  memset(&resDesc, 0, sizeof(resDesc));
  resDesc.resType = cudaResourceTypeLinear;
  resDesc.res.linear.devPtr = buffer;
  resDesc.res.linear.desc.f = cudaChannelFormatKindFloat;
  resDesc.res.linear.desc.x = 32; // bits per channel
  resDesc.res.linear.sizeInBytes = M*pitch; // pitch is already in bytes

  cudaTextureDesc texDesc;
  memset(&texDesc, 0, sizeof(texDesc));
  texDesc.readMode = cudaReadModeElementType;

  // create texture object: we only have to do this once!
  cudaTextureObject_t tex=0;
  cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

  call_kernel(tex); // pass texture as argument
  cudaDeviceSynchronize(); // wait for the kernel before destroying the texture

  // destroy texture object
  cudaDestroyTextureObject(tex);

  cudaFree(buffer);
  free(h_buffer);
  return 0;
}
$ nvcc -arch=sm_61 -o t434 t434.cu
$ cuda-memcheck ./t434
========= CUDA-MEMCHECK
pitch = 512
0 1.000000
1 1.000000
2 1.000000
3 1.000000
4 1.000000
5 1.000000
6 1.000000
7 1.000000
8 1.000000
9 1.000000
10 1.000000
11 1.000000
12 1.000000
13 1.000000
14 1.000000
15 1.000000
16 1.000000
17 1.000000
18 1.000000
19 1.000000
20 1.000000
21 1.000000
22 1.000000
23 1.000000
24 1.000000
25 1.000000
26 1.000000
27 1.000000
28 1.000000
29 1.000000
30 1.000000
31 1.000000
32 0.000000
33 0.000000
34 0.000000
35 0.000000
36 0.000000
37 0.000000
38 0.000000
39 0.000000
40 0.000000
41 0.000000
42 0.000000
43 0.000000
44 0.000000
45 0.000000
46 0.000000
47 0.000000
48 0.000000
49 0.000000
50 0.000000
51 0.000000
52 0.000000
53 0.000000
54 0.000000
55 0.000000
56 0.000000
57 0.000000
58 0.000000
59 0.000000
60 0.000000
61 0.000000
62 0.000000
63 0.000000
96 0.000000
97 0.000000
98 0.000000
99 0.000000
100 0.000000
101 0.000000
102 0.000000
103 0.000000
104 0.000000
105 0.000000
106 0.000000
107 0.000000
108 0.000000
109 0.000000
110 0.000000
111 0.000000
112 0.000000
113 0.000000
114 0.000000
115 0.000000
116 0.000000
117 0.000000
118 0.000000
119 0.000000
120 0.000000
121 0.000000
122 0.000000
123 0.000000
124 0.000000
125 0.000000
126 0.000000
127 0.000000
64 0.000000
65 0.000000
66 0.000000
67 0.000000
68 0.000000
69 0.000000
70 0.000000
71 0.000000
72 0.000000
73 0.000000
74 0.000000
75 0.000000
76 0.000000
77 0.000000
78 0.000000
79 0.000000
80 0.000000
81 0.000000
82 0.000000
83 0.000000
84 0.000000
85 0.000000
86 0.000000
87 0.000000
88 0.000000
89 0.000000
90 0.000000
91 0.000000
92 0.000000
93 0.000000
94 0.000000
95 0.000000
128 2.000000
129 2.000000
130 2.000000
131 2.000000
132 2.000000
133 2.000000
134 2.000000
135 2.000000
136 2.000000
137 2.000000
138 2.000000
139 2.000000
140 2.000000
141 2.000000
142 2.000000
143 2.000000
144 2.000000
145 2.000000
146 2.000000
147 2.000000
148 2.000000
149 2.000000
150 2.000000
151 2.000000
152 2.000000
153 2.000000
154 2.000000
155 2.000000
156 2.000000
157 2.000000
158 2.000000
159 2.000000
192 0.000000
193 0.000000
194 0.000000
195 0.000000
196 0.000000
197 0.000000
198 0.000000
199 0.000000
200 0.000000
201 0.000000
202 0.000000
203 0.000000
204 0.000000
205 0.000000
206 0.000000
207 0.000000
208 0.000000
209 0.000000
210 0.000000
211 0.000000
212 0.000000
213 0.000000
214 0.000000
215 0.000000
216 0.000000
217 0.000000
218 0.000000
219 0.000000
220 0.000000
221 0.000000
222 0.000000
223 0.000000
160 0.000000
161 0.000000
162 0.000000
163 0.000000
164 0.000000
165 0.000000
166 0.000000
167 0.000000
168 0.000000
169 0.000000
170 0.000000
171 0.000000
172 0.000000
173 0.000000
174 0.000000
175 0.000000
176 0.000000
177 0.000000
178 0.000000
179 0.000000
180 0.000000
181 0.000000
182 0.000000
183 0.000000
184 0.000000
185 0.000000
186 0.000000
187 0.000000
188 0.000000
189 0.000000
190 0.000000
191 0.000000
224 0.000000
225 0.000000
226 0.000000
227 0.000000
228 0.000000
229 0.000000
230 0.000000
231 0.000000
232 0.000000
233 0.000000
234 0.000000
235 0.000000
236 0.000000
237 0.000000
238 0.000000
239 0.000000
240 0.000000
241 0.000000
242 0.000000
243 0.000000
244 0.000000
245 0.000000
246 0.000000
247 0.000000
248 0.000000
249 0.000000
250 0.000000
251 0.000000
252 0.000000
253 0.000000
254 0.000000
255 0.000000
========= ERROR SUMMARY: 0 errors

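We see that the allocation pitch is 512 bytes. This corresponds to a total pitched width of 128 float quantities per row. However, the 2D allocation specifies only 32 elements (N) per row. We initialize the 2D region of 32 elements per row so that each value is the row index plus 1. The remaining "pitch" region is initialized to zero. In the output, the validly defined elements of row 0 are correctly printed as 1, the validly defined elements of row 1 are correctly printed as 2, and all other elements are printed as 0, because those fetches land in the "pitch" region.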
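The arithmetic behind that output can be checked with a small host-side model of the layout. This is only a sketch: FetchInfo, decode_fetch_index and expected_value are hypothetical names introduced here, and the model assumes the same setup as the demo (float elements, row value equal to row index plus 1, padding cleared to zero).

```cpp
#include <cstddef>

// Hypothetical description of where a flat fetch index lands.
struct FetchInfo {
  int row;       // row of the pitched allocation
  int col;       // element offset within the padded row
  bool padding;  // true if the element lies beyond the valid row width
};

// Splits a flat element index into (row, col) using the padded row length.
inline FetchInfo decode_fetch_index(int i, int width, std::size_t pitch_bytes,
                                    std::size_t elem_bytes) {
  int elems_per_row = static_cast<int>(pitch_bytes / elem_bytes);
  FetchInfo info;
  info.row = i / elems_per_row;
  info.col = i % elems_per_row;
  info.padding = info.col >= width;
  return info;
}

// Value the demo is expected to print for flat index i: row + 1 for valid
// elements, 0 for elements in the zeroed padding.
inline float expected_value(int i, int width, std::size_t pitch_bytes) {
  FetchInfo info = decode_fetch_index(i, width, pitch_bytes, sizeof(float));
  return info.padding ? 0.0f : static_cast<float>(info.row + 1);
}
```

With a pitch of 512 bytes and 4-byte floats, each padded row holds 128 elements, so indices 0 to 31 fall in row 0, indices 32 to 127 fall in its padding, and index 128 is the first valid element of row 1.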

CUDA tex1Dfetch and pitched memory
