I am converting a program from a regular C program to a CUDA program and wanted to implement a simple wrapper for malloc that just uses a large pool.
I have 5000 threads. My block size is 1024.
Here is the buffer structure I am using to keep track of each thread's memory pool:
typedef struct buffer_t
{
    unsigned long size;
    char* current_index;
    char pool[];
} buffer_t;
To do the allocation, I use:
cudaMalloc(&memptr, 262144*5000);
where each thread is supposed to create a buffer in its own 262144-byte slice.
Here are the functions I am using to do the allocations:
__device__ buffer_t* buffer_constructor(size_t size, void* memptr)
{
    buffer_t* buffer = (buffer_t*)memptr;
    buffer->size = size - sizeof(unsigned long) - sizeof(char*);
    buffer->current_index = buffer->pool;
    return buffer;
}
__device__ void* buffer_malloc(buffer_t* buffer, size_t size)
{
    if(size > buffer->size - (buffer->current_index - buffer->pool))
    {
        return NULL;
    }
    void* ptr = buffer->current_index;
    buffer->current_index += size;
    return ptr;
}
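One thing that may be relevant to the error reported further down: buffer_malloc advances current_index by exactly the requested size, so after any request that is not a multiple of 8 bytes, the next pointer it hands out is no longer 8-byte aligned. The following is only a minimal sketch of a variant that rounds each request up, not a confirmed fix. The names BUFFER_ALIGN, align_up and buffer_malloc_aligned are hypothetical, and the 8-byte figure assumes the widest load or store into the pool is 8 bytes. The __device__ fallback macro is there only so the sketch also compiles as plain C.

```c
#include <stddef.h>

/* Lets the sketch compile as plain C as well as under nvcc. */
#ifndef __CUDACC__
#define __device__
#endif

/* Assumption: no access into the pool is wider than 8 bytes. */
#define BUFFER_ALIGN ((size_t)8)

typedef struct buffer_t
{
    unsigned long size;
    char* current_index;
    char pool[];
} buffer_t;

/* Round n up to the next multiple of BUFFER_ALIGN (a power of two). */
__device__ size_t align_up(size_t n)
{
    return (n + (BUFFER_ALIGN - 1)) & ~(BUFFER_ALIGN - 1);
}

/* Same bump allocation as buffer_malloc, but every request is padded
   so that the next returned pointer stays BUFFER_ALIGN-aligned. */
__device__ void* buffer_malloc_aligned(buffer_t* buffer, size_t size)
{
    size_t used = (size_t)(buffer->current_index - buffer->pool);
    size_t need = align_up(size);

    /* need < size catches wrap-around in align_up for huge requests. */
    if (need < size || need > buffer->size - used)
    {
        return NULL;
    }
    void* ptr = buffer->current_index;
    buffer->current_index += need;
    return ptr;
}
```

This only keeps pointers aligned if the pool itself starts on an 8-byte boundary. That should hold here as long as each slice starts at a multiple of 8 bytes from the cudaMalloc base pointer (262144 is a multiple of 8) and the header in front of pool is 16 bytes, as it is on a typical 64-bit build.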
Each thread calls:
buffer_t* buffer = buffer_constructor(size, memptr+(tid * size));
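For reference, the slice lookup can be sketched as a small helper. With 5000 threads and a block size of 1024, a launch that rounds up to whole blocks would start 5 * 1024 = 5120 threads, so the last 120 would have no slice in a 5000-slice allocation. The launch configuration is not shown here, so that rounding is an assumption, and thread_slice and num_threads are hypothetical names. The cast to char* makes the byte arithmetic explicit instead of relying on arithmetic on a void*.

```c
#include <stddef.h>

/* Lets the sketch compile as plain C as well as under nvcc. */
#ifndef __CUDACC__
#define __device__
#endif

/* Returns the start of thread tid's slice inside the big allocation,
   or NULL when tid is past the last thread that owns a slice. */
__device__ void* thread_slice(void* memptr, size_t tid,
                              size_t num_threads, size_t slice_size)
{
    if (tid >= num_threads)
    {
        return NULL;
    }
    return (char*)memptr + tid * slice_size;
}
```

A thread would then pass the result to buffer_constructor only when it is not NULL, for example with num_threads = 5000 and slice_size = 262144.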
When I run the code, the kernel just returns early at some point. When I run it under the debugger, I get this error:
Program received signal CUDA_EXCEPTION_6, Warp Misaligned Address.
[Switching focus to CUDA kernel 0, grid 1, block (2,0,0), thread (768,0,0), device 0, sm 10, warp 24, lane 0]
0x0000000000b48428 in device_matrix_list_constructor (buffer=<optimized out>, num=<optimized out>)
at device_matrix_list.cu:8
8 return list;
When I run memcheck, I get several errors like this one, in a few different blocks:
Invalid __global__ write of size 8
========= at 0x00000258 in /home/crafton.b/cuda_nn/device_matrix_list.cu:7:device_matrix_list_constructor(buffer_t*, unsigned int)
========= by thread (897,0,0) in block (4,0,0)
========= Address 0x235202a0fc is misaligned
Any help is really appreciated. I have been struggling with this for a while now.