Using the NVIDIA CUDA Stream-Ordered Memory Allocator, Part 1


Most CUDA developers are familiar with the cudaMalloc and cudaFree API functions for allocating GPU-accessible memory. However, these API functions have long had a drawback: they aren't stream ordered. In this post, we introduce new API functions, cudaMallocAsync and cudaFreeAsync, that enable memory allocation and deallocation to be stream-ordered operations. In part 2 of this series, we highlight the benefits of this new capability by sharing some big data benchmark results, and we provide a code migration guide for modifying your existing applications. We also cover advanced topics to take advantage of stream-ordered memory allocation in the context of multi-GPU access and the use of IPC. All of this helps you improve performance within your existing applications.

The first version in the following code example is inefficient because the first cudaFree call has to wait for kernelA to finish, so it synchronizes the device before freeing the memory. To make this run more efficiently, the memory can be allocated upfront and sized to the larger of the two sizes, as shown in the second version.
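
Here is a minimal sketch of the two versions. The kernel bodies, launch configurations, and the size1/size2 parameters are placeholders, not part of the original example:

```cpp
#include <algorithm>
#include <cuda_runtime.h>

// Placeholder kernels standing in for the application's real work.
__global__ void kernelA(char* p) { p[0] = 1; }
__global__ void kernelB(char* p) { p[0] = 2; }

// Inefficient: the first cudaFree must wait for kernelA to finish, so it
// synchronizes the device before the memory can be freed and reallocated.
void inefficientVersion(size_t size1, size_t size2, cudaStream_t stream) {
    void* ptr = nullptr;
    cudaMalloc(&ptr, size1);
    kernelA<<<1, 1, 0, stream>>>(static_cast<char*>(ptr));
    cudaFree(ptr);  // synchronizes: waits for kernelA to finish
    cudaMalloc(&ptr, size2);
    kernelB<<<1, 1, 0, stream>>>(static_cast<char*>(ptr));
    cudaFree(ptr);  // synchronizes again
}

// More efficient: allocate once upfront, sized to the larger of the two
// uses, so nothing synchronizes between the two kernel launches.
void efficientVersion(size_t size1, size_t size2, cudaStream_t stream) {
    void* ptr = nullptr;
    cudaMalloc(&ptr, std::max(size1, size2));
    kernelA<<<1, 1, 0, stream>>>(static_cast<char*>(ptr));
    kernelB<<<1, 1, 0, stream>>>(static_cast<char*>(ptr));
    cudaFree(ptr);  // single synchronizing free at the end
}
```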


This increases code complexity in the application because the memory management code is separated from the business logic. The problem is exacerbated when other libraries are involved. This is much harder for the application to make efficient, because it may not have complete visibility or control over what the library is doing. To circumvent this problem, the library would have to allocate memory when a function is invoked for the first time and never free it until the library is deinitialized, as in the sketch below. This not only increases code complexity, but it also causes the library to hold on to the memory longer than it needs to, potentially denying another portion of the application from using that memory. Some applications take the idea of allocating memory upfront even further by implementing their own custom allocator. This adds a significant amount of complexity to application development. CUDA aims to provide a low-effort, high-performance alternative.
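
The following is a minimal sketch of that caching pattern; all names here are hypothetical and stand in for whatever a real library would do:

```cpp
#include <cuda_runtime.h>

namespace somelib {

static void* s_buffer = nullptr;  // cached across calls
static size_t s_bytes = 0;

void process(size_t bytes, cudaStream_t stream) {
    if (bytes > s_bytes) {                  // (re)allocate on first or larger use
        if (s_buffer) cudaFree(s_buffer);   // synchronizes the device
        cudaMalloc(&s_buffer, bytes);
        s_bytes = bytes;
    }
    // ... launch kernels on `stream` that use s_buffer ...
}

void deinit() {  // the memory is held until this is finally called
    cudaFree(s_buffer);
    s_buffer = nullptr;
    s_bytes = 0;
}

}  // namespace somelib
```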


CUDA 11.2 introduced a stream-ordered memory allocator to solve these kinds of problems, with the addition of cudaMallocAsync and cudaFreeAsync. These new API functions shift memory allocation from global-scope operations that synchronize the entire device to stream-ordered operations, allowing you to compose memory management with GPU work submission. This eliminates the need to synchronize outstanding GPU work and helps restrict the lifetime of an allocation to the GPU work that accesses it. It is now possible to manage memory at function scope, as in the example below of a library function launching kernelA. All the usual stream-ordering rules apply to cudaMallocAsync and cudaFreeAsync. The memory returned from cudaMallocAsync can be accessed by any kernel or memcpy operation as long as that kernel or memcpy is ordered to execute after the allocation operation and before the deallocation operation, in stream order. Deallocation can be performed in any stream, as long as it is ordered to execute after the allocation operation and after all accesses to that memory on all streams of the GPU.
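
A minimal sketch of such a library function follows; the kernel body and the buffer size are placeholder assumptions:

```cpp
#include <cuda_runtime.h>

// Placeholder kernel standing in for the library's real work.
__global__ void kernelA(float* data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = static_cast<float>(i);
}

// Memory is managed at function scope: the allocation, the kernel that
// uses it, and the free are all ordered on the caller's stream, so no
// device synchronization is needed.
void libraryWork(cudaStream_t stream) {
    constexpr size_t n = 1 << 20;  // placeholder size
    void* ptr = nullptr;
    cudaMallocAsync(&ptr, n * sizeof(float), stream);  // stream-ordered alloc
    kernelA<<<(n + 255) / 256, 256, 0, stream>>>(static_cast<float*>(ptr), n);
    cudaFreeAsync(ptr, stream);                        // stream-ordered free
}
```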


In effect, stream-ordered allocation behaves as if allocation and free were themselves kernels. If kernelA produces a valid buffer on a stream and kernelB invalidates it on the same stream, then an application is free to access the buffer after kernelA and before kernelB, in the appropriate stream order. The example below shows various valid usages, and Figure 1 shows the dependencies specified in that example: as you can see, all kernels are ordered to execute after the allocation operation and to complete before the deallocation operation.

Memory allocation and deallocation cannot fail asynchronously. Memory errors that occur as a result of a call to cudaMallocAsync or cudaFreeAsync (for example, running out of memory) are reported immediately through the error code returned from the call. If cudaMallocAsync completes successfully, the returned pointer is guaranteed to be a valid pointer to memory that is safe to access in the appropriate stream order. The CUDA driver uses memory pools to achieve this behavior of returning a pointer immediately.
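
Here is a sketch of those valid usages across multiple streams; the kernel body, the allocation size, and the stream and event setup are simplified placeholders:

```cpp
#include <cassert>
#include <cuda_runtime.h>

__global__ void kernel(char* p) { p[0] += 1; }  // placeholder kernel

void validUsages() {
    cudaStream_t streamA, streamB, streamC, streamD;
    cudaStreamCreate(&streamA);
    cudaStreamCreate(&streamB);
    cudaStreamCreate(&streamC);
    cudaStreamCreate(&streamD);
    cudaEvent_t event;
    cudaEventCreate(&event);

    void* ptr = nullptr;
    // Errors such as out-of-memory are reported immediately by the call.
    cudaError_t err = cudaMallocAsync(&ptr, 1024, streamA);
    assert(err == cudaSuccess);

    // Work launched in the same stream can access the memory: operations
    // within a stream are serialized by definition.
    kernel<<<1, 1, 0, streamA>>>(static_cast<char*>(ptr));

    // Work launched in another stream can access the memory once the
    // appropriate dependency on the allocating stream is added.
    cudaEventRecord(event, streamA);
    cudaStreamWaitEvent(streamB, event, 0);
    kernel<<<1, 1, 0, streamB>>>(static_cast<char*>(ptr));

    // Synchronizing the host with a point beyond the allocation also
    // enables any stream to access the memory.
    cudaEventSynchronize(event);
    kernel<<<1, 1, 0, streamC>>>(static_cast<char*>(ptr));

    // Deallocation can happen on yet another stream (streamD here), as
    // long as it is ordered after all accesses on all streams.
    cudaEventRecord(event, streamB);
    cudaStreamWaitEvent(streamD, event, 0);  // join streamB's accesses
    cudaStreamSynchronize(streamC);          // join streamC's accesses
    cudaFreeAsync(ptr, streamD);

    cudaStreamSynchronize(streamD);
    cudaEventDestroy(event);
    cudaStreamDestroy(streamA);
    cudaStreamDestroy(streamB);
    cudaStreamDestroy(streamC);
    cudaStreamDestroy(streamD);
}
```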