CUDA apps time out & fail after several seconds – how to work around this

cudagpgpugputimeout

I've noticed that CUDA applications tend to have a rough maximum run-time of 5-15 seconds before they will fail and exit out. I realize it's ideal to not have CUDA application run that long but assuming that it is the correct choice to use CUDA and due to the amount of sequential work per thread it must run that long, is there any way to extend this amount of time or to get around it?

Best Answer

I'm not a CUDA expert, --- I've been developing with the AMD Stream SDK, which AFAIK is roughly comparable.

You can disable the Windows watchdog timer, but that is highly not recommended, for reasons that should be obvious. To disable it, you need to regedit HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Watchdog\Display\DisableBugCheck, create a REG_DWORD and set it to 1. You may also need to do something in the NVidia control panel. Look for some reference to "VPU Recovery" in the CUDA docs.

Ideally, you should be able to break your kernel operations up into multiple passes over your data to break it up into operations that run in the time limit.

Alternatively, you can divide the problem domain up so that it's computing fewer output pixels per command. I.e., instead of computing 1,000,000 output pixels in one fell swoop, issue 10 commands to the gpu to compute 100,000 each.

The basic unit that has to fit within the time slice is not your entire application, but the execution of a single command buffer. In the AMD Stream SDK, a long sequence of operations can be broken up into multiple time slices by explicitly flushing the command queue with a CtxFlush() call. Perhaps CUDA has something similar?

You should not have to read all of your data back and forth across the PCIX bus on every time slice; you can leave your textures, etc. in gpu local memory; you just have some command buffers complete occasionally, to prove to the OS that you're not stuck in an infinite loop.

Finally, GPUs are fast, so if your application is not able to do useful work in that 5 or 10 seconds, I'd take that as a sign that something is wrong.

[EDIT Mar 2010 to update:] (outdated again, see the updates below for the most recent information) The registry key above is out-of-date. I think that was the key for Windows XP 64-bit. There are new registry keys for Vista and Windows 7. You can find them here: http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx or here: http://msdn.microsoft.com/en-us/library/ee817001.aspx

[EDIT Apr 2015 to update:] This is getting really out of date. The easiest way to disable TDR for Cuda programming, assuming you have the NVIDIA Nsight tools installed, is to open the Nsight Monitor, click on "Nsight Monitor options", and under "General" set "WDDM TDR enabled" to false. This will change the registry setting for you. Close and reboot. Any change to the TDR registry setting won't take effect until you reboot.

[EDIT August 2018 to update:] Although the NVIDIA tools allow disabling the TDR now, the same question is relevant for AMD/OpenCL developers. For those: The current link that documents the TDR settings is at https://docs.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys

Related Solutions

How is CUDA memory managed

The device memory available to your code at runtime is basically calculated as

Free memory =   total memory 
              - display driver reservations 
              - CUDA driver reservations
              - CUDA context static allocations (local memory, constant memory, device code)
              - CUDA context runtime heap (in kernel allocations, recursive call stack, printf buffer, only on Fermi and newer GPUs)
              - CUDA context user allocations (global memory, textures)

if you are getting an out of memory message, then it is likely that one or more of the first three items is consuming most of the GPU memory before your user code ever tries to get memory in the GPU. If, as you have indicated, you are not running on a display GPU, then the context static allocations are the most likely source of your problem. CUDA works by pre-allocating all the memory a context requires at the time the context is established on the device. There are a lot of things which get allocated to support a context, but the single biggest consumer in a context is local memory. The runtime must reserve the maximum amount of local memory which any kernel in a context will consume for the maximum number of threads which each multiprocessor can run simultaneously, for each multiprocess on the device. This can run into hundreds of Mb of memory if a local memory heavy kernel is loaded on a device with a lot of multiprocessors.

The best way to see what might be going on is to write a host program with no device code which establishes a context and calls cudaMemGetInfo. That will show you how much memory the device has with the minimal context overhead on it. Then run you problematic code, adding the same cudaMemGetInfo call before the first cudaMalloc call that will then give you the amount of memory your context is using. That might let you get a handle of where the memory is going. It is very unlikely that fragmentation is the problem if you are getting failure on the first cudaMalloc call.

How do CUDA blocks/warps/threads map onto CUDA cores

Two of the best references are

I'll try to answer each of your questions.

The programmer divides work into threads, threads into thread blocks, and thread blocks into grids. The compute work distributor allocates thread blocks to Streaming Multiprocessors (SMs). Once a thread block is distributed to a SM the resources for the thread block are allocated (warps and shared memory) and threads are divided into groups of 32 threads called warps. Once a warp is allocated it is called an active warp. The two warp schedulers pick two active warps per cycle and dispatch warps to execution units. For more details on execution units and instruction dispatch see 1 p.7-10 and 2.

4'. There is a mapping between laneid (threads index in a warp) and a core.

5'. If a warp contains less than 32 threads it will in most cases be executed the same as if it has 32 threads. Warps can have less than 32 active threads for several reasons: number of threads per block is not divisible by 32, the program execute a divergent block so threads that did not take the current path are marked inactive, or a thread in the warp exited.

6'. A thread block will be divided into WarpsPerBlock = (ThreadsPerBlock + WarpSize - 1) / WarpSize There is no requirement for the warp schedulers to select two warps from the same thread block.

7'. An execution unit will not stall on a memory operation. If a resource is not available when an instruction is ready to be dispatched the instruction will be dispatched again in the future when the resource is available. Warps can stall at barriers, on memory operations, texture operations, data dependencies, ... A stalled warp is ineligible to be selected by the warp scheduler. On Fermi it is useful to have at least 2 eligible warps per cycle so that the warp scheduler can issue an instruction.

See reference 2 for differences between a GTX480 and GTX560.

If you read the reference material (few minutes) I think you will find that your goal does not make sense. I'll try to respond to your points.

1'. If you launch kernel<<<8, 48>>> you will get 8 blocks each with 2 warps of 32 and 16 threads. There is no guarantee that these 8 blocks will be assigned to different SMs. If 2 blocks are allocated to a SM then it is possible that each warp scheduler can select a warp and execute the warp. You will only use 32 of the 48 cores.

2'. There is a big difference between 8 blocks of 48 threads and 64 blocks of 6 threads. Let's assume that your kernel has no divergence and each thread executes 10 instructions.

8 blocks with 48 threads = 16 warps * 10 instructions = 160 instructions
64 blocks with 6 threads = 64 warps * 10 instructions = 640 instructions

In order to get optimal efficiency the division of work should be in multiples of 32 threads. The hardware will not coalesce threads from different warps.

3'. A GTX560 can have 8 SM * 8 blocks = 64 blocks at a time or 8 SM * 48 warps = 512 warps if the kernel does not max out registers or shared memory. At any given time on a portion of the work will be active on SMs. Each SM has multiple execution units (more than CUDA cores). Which resources are in use at any given time is dependent on the warp schedulers and instruction mix of the application. If you don't do TEX operations then the TEX units will be idle. If you don't do a special floating point operation the SUFU units will idle.

4'. Parallel Nsight and the Visual Profiler show

a. executed IPC

b. issued IPC

c. active warps per active cycle

d. eligible warps per active cycle (Nsight only)

e. warp stall reasons (Nsight only)

f. active threads per instruction executed

The profiler do not show the utilization percentage of any of the execution units. For GTX560 a rough estimate would be IssuedIPC / MaxIPC. For MaxIPC assume GF100 (GTX480) is 2 GF10x (GTX560) is 4 but target is 3 is a better target.

Best Answer

Related Solutions

How is CUDA memory managed

How do CUDA blocks/warps/threads map onto CUDA cores

Related Topic