Overhaul of the Integration Kernels

Since the beginning of Octane, the integration kernels had one CUDA thread calculate one complete sample. We changed this for various reasons, the main one being that the integration kernels had become huge and practically impossible to optimize. OSL and OpenCL are also pretty much impossible to implement this way. To solve the problem, we split the big task of calculating a sample into smaller steps which are then processed one by one by the CUDA threads, i.e. many more kernel calls happen now than in the past.
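
To make this more concrete, here is a minimal sketch of what splitting one huge "megakernel" into per-stage kernels can look like. It is a made-up example, not Octane's actual source; SampleState and the kernel names are hypothetical:

    #include <cuda_runtime.h>

    // Hypothetical per-sample state that must survive between kernel launches.
    struct SampleState {
        float3 rayOrigin, rayDir;   // current ray of the path
        float3 throughput;          // path throughput accumulated so far
        int    depth;               // current bounce
    };

    // Each stage is a small, specialized kernel instead of one huge kernel.
    __global__ void generateCameraRays(SampleState* s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) { s[i].depth = 0; s[i].throughput = make_float3(1.f, 1.f, 1.f); }
    }

    __global__ void intersectAndShade(SampleState* s, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) { s[i].depth++; /* trace one bounce, update ray + throughput */ }
    }

    void renderIteration(SampleState* d_state, int parallelSamples, int maxDepth) {
        const int block = 256;
        const int grid  = (parallelSamples + block - 1) / block;
        // Many small launches per sample instead of one big one: more CPU-side
        // launch overhead, but each kernel stays small enough to optimize (and
        // small enough to port to OSL/OpenCL back-ends).
        generateCameraRays<<<grid, block>>>(d_state, parallelSamples);
        for (int bounce = 0; bounce < maxDepth; ++bounce)
            intersectAndShade<<<grid, block>>>(d_state, parallelSamples);
        cudaDeviceSynchronize();
    }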


This new approach has two major consequences: Octane needs to keep the state of every sample that is calculated in parallel between kernel calls, which requires additional GPU memory, and the CPU is stressed a bit more since it has to issue many more kernel launches. To give you some control over the kernel execution, we added two options to the direct lighting / path tracing / info channel kernel nodes:


+ "Parallel samples" controls how many samples we calculate in parallel. If you set it to a small value, Octane requires less memory to store the samples state, but most likely renders a bit slower. If you set it to a high value, more graphics memory is needed rendering becomes faster. The change in performance depends on the scene, the GPU architecture and the number of shader processors the GPU has.

+ "Max. tile samples" controls the number of samples per pixel Octane renders until it takes the result and stores it in the film buffer. A higher number means that results arrive less often at the film buffer, but reduce the CPU overhead during rendering and as a consequence can improve performance, too.


Comparison of VRAM/RAM Usage

Here is the comparison between V2 and V3:

+ Render buffers: stored in VRAM (GPU) in V2; stored in VRAM (GPU) + RAM (system) in V3.

+ Textures: stored in VRAM in both versions, or in VRAM + RAM when out-of-core textures are used.

+ Geometry: stored in VRAM in both versions; the maximum triangle count is 19.6 million in V2 and 76 million in V3.



Speed

It's hard to quantify the performance impact, but what we have seen during testing is that in simple scenes (like the chess set or Cornell boxes) the old system was hard to beat. That is because in these kinds of scenes the samples of neighbouring pixels are very coherent (similar), which is something GPUs can process very fast: the CUDA threads do almost the same work and don't have to wait for each other. In these cases you usually have plenty of VRAM left, which means you can bump up "parallel samples" to the maximum, making the new system as fast or almost as fast as the old system.


The problem is that in real production scenes the execution of CUDA threads diverges very quickly, causing threads to spend a long time waiting for other threads to finish their work, i.e. twiddling thumbs. For these more complex scenes the new system usually works better, since coherency is increased by the way each step is processed, and we can optimize the kernels more because the scope of each task is much narrower. So you usually see a speed-up for complex scenes, even with the default "parallel samples" setting or a lower value (in case you are struggling with memory).
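
For readers who want to see what that divergence looks like at the CUDA level, here is a generic illustration (not Octane code): when neighbouring threads of a warp take different branches, the warp has to execute both branches one after the other while part of the threads sit idle.

    __global__ void shade(const int* materialId, float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        // Coherent scene: neighbouring pixels hit the same material, so all 32
        // threads of a warp take the same branch and run at full speed.
        // Complex scene: materials alternate from pixel to pixel, so the warp
        // runs the "diffuse" branch AND the "glossy" branch serially while the
        // threads on the other branch twiddle their thumbs.
        if (materialId[i] == 0)
            out[i] = 0.5f;   // stand-in for an expensive diffuse evaluation
        else
            out[i] = 1.0f;   // stand-in for an expensive glossy evaluation
    }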


TLDR Version

In simple scenes where you've got plenty of VRAM left: Increase "parallel samples" to the maximum.

In complex scenes where VRAM is scarce: Set "parallel samples" to the highest value that doesn't run out of memory. Rendering should usually still be faster than before, or at least roughly the same speed.


Moved Film Buffers to the Host and Tiled Rendering

The second major refactoring in the render core was the way we store render results. Until v3, each GPU had its own film buffer where part of the calculated samples were aggregated. This had various drawbacks: for example, a CUDA error usually meant that you lost the samples calculated by that GPU, and a crashing or disconnected slave meant you lost its samples. Another problem was that large images mean a large film buffer, especially if you enable render passes. And yes, deep image rendering would have been pretty much impossible since it's very, very memory hungry. And implementing save and resume would have been a pain.


To solve these issues we moved the film buffer into host memory. That doesn't sound exciting, but it has some major consequences. The biggest one is that Octane now has to deal with the huge amount of data the GPUs produce, especially in multi-GPU setups or when network rendering is used. As a solution, we introduced tiled rendering for all integration kernels except PMC (where tiled rendering is not possible). The tiles are relatively large (compared to most other renderers), and we tried to hide tile rendering as much as we could.
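
As a rough sketch of what tiled rendering into a host film buffer can look like (tile size, names and layout are made up for illustration, not Octane's implementation):

    #include <cuda_runtime.h>
    #include <vector>

    constexpr int TILE = 256;   // assumed tile edge length in pixels

    // Stand-in for the real per-tile path tracing work.
    __global__ void renderTile(float4* tile, int tileX0, int tileY0, int spp) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        tile[y * TILE + x] = make_float4(0.f, 0.f, 0.f, float(spp));
    }

    void renderImage(int width, int height, int samplesPerTile) {
        std::vector<float4> film(size_t(width) * height);     // film buffer lives in system RAM
        float4* d_tile = nullptr;
        cudaMalloc(&d_tile, TILE * TILE * sizeof(float4));    // only one tile buffer needed in VRAM

        for (int ty = 0; ty < height; ty += TILE)
            for (int tx = 0; tx < width; tx += TILE) {
                renderTile<<<dim3(TILE / 16, TILE / 16), dim3(16, 16)>>>(d_tile, tx, ty, samplesPerTile);

                // Copy the finished tile to the host and place it in the film buffer.
                std::vector<float4> h_tile(TILE * TILE);
                cudaMemcpy(h_tile.data(), d_tile, h_tile.size() * sizeof(float4),
                           cudaMemcpyDeviceToHost);
                for (int y = 0; y < TILE && ty + y < height; ++y)
                    for (int x = 0; x < TILE && tx + x < width; ++x)
                        film[size_t(ty + y) * width + (tx + x)] = h_tile[y * TILE + x];
            }
        cudaFree(d_tile);
    }

The point of the layout: per GPU only a tile-sized buffer has to live in VRAM, while the full-resolution film buffer (plus any render passes) stays in system memory.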


Of course, keeping the film buffer in system memory means more system memory usage, so make sure that you have enough RAM installed before you crank up the resolution (which is now straightforward to do). Another consequence is that the CPU has to merge render results from the various sources, like local GPUs or net render slaves, into the film buffers, which requires some computational power. We tried to optimize that area, but there is obviously an impact on CPU usage; let us know if you run into issues here. Again, increasing the "max. tile samples" option in the kernels allows you to reduce the overhead accordingly (see above). Info passes are now rendered in parallel, too, since we can reuse the same tile buffer on the GPU that is used for rendering the beauty passes.
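
Here is a sketch of what that CPU-side merge amounts to. The buffers and weighting shown are hypothetical, not Octane's actual implementation, but they show why each arriving tile costs some CPU work and why larger "max. tile samples" values mean this work happens less often:

    #include <cuda_runtime.h>   // for float3
    #include <vector>

    // One result tile coming from any source: a local GPU or a net render slave.
    struct TileResult {
        std::vector<float3> radiance;   // averaged radiance per pixel of the tile
        int samples;                    // samples/pixel contained in this result
    };

    // Merge a tile into the film buffer, weighted by the accumulated sample counts.
    void mergeTile(std::vector<float3>& film, std::vector<int>& filmSamples,
                   const TileResult& tile, const std::vector<size_t>& pixelIndices) {
        for (size_t i = 0; i < pixelIndices.size(); ++i) {
            size_t p = pixelIndices[i];
            int total = filmSamples[p] + tile.samples;
            // A tile that carries more samples ("max. tile samples") arrives less
            // often, so this loop has to run less frequently on the CPU.
            film[p].x = (film[p].x * filmSamples[p] + tile.radiance[i].x * tile.samples) / total;
            film[p].y = (film[p].y * filmSamples[p] + tile.radiance[i].y * tile.samples) / total;
            film[p].z = (film[p].z * filmSamples[p] + tile.radiance[i].z * tile.samples) / total;
            filmSamples[p] = total;
        }
    }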


Overhauled Work Distribution in Network Rendering

We also had to modify how render work is distributed to net render slaves and how their results are sent back, to make it work with the new film buffer. The biggest problem to solve was the fact that transmitting samples to the master is 1 to 2 orders of magnitude slower than generating them on the slave. The only way to solve this is to aggregate samples on the slaves and to de-couple the work distribution from the result transmission, which has the nice side effect that rendering large resolutions (like stereo GearVR cube maps) doesn't throttle slaves anymore.
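
The principle can be sketched with two threads on the slave: one keeps rendering into a local tile cache, the other drains that cache towards the master at network speed, so slow transmission never stalls the GPUs. Everything here (names, timings) is made up to illustrate the idea, not Octane's networking code:

    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <queue>
    #include <thread>

    struct CachedTile { int tileId; int samples; };   // pixel data omitted

    std::queue<CachedTile> outbox;   // tiles aggregated on the slave (costs system RAM)
    std::mutex             outboxMutex;

    void renderLoop(int tileCount) {                  // fills the cache as fast as the GPUs allow
        for (int id = 0; id < tileCount; ++id) {
            std::this_thread::sleep_for(std::chrono::milliseconds(10));    // "fast": rendering a tile
            std::lock_guard<std::mutex> lock(outboxMutex);
            outbox.push({id, 32});
        }
    }

    void transmitLoop(int tileCount) {                // drains the cache at network speed
        for (int sent = 0; sent < tileCount; ) {
            CachedTile t{};
            bool have = false;
            {
                std::lock_guard<std::mutex> lock(outboxMutex);
                if (!outbox.empty()) { t = outbox.front(); outbox.pop(); have = true; }
            }
            if (!have) { std::this_thread::sleep_for(std::chrono::milliseconds(1)); continue; }
            std::this_thread::sleep_for(std::chrono::milliseconds(100));   // "slow": network transmission
            std::printf("sent tile %d (%d samples/pixel) to master\n", t.tileId, t.samples);
            ++sent;
        }
    }

    int main() {
        std::thread render(renderLoop, 16), transmit(transmitLoop, 16);
        render.join();
        transmit.join();   // the master still has to wait for the cached backlog at the end
        return 0;
    }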


Of course, caching results on the slaves means that they require more system memory than in the past. And if the tiles rendered by a slave are distributed uniformly over the image, the slave will produce a big pile of cached tiles that needs to be transmitted to the master eventually, i.e. after all samples have been rendered, the master still needs to receive all those cached results from the slaves, which can take quite some time. To solve this problem we introduced an additional option to the kernel nodes that support tiled rendering:

"Minimize net traffic", if enabled, distributes only the same tile to the net render slaves, until the max samples/pixel has been reached for that tile and only then the next tile is distributed to slaves. Work done by local GPUs is not affected by this option. This way a slave can merge all its results into the same cached tile until the master switches to a different tile. Of course, you should set the maximum samples/pixel to something reasonable or the network rendering will focus on the first tile for a very long time.