# Depth Precision

In a previous post I discussed the use of reciprocal depth $1/z$ in depth buffer generation. I gave some figures showing the problematic hyperbolic distribution of depth values in the depth buffer and its dependence on the near plane.

Let’s expand a bit on that and investigate strategies to better distribute depth values. In the following I will be using a right handed coordinate system, i.e. $-z$ points forward and (as usual) vectors are multiplied from the right. View space depth is denoted $z$ and the near/far planes are $z_n$ and $z_f$.

## Standard Depth

The standard DirectX projection matrix $\mathbf{P}$, as produced by D3DXMatrixPerspectiveFovRH, transforms view space positions $\mathbf{v} = (x, y, z)$ into clip space positions $\mathbf{v'}$

$\mathbf{v'} = \mathbf{P} \mathbf{v} = \begin{pmatrix}s_x & 0 & 0 & 0 \\ 0 & s_y & 0 & 0 \\ 0 & 0 & \frac{z_f}{z_n-z_f} & \frac{z_n z_f}{z_n-z_f} \\ 0 & 0 & -1 & 0\end{pmatrix} \mathbf{v}$

and results in depth buffer values

$z'=\frac{\frac{z_f}{z_n-z_f} z + \frac{z_n z_f}{z_n-z_f}}{-z}$

As shown before, this can cause a significant warp of the resulting depth values due to the division by $z$.
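To get a feel for this warp, here is a small Python sketch that evaluates the mapping above for a few view space distances; the near and far plane values are arbitrary placeholders:

```python
# Sketch: evaluate the standard depth mapping for a few view space
# distances d (right handed, so view space z = -d). zn/zf are arbitrary.
def standard_depth(d, zn, zf):
    z = -d
    a = zf / (zn - zf)       # P[2][2]
    b = zn * zf / (zn - zf)  # P[2][3]
    return (a * z + b) / -z  # perspective divide by w = -z

zn, zf = 0.1, 100.0
for d in (zn, 1.0, 10.0, zf):
    print(f"d = {d:5.1f}  ->  z' = {standard_depth(d, zn, zf):.6f}")
```

With $z_n = 0.1$ and $z_f = 100$, the first meter alone already consumes about 90% of the $[0,1]$ output range.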

## Reverse Depth

Reverse depth aims to better distribute depth values by reversing clip space: Instead of mapping $[z_n,z_f] \mapsto [0,1]$, the projection matrix is adjusted to produce $[z_n,z_f] \mapsto [1,0]$. This can be achieved by multiplying the projection matrix with a simple ‘z reversal’ matrix, yielding

$\mathbf{v'} = \begin{pmatrix}1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -1 & 1 \\ 0 & 0 & 1 & 0\end{pmatrix} \mathbf{P} \mathbf{v} = \begin{pmatrix}s_x & 0 & 0 & 0 \\ 0 & s_y & 0 & 0 \\ 0 & 0 & -\frac{z_f}{z_n-z_f}-1& -\frac{z_n z_f}{z_n-z_f} \\ 0 & 0 & -1 & 0\end{pmatrix} \mathbf{v}$

The advantage of this mapping is that it’s much better suited to storing depth values in floating point format: Close to the near plane, where similar view depth values are pushed far apart by the hyperbolic depth distribution, not much floating point precision is required. It is thus safe to map these values to the vicinity of 1.0 where the floating point exponent is ‘locked’. Similar values near the far plane on the other hand are compressed to even closer clip space values and thus benefit from the extremely precise range around 0.0. Interestingly this results in the least precision in the middle area of the view space depth range.
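The spacing argument can be checked directly. The following sketch steps a float32 value by one ulp via its bit pattern (stdlib only, no GPU involved):

```python
import struct

# Sketch: spacing between adjacent float32 values near 1.0 vs. near 0.0001,
# illustrating why reversing the depth range suits floating point buffers.
def next_up_f32(x):
    # reinterpret the float32 bit pattern and step to the next larger value
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    return struct.unpack('<f', struct.pack('<I', bits + 1))[0]

def spacing_f32(x):
    return next_up_f32(x) - struct.unpack('<f', struct.pack('<f', x))[0]

print(spacing_f32(1.0))     # ~1.19e-07: coarse, the exponent is 'locked' at 1.0
print(spacing_f32(0.0001))  # ~7.28e-12: vastly finer resolution near 0.0
```

Values stored near 1.0 resolve to roughly $10^{-7}$, while values near $10^{-4}$ resolve to roughly $10^{-11}$, which is why mapping the far plane (where clip space values are densest) to 0.0 pays off.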

## Linear Depth (i.e. W-Buffer)

As the name already suggests, the idea here is to write out the depth value itself, normalized to $[0, 1]$ range:

$z' = \frac{-z - z_n}{z_f - z_n}$

Unfortunately, without explicit hardware support, this method causes significant performance overhead as it requires depth export from pixel shaders.
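As a sketch, the normalized mapping looks like this (arbitrary near/far values; view space $z$ is negative in front of the camera):

```python
# Sketch: linear depth simply normalizes view space distance to [0, 1].
def linear_depth(z, zn, zf):
    return (-z - zn) / (zf - zn)  # -z is the distance in front of the camera

zn, zf = 0.1, 100.0
print(linear_depth(-zn, zn, zf))  # 0.0 at the near plane
print(linear_depth(-zf, zn, zf))  # 1.0 at the far plane
```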

## Asymptotic behaviour

For situations where extremely large view distances are required, one can let $z_f$ approach infinity. For standard depth we get

$\lim \limits_{z_f \to \infty} \begin{pmatrix}s_x & 0 & 0 & 0 \\ 0 & s_y & 0 & 0 \\ 0 & 0 & \frac{z_f}{z_n-z_f} & \frac{z_n z_f}{z_n-z_f} \\ 0 & 0 & -1 & 0\end{pmatrix} = \begin{pmatrix}s_x & 0 & 0 & 0 \\ 0 & s_y & 0 & 0 \\ 0 & 0 & -1 & -z_n \\ 0 & 0 & -1 & 0\end{pmatrix}$

and for reverse depth

$\lim \limits_{z_f \to \infty} \begin{pmatrix}s_x & 0 & 0 & 0 \\ 0 & s_y & 0 & 0 \\ 0 & 0 & -\frac{z_f}{z_n-z_f}-1& -\frac{z_n z_f}{z_n-z_f} \\ 0 & 0 & -1 & 0\end{pmatrix} = \begin{pmatrix}s_x & 0 & 0 & 0 \\ 0 & s_y & 0 & 0 \\ 0 & 0 & 0 & z_n \\ 0 & 0 & -1 & 0\end{pmatrix}$

Note how the matrices are suddenly free of approximate numerical computations, resulting in fewer rounding and truncation errors.
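The convergence is easy to verify numerically. This sketch evaluates the z-row entries of the reverse depth projection for growing far plane distances:

```python
# Sketch: the z-row coefficients of the reverse projection converge to
# (0, zn) as zf grows, matching the infinite far plane matrix.
def reverse_z_row(zn, zf):
    a = -zf / (zn - zf) - 1.0  # third-row z coefficient
    b = -zn * zf / (zn - zf)   # third-row w coefficient
    return a, b

zn = 0.1
for zf in (1e3, 1e6, 1e9):
    print(reverse_z_row(zn, zf))
```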

## Precision Analysis

With all these possibilities, which one will fit best for a given scenario? I try to answer this question by comparing depth resolution for each projection type. The idea is simple: view space depth is sampled in regular intervals between $z_n$ and $z_f$. Each sampled depth value is transformed into clip space and then into the selected buffer format. The next adjacent buffer value is then projected back into view space. This gives the minimum view space distance two objects need to be apart so they don't map to the same buffer value, hence the minimum separation required to avoid z-fighting.
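A minimal Python version of this procedure, here for standard depth stored in a 24-bit integer buffer (the plane distances are placeholders, not the exact setup used for the graph):

```python
# Sketch: depth resolution for standard depth in a 24-bit integer buffer.
# Project, quantize, step one buffer value, unproject; the difference is the
# minimum view space separation that avoids z-fighting at that distance.
def min_separation_int24(d, zn, zf):
    a = zf / (zn - zf)
    b = zn * zf / (zn - zf)
    scale = (1 << 24) - 1
    zq = round((-a + b / d) * scale)           # z' = -a + b/d, quantized
    unproject = lambda q: b / (q / scale + a)  # inverse of the projection
    return abs(unproject(zq + 1) - unproject(zq))

zn, zf = 0.1, 1000.0
for d in (1.0, 10.0, 100.0, 1000.0):
    print(f"d = {d:6.1f}  ->  min separation = {min_separation_int24(d, zn, zf):.3g}")
```

As expected from the hyperbolic mapping, the resolvable separation grows roughly with the square of the distance.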

The following graph overlays depth resolution for the different projection methods. Click the legend on top of the graph to toggle individual curves. The slider on the bottom lets you restrict the displayed depth range. The zoom button resamples the graph for the current depth range. Use the reset button to reset the depth range.

(Interactive graph: depth resolution over the configured depth range, with controls for near plane, far plane, and buffer type: Float, Int24 or Int16.)

## Discussion

Results depend a lot on the buffer type. For floating point buffers, reverse depth clearly beats the other candidates: it has the lowest error rates and is impressively stable w.r.t. extremely close near planes. As the distance between the near and far plane increases, $z_n-z_f$ drops more and more mantissa bits of $z_n$, effectively making the projection converge gracefully to the reverse infinite far plane projection. On top of that, on the AMD GCN architecture floating point depth comes at the same cost as 24-bit integer depth.

With integer buffers, linear Z would be the method of choice if it didn't entail the massive performance overhead. Reverse depth seems to perform slightly better than standard depth, especially towards the far plane, but both methods share the sensitivity to small near plane values.

## Reverse depth on OpenGL

In order to get proper reverse depth working in OpenGL, one needs to work around OpenGL's definition of $[-1,1]$ clip space. This can be done either via the extension ARB_clip_control or via a combination of glDepthRangedNV and a custom far clipping plane. The result should then be the same as reverse depth on DirectX. I included the Reversed OpenGL curve to visualise the depth resolution if clip space is kept at $[-1,1]$ and the regular window transform glDepthRange(0,1) is used to transform clip space values into buffer values.

# Gaussian Kernel Calculator

Did you ever wonder how some algorithm would perform with a slightly different Gaussian blur kernel? Well, then this page might come in handy: just enter the desired standard deviation $\sigma$ and the kernel size $n$ (all units in pixels) and press the “Calculate Kernel” button. You’ll get the corresponding kernel weights for use in a one-pass or two-pass blur algorithm in two neat tables below.


## One dimensional Kernel

This kernel is useful for a two pass algorithm: First perform a horizontal blur with the weights below and then perform a vertical blur on the resulting image (or vice versa).

 0.06136 0.24477 0.38774 0.24477 0.06136

## Two dimensional Kernel

The weights below can be used directly in a single pass blur algorithm: $n^2$ samples per pixel.

0.003765  0.015019  0.023792  0.015019  0.003765
0.015019  0.059912  0.094907  0.059912  0.015019
0.023792  0.094907  0.150342  0.094907  0.023792
0.015019  0.059912  0.094907  0.059912  0.015019
0.003765  0.015019  0.023792  0.015019  0.003765

## Analysis & Implementation Details

Below you can find a plot of the continuous distribution function and the discrete kernel approximation. One thing to look out for is the tails of the distribution vs. the kernel support: for the current configuration we have 1.24% of the curve’s area outside the discrete kernel. Note that the weights are renormalized such that the sum of all weights is one. In other words: the probability mass outside the discrete kernel is redistributed evenly to all pixels within the kernel.

The weights are calculated by numerical integration of the continuous Gaussian distribution over each discrete kernel tap. Take a look at the JavaScript source in case you are interested.
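For reference, the same integration can be sketched in a few lines of Python using the error function; with $\sigma = 1$ and $n = 5$ it reproduces the one dimensional table above:

```python
import math

# Sketch: integrate the continuous Gaussian over each 1-pixel-wide tap,
# then renormalize so the weights sum to one (redistributing the tails).
def gaussian_kernel_1d(sigma, n):
    cdf = lambda x: 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))
    half = n // 2
    w = [cdf(i + 0.5) - cdf(i - 0.5) for i in range(-half, half + 1)]
    total = sum(w)  # probability mass inside the kernel support
    return [x / total for x in w]

print([round(w, 5) for w in gaussian_kernel_1d(1.0, 5)])
# -> [0.06136, 0.24477, 0.38774, 0.24477, 0.06136]
```

The two dimensional kernel is simply the outer product of this vector with itself; e.g. the corner weight $0.06136^2 \approx 0.003765$ matches the 2D table above.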

# Linear Depth

Something that seems to come up again and again is the topic of linear vs. non-linear depth. If you take a look at the standard DirectX projection matrix and do the math for the $z$ component, you’ll end up with something like this

$z' = \frac{z_f}{z_f - z_n} (1 - \frac{z_n}{z})$

where $z$ is the depth value before projection, $z'$ is the depth value after projection and $z_n$, $z_f$ correspond to the near and far planes. So projection actually transforms $z$ into some variation of $1/z$. The reason for this is simple: GPUs rasterize primitives in screen space and interpolate attribute data linearly in screen space as well. Linear depth $z$ in view space, however, becomes non-linear after projection and thus cannot be correctly interpolated by simple linear interpolators. $1/z$, on the other hand, turns out to be linear in screen space. This is actually quite easy to see: assume a plane in view space

$Ax + By + Cz = D$

Perspective projection transforms view space x and y coordinates to

$x' = \frac{x}{z}, \qquad y' = \frac{y}{z}$

Inserting these equations into the original plane equation yields

$A x' z + B y' z + C z = D$

which gives us

$\frac{1}{z} = \frac{A}{D} x' + \frac{B}{D} y' + \frac{C}{D}$

clearly showing that $1/z$ is a linear function of screen space $x'$ and $y'$. This is illustrated quite nicely in this blog post by rendering ddx(z') and ddy(z') as color to the screen. The same holds for other generic attributes like texture coordinates: The GPU cannot directly interpolate $u$ and $v$, but will interpolate $u/z$ and $v/z$ instead. The attribute value will then be reconstructed per pixel by multiplying by $z$.
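This can also be checked numerically. The following sketch samples a few points on an arbitrary plane and confirms that $1/z$ equals the affine expression in $x'$ and $y'$ derived above:

```python
# Numeric check: for points on the plane 2x + 3y + z = 10, the value 1/z is
# an affine function of x' = x/z, y' = y/z with coefficients A/D, B/D, C/D.
A, B, C, D = 2.0, 3.0, 1.0, 10.0
for (x, y) in [(1.0, 2.0), (-3.0, 0.5), (4.0, -1.0)]:
    z = (D - A * x - B * y) / C        # solve the plane equation for z
    xp, yp = x / z, y / z              # projected screen space coordinates
    lhs = 1.0 / z
    rhs = (A / D) * xp + (B / D) * yp + C / D
    assert abs(lhs - rhs) < 1e-12
print("1/z is affine in screen space for all sampled points")
```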

# Depth Precision

Now that we have established that the value that ends up in the depth buffer is not the depth but rather something related to $1/z$, one might ask what kind of effect this will have on depth precision. After all, $1/z$ is a highly non-linear function that will significantly warp the original depth values. Check out the graph below: I plotted the resulting $z'$ for the view space depth range $z \in [0, 100]$ for different near plane values $z_n$. Notice how steep the function is over the first couple of meters: almost the entire interval $z' \in [0, 0.99]$ is spent there.

In order to test this result empirically I wrote a small program that samples the range $z \in [z_n, z_f]$ in regular intervals on the GPU, calculates the depth value $z'$ after projection and writes it to a depth buffer of choice. The buffer is then read back to the CPU and view space depth is reconstructed for each sample. This allows us to calculate the error between the original and the reconstructed depth value. Here are the results for the formats DXGI_FORMAT_D16_UNORM and DXGI_FORMAT_D32_FLOAT with the following configuration: $z_n = 0.1$, $z_f = 10000$.

Note how the error for DXGI_FORMAT_D16_UNORM quickly reaches ridiculous proportions; 16 bit integer depth in combination with a projective transform is definitely a no-go! Here’s another plot to illustrate the error of DXGI_FORMAT_D32_FLOAT in more detail: much better, though at the extremes we still get an error of over 100 meters. With some care, though, this can be greatly reduced: the shape of the hyperbolic $z'$ curve is largely determined by the near plane distance $z_n$. Even a slight change from $z_n=0.1$ to $z_n=0.25$ reduces the maximal error from $1.4\%$ down to $0.26\%$.
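The quantization part of this experiment is easy to reproduce on the CPU. This sketch (same $z_n$ and $z_f$ as above, buffer modelled as a plain 16-bit UNORM value) round-trips a few distances:

```python
# Sketch: round-trip view space distances through a 16-bit UNORM depth
# buffer (zn = 0.1, zf = 10000, as in the experiment) and measure the error.
zn, zf = 0.1, 10000.0
a = zf / (zn - zf)
b = zn * zf / (zn - zf)

def roundtrip_error_d16(d):
    zq = round((-a + b / d) * 65535) / 65535  # project and quantize to 16 bit
    return abs(b / (zq + a) - d)              # unproject and compare

for d in (10.0, 100.0, 1000.0, 5000.0):
    print(f"d = {d:6.1f}  ->  error = {roundtrip_error_d16(d):.4g}")
```

Close to the camera the error is in the millimeter range; towards the far plane it grows to hundreds of meters.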

I also tested DXGI_FORMAT_D24_UNORM_S8_UINT but the results were so close to DXGI_FORMAT_D32_FLOAT that I can only conclude that the driver internally maps the depth format to 32 bit float. Not that much of a surprise, this is exactly what the AMD GCN architecture does as well.

# Practical Considerations

• First of all: make sure that your near plane is as far away from the camera as you can afford. This flattens the hyperbolic $1/z$ curve and provides much better depth precision far away from the viewer.
• Unless you are in some crazy setting with hundreds of kilometers view distance and you are going for sub centimeter depth resolution, DXGI_FORMAT_D32_FLOAT should be good enough and on modern GPUs should come at no additional cost compared to DXGI_FORMAT_D24_UNORM_S8_UINT.
• DXGI_FORMAT_D16_UNORM isn’t really a choice for projective transforms. It can be quite valuable for orthographic projections though (for example sun shadow maps), reducing bandwidth by half compared to a 32 bit format.

# Linear Depth

And if you really, really need linear depth you can write it via the SV_Depth semantic in the pixel shader. Beware though: you’ll lose early Z rejection unless you use the variants SV_DepthGreaterEqual or SV_DepthLessEqual. Check out this blog post for more details. In most cases though, I would argue that non-linear depth is just fine.

This time, let’s look at how we can add some smoothness to our shadows. There exists a variety of techniques to do so, the most popular being percentage closer filtering (PCF). The idea is simple: instead of sampling the shadow map only once and getting a binary result (shadow or no shadow), we also sample the surrounding shadow map texels and average the resulting shadow/no shadow decisions like so:

const int halfkernelWidth = 2;

float result = 0;
for (int y = -halfkernelWidth; y <= halfkernelWidth; ++y)
{
    for (int x = -halfkernelWidth; x <= halfkernelWidth; ++x)
    {
        // offset the shadow map coordinates by one texel per step
        const float4 texCoords = float4(splitInfo.TexCoords + float2(x, y) / ShadowMapSize, 0, 0);
        // binary shadow test: 1 if the receiver lies behind the stored depth
        result += (splitInfo.LightSpaceDepth < tex2Dlod(ShadowMapSampler, texCoords).r);
    }
}
// average the yes/no decisions
result /= ((halfkernelWidth * 2 + 1) * (halfkernelWidth * 2 + 1));


Note that there exist a couple of more efficient and better looking sampling patterns, but for the sake of simplicity I’m sticking with the above pattern. In any case, the averaging of yes/no decisions yields a result that can only take on a set of discrete values (25 for the above example), which means that our shadow levels will be more or less heavily quantized. Check out the image below: on the left the shadowed scene, and on the right a closeup of a shadow border with the above PCF filter kernel and the corresponding pixel value histogram. You can clearly see the shadow value quantization in the image as well as in the pixel color histogram.

While this method produces good (with some tweaking, excellent) results, it is rather sample hungry: in the above case we need a total of 25 samples for each shadow lookup, which is just too much for less powerful graphics hardware. So let’s look at the alternatives. Instead of using a regular grid sampling pattern, we can switch to a stochastic one: pick a set of sampling positions from a disk centered at the current pixel and sample only those. A commonly used sampling pattern in this context is the Poisson disk: its sampling positions are chosen uniformly at random, but subject to a distance constraint. It can be shown that this sampling pattern has some highly desirable properties, like most of its energy being concentrated in the high frequencies. In the images below you can see the filter kernel on the left (12 taps) and the resulting shadow border on the right.
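For illustration, here is a naive dart-throwing Poisson disk generator in Python. The tap count and distance constraint are made-up example values, not the exact kernel used in the screenshots:

```python
import math
import random

# Sketch: naive dart-throwing Poisson disk sampler on the unit disk.
def poisson_disk(n, min_dist, rng, max_tries=100000):
    samples = []
    for _ in range(max_tries):
        if len(samples) == n:
            break
        # uniform candidate position on the unit disk
        r, phi = math.sqrt(rng.random()), rng.random() * 2.0 * math.pi
        candidate = (r * math.cos(phi), r * math.sin(phi))
        # keep it only if the distance constraint holds against all samples
        if all(math.dist(candidate, s) >= min_dist for s in samples):
            samples.append(candidate)
    return samples

taps = poisson_disk(12, 0.3, random.Random(42))
print(len(taps))
```

Dart throwing is fine for an offline precomputation like this; the resulting positions would be baked into the shader constant holding the kernel.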

Notice how the shadow border became much more concentrated and somewhat less blocky. Still not quite satisfying though. So let’s apply another approach, proposed by [Mittring07]: apply a random rotation to the filter kernel before sampling. This changes the sampling positions from pixel to pixel, reducing artifacts between neighbouring pixels. Note that randomness in shaders is a tricky issue: each pixel is shaded independently of the surrounding pixels, so traditional random number generators that depend on the previously drawn random number cannot be used here. Usually it is also a requirement that the generated random number be stable with respect to the camera: you want to avoid changing your random number, and with it your shading result, when the camera stands still; otherwise you might perceive flickering even in still scenes. Ideally your random number should not change even when the camera moves. I chose to precalculate a set of random numbers at application start and upload them via a texture. The texture is then indexed via the fragment’s world space position, which guarantees stability with a still standing camera. Since each random number represents a rotation angle $\phi$ and all we do is rotate the filter kernel, I directly store $\cos(\phi)$ and $\sin(\phi)$ in the texture, so we only need a multiply-add in the fragment shader.

// generate a volume texture for our random numbers
mRandomTexture3D = new Texture3D(mGraphicsDevice, 32, 32, 32, false, SurfaceFormat.Rg32);

// fill with cos/sin of random rotation angles, mapped from [-1, 1] to [0, UInt16.MaxValue]
Func<int, IEnumerable<UInt16>> randomRotations = (count) =>
{
    return Enumerable
        .Range(0, count)
        .Select(i => (float)(random.NextDouble() * Math.PI * 2))
        .SelectMany(r => new[] { Math.Cos(r), Math.Sin(r) })
        .Select(v => (UInt16)((v * 0.5 + 0.5) * UInt16.MaxValue));
};

mRandomTexture3D.SetData(randomRotations(mRandomTexture3D.Width * mRandomTexture3D.Height * mRandomTexture3D.Depth).ToArray());


The fragment shader then looks up the rotation values based on the fragment’s world position and rotates the poisson disk before doing the shadow map look-ups:

// get random kernel rotation (cos, sin) from texture, based on fragment world position
float2 randomValues = tex3Dlod(RandomSampler3D, randomTexCoord3D).rg;
float2 rotation = randomValues * 2 - 1;

float result = 0;
for (int s = 0; s < numSamples; ++s)
{
    // rotate the poisson disk sample position
    float2 poissonOffset = float2(
        rotation.x * PoissonKernel[s].x - rotation.y * PoissonKernel[s].y,
        rotation.y * PoissonKernel[s].x + rotation.x * PoissonKernel[s].y);

    const float4 randomizedTexCoords = float4(splitInfo.TexCoords + poissonOffset * PoissonKernelScale[splitInfo.SplitIndex], 0, 0);
    result += splitInfo.LightSpaceDepth < tex2Dlod(ShadowMapSampler, randomizedTexCoords).r;
}

// normalize yes/no decisions and combine with the ndotl term
float shadowFactor = result / numSamples * t;


And here are the results:

To the eye, the transition between shadow and no shadow looks much smoother than with the 5×5 block kernel, even though the shadow levels are even more quantized: check the histogram, there are only 12 different shadow levels. The scene itself doesn’t necessarily look better in total, but once textures and lighting are added things change drastically: the ‘noise’ mostly disappears into the surface detail and is much less noticeable, except on very uniformly shaded surfaces.

Note that I held back one detail: how do we calculate the ndotl-based term t? Up to now we set t = (ndotl > 0), but this doesn’t work well together with the stochastic sampling approach shown above: (ndotl > 0) produces hard shadow edges, whereas the rotated Poisson disk gives us a dithered-looking fall-off. Combining those two looks quite weird, as shown below. So we need to dither ndotl as well:

float l = saturate(smoothstep(0, 0.2, ndotl));
float t = smoothstep(randomValues.x * 0.5, 1.0f, l);


So instead of creating a hard edge by computing ndotl > 0, we use the smoothstep function to create a smooth transition between 0 and 1 when ndotl lies in the range $[0, 0.2]$ and store it in the variable l. We then feed our random value randomValues.x and l into the smoothstep function again. If l lies in the range $[\text{randomValues.x} \cdot 0.5, 1]$ it will be smoothly interpolated between 0 and 1, based on how close it is to either edge. Since l represents a smooth transition between 0 (shadow) and 1 (no shadow) and randomValues.x (ideally) follows a uniform distribution, the further l moves away from shadow, the more likely t will receive the value 1 (no shadow) too. Conversely, the closer l gets to the shadow border, the more likely t will be 0 (shadow). There is still a random factor in there, which can keep t near 0 (shadow) even when l is already well away from the shadow border, but this becomes increasingly unlikely. The two images below illustrate the effect of dithering on ndotl: on the left the scene is rendered without, and on the right with, ndotl dithering.
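The same logic, transcribed to Python for clarity (smoothstep follows the HLSL definition; the inputs are example values):

```python
# Sketch: the dithered ndotl term in Python; smoothstep as defined in HLSL.
def smoothstep(edge0, edge1, x):
    t = min(max((x - edge0) / (edge1 - edge0), 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)

def dithered_term(ndotl, rnd):
    l = smoothstep(0.0, 0.2, ndotl)  # soft transition near the terminator
    return smoothstep(rnd * 0.5, 1.0, l)

print(dithered_term(0.0, 0.7))  # 0.0: always shadowed, whatever the random value
print(dithered_term(0.5, 0.7))  # 1.0: always lit once l reaches 1
```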

Stay tuned for some more images and videos in the next post!