Getting the most out of Kinect's camera

By Nicola Salmoria

June 29th 2011 at 12:00PM

Lightning Fish offers in-depth dev insight into getting the most from tracking cameras

With the recent introduction of Kinect for the Xbox 360, the industry has seen a surge of interest in camera-based player tracking. Kinect comes equipped with an impressive array of sensors, including a traditional RGB camera, an infra-red depth sensor and microphones.

A great deal can be achieved with the basic RGB camera alone, and you can also use it to enhance the depth camera information on Kinect.

This article aims to show you how to get the best results, by giving you an understanding of RGB camera noise and where it comes from.

If, for example, you want to separate out the moving objects in a scene (such as the players), a good place to start is with a per-pixel model of the non-moving background. The moving objects, or foreground, are then the pixels in the scene which don’t match your model of the background. However, camera noise means that even pixels which are in the static background can appear to change.

The better your estimate of the noise is, the more useful your background model will be (see Figure 1).

UNDERSTANDING CAMERAS

There are several stages in the process which converts the light in a scene into a camera image. First, light passes through the lens and hits the sensor, which converts it to electrical charge.

This signal is then amplified, sampled by an analogue-to-digital converter, and processed in the digital domain to apply white balance and gamma corrections.

The sensor also needs to separate the red, green and blue colour channels. In almost all modern consumer cameras this is done using something called the Bayer Filter Mosaic (see Figure 2).

A colour filter is placed over the camera sensor, so that each photosensor receives light of only one of the three primary colours, in a repeating pattern. 50 per cent of the photosensors receive green light, 25 per cent red and 25 per cent blue.

The final colour image is produced in software by a process called demosaicing, which derives the missing colour components of each pixel from the neighbouring pixels. So you could say that the effective resolution of a three-megapixel camera is really only about one megapixel.

For vision processing algorithms, demosaicing isn’t a good thing. It increases the amount of data to process without introducing any new information, and may actually degrade the original data, depending on the type of interpolation used.

Fortunately, the raw Bayer image produced by the camera is often accessible. In this case, it’s certainly worth using the raw image directly, since you will get more accurate results and use less processing power.

The one thing that you need to keep in mind is that the raw Bayer image isn’t a true greyscale image; it’s a filtered image, so you need to know which filter colour corresponds to each pixel.
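As a minimal sketch of that bookkeeping, the following maps each pixel position to its filter colour and groups raw values accordingly. It assumes an RGGB tile layout, which is only one of several layouts in use, so check your own sensor. The names `BAYER_TILE`, `filter_colour` and `split_by_filter` are hypothetical.

```python
# Assumed 2x2 tile layout (RGGB). Real sensors may use a different
# arrangement, so check the documentation for your camera.
BAYER_TILE = (("R", "G"),
              ("G", "B"))


def filter_colour(x, y, tile=BAYER_TILE):
    """Return the filter colour ('R', 'G' or 'B') covering pixel (x, y)."""
    return tile[y % 2][x % 2]


def split_by_filter(raw):
    """Group raw sensor values by the colour of the filter above them.

    raw is a list of rows, each row a list of integer sensor values.
    """
    groups = {"R": [], "G": [], "B": []}
    for y, row in enumerate(raw):
        for x, value in enumerate(row):
            groups[filter_colour(x, y)].append(value)
    return groups
```

With this layout, half of the values end up in the green group and a quarter in each of the red and blue groups, matching the proportions described above.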

UNDERSTANDING NOISE

Whether you use the raw Bayer data or the processed output from a camera driver, your image will invariably contain some random noise.

Even if the scene which the camera saw was perfectly static, and the camera was completely accurate, the number of photons hitting the sensor would still vary from one image to the next. This effect is known as photon shot noise. The camera circuitry will also add a fair amount of noise of its own.
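To get a feel for the size of this effect, here is a small sketch based on the standard result that photon arrivals follow Poisson statistics, so the standard deviation of the count equals the square root of its mean. The function names are hypothetical, and the photon counts in the note below are illustrative rather than measured.

```python
import math


def shot_noise_std(mean_photons):
    """Standard deviation of a Poisson photon count with the given mean."""
    return math.sqrt(mean_photons)


def relative_noise(mean_photons):
    """Shot noise as a fraction of the signal for the given mean count."""
    return shot_noise_std(mean_photons) / mean_photons
```

A pixel collecting 100 photons on average has a shot noise of 10 photons (10 per cent of the signal), while one collecting 10,000 has a shot noise of 100 photons (1 per cent). Brighter pixels are noisier in absolute terms but cleaner relative to their signal.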

If you are identifying moving foreground by comparing the camera input to a static background image, a good way of dealing with noise is to specify a threshold which quantifies how much the values of pixels can vary from those of the corresponding pixels in the background before they are categorised as foreground.

This prevents you from automatically identifying ‘noisy’ pixels in the background area of the image as foreground.
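A minimal sketch of this comparison, assuming single-channel frames stored as lists of rows and one fixed threshold for the whole image (the function name `foreground_mask` is hypothetical):

```python
def foreground_mask(frame, background, threshold):
    """Flag pixels that differ from the background by more than threshold.

    frame and background are lists of rows of brightness values with the
    same dimensions. The result has the same shape and holds True for
    foreground pixels and False for background pixels.
    """
    return [
        [abs(pixel - bg_pixel) > threshold
         for pixel, bg_pixel in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]
```

A single global threshold is the simplest possible choice and is used here only to illustrate the idea.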

To get the best results, the threshold should depend on the actual level of noise. A theorem in probability theory, called Chebyshev’s inequality, can be used to pick a suitable value. This theorem guarantees that in any data sample, no more than 1/k² of the samples can be more than k standard deviations away from the mean.

For example, no more than 1/9th of the true background pixels can be further than three standard deviations from the background model. The error rate is the fraction of true background pixels which will be wrongly classified as foreground. So if you know the standard deviation of the noise, you can choose an error rate you’re happy with and set the threshold to the corresponding number of standard deviations.
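The relationship can be turned around to compute the threshold directly: setting 1/k² equal to the acceptable error rate gives k = 1/√(error rate). The sketch below does exactly that; the function names are hypothetical.

```python
import math


def chebyshev_k(max_error_rate):
    """Smallest k for which Chebyshev's bound 1/k**2 is <= max_error_rate."""
    if not 0 < max_error_rate <= 1:
        raise ValueError("max_error_rate must be in (0, 1]")
    return 1.0 / math.sqrt(max_error_rate)


def noise_threshold(noise_std, max_error_rate):
    """Largest deviation from the background model still treated as noise."""
    return chebyshev_k(max_error_rate) * noise_std
```

For instance, accepting at most 1 per cent of background pixels as false foreground gives k = 10, so with a noise standard deviation of 2 brightness levels the threshold is 20 levels. Because Chebyshev’s inequality makes no assumption about the shape of the noise distribution, this is a worst-case bound, and the real error rate will usually be lower.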

If you accept a very high error rate (a low threshold), background pixels whose noise is perfectly normal for the scene will be classified as foreground. Conversely, if you insist on a very low rate (a high threshold), pixels which are actually foreground will often be categorised as background. The best choice of error rate will be somewhere in the middle.

To use this approach, you need to estimate the standard deviation of the camera noise. The first thing to do is lock the gain, white balance and exposure settings on the camera, since variations in these will cause the noise level to change.

You might expect that the next step is to find a noise level for the camera image as a whole. This isn’t the best solution, however, since noise varies across the image, and in particular noise depends on pixel brightness. The most important reasons for this are:

* Photon shot noise is proportional to the square root of the number of photons hitting the sensor
* Gamma correction in the camera enhances dark pixels, which increases their ‘noisiness’

That means you need to determine the camera noise independently for each of the possible brightness values. You also want to be able to do this robustly regardless of what happens in the scene, because you can’t expect the camera image to be perfectly static. Remarkably, this can be done in a few simple steps.

* Record the mean and variance of the brightness for every pixel in the camera image over a few frames.
* Next, group the pixels by brightness and calculate the median of the recorded variances for each of the 256 levels of each individual colour channel. This gives an estimate for the noise variance at that brightness; the median is used because it is less affected by occasional outlying values than the mean. The standard deviation is just the square root of the variance.
* Repeat the previous steps, replacing the previous estimates only when the new estimates are lower. This ensures that when the initial noise estimates are too high because there was a lot of movement in the scene, or if estimates weren’t available at all because there were no pixels of that brightness in the scene, better estimates will be adopted as soon as they become available.
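The steps above can be sketched for a single colour channel as follows. This is an illustration under stated assumptions rather than a definitive implementation: frames are lists of rows of 8-bit values, pixels are assigned to a brightness level by rounding their mean, and the population variance is used. All function names are hypothetical.

```python
from statistics import median

LEVELS = 256  # brightness levels in an 8-bit channel


def pixel_stats(frames):
    """Per-pixel (mean, variance) over a short run of single-channel frames."""
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    stats = []
    for y in range(height):
        for x in range(width):
            values = [frame[y][x] for frame in frames]
            mean = sum(values) / n
            variance = sum((v - mean) ** 2 for v in values) / n
            stats.append((mean, variance))
    return stats


def variance_by_level(stats):
    """Median variance per brightness level, or None if no pixel had it."""
    buckets = [[] for _ in range(LEVELS)]
    for mean, variance in stats:
        # Assumption: bin each pixel by its rounded mean brightness.
        level = min(LEVELS - 1, max(0, int(round(mean))))
        buckets[level].append(variance)
    return [median(bucket) if bucket else None for bucket in buckets]


def update_estimates(previous, new):
    """Keep the lower of the previous and new estimate at each level."""
    merged = []
    for old, current in zip(previous, new):
        if old is None:
            merged.append(current)
        elif current is None:
            merged.append(old)
        else:
            merged.append(min(old, current))
    return merged
```

Starting from a table of 256 `None` entries, each new batch of frames is passed through `pixel_stats` and `variance_by_level`, and the result is merged into the table with `update_estimates`. Taking the square root of an entry gives the noise standard deviation for that brightness level.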

The algorithm described above must be applied to each of the three colour channels independently, because the white balance settings on the camera will typically cause it to amplify each channel – and its corresponding noise – differently.

If you have the raw Bayer image, each pixel has a single colour so you don’t have to worry about anything else.

If you are using the RGB image, however, things are significantly more complicated. The demosaicing algorithm interpolates the values of neighbouring pixels, and does so in a way which depends on the filter colour.

Since the type of interpolation will affect the ‘noisiness’ of the pixels, you must look not only at the colour channel but also at the filter colour. So you will need a total of nine groups, one for each possible combination of colour channel and colour filter (see Figure 3).
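One way to organise the nine groups is to compute a table index from the colour channel and the filter colour at the pixel's position. The sketch below assumes an RGGB filter layout, which may not match your camera, and the names `CHANNELS`, `BAYER_TILE` and `noise_group` are hypothetical.

```python
CHANNELS = ("R", "G", "B")

# Assumed 2x2 tile layout (RGGB); check the layout of your own sensor.
BAYER_TILE = (("R", "G"),
              ("G", "B"))


def noise_group(x, y, channel, tile=BAYER_TILE):
    """Index (0 to 8) of the noise table for one channel of pixel (x, y).

    The index combines the colour channel being examined with the colour
    of the filter that physically covers the pixel.
    """
    filter_colour = tile[y % 2][x % 2]
    return CHANNELS.index(channel) * 3 + CHANNELS.index(filter_colour)
```

Each of the nine indices would then select its own table of per-brightness noise estimates.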

Tracking players reliably with a traditional RGB camera is difficult, but it can be done. You can also use these techniques to enhance the depth information using the RGB camera on the Kinect.

The first step in building a system which can achieve this is to understand the camera you’re using, and the sources of the noise in its output. Once you understand the noise, you can model it. When you have a good model of the noise, you can use that model to get more stable and accurate results from your vision system, no matter what algorithms you use to follow the moving objects in the scene.