Essential Audio Knowledge: Spectral Masking
March 23, 2017
Background noise is an unavoidable nuisance. Whether it’s produced by roaring excavators or hurried crowds at a train station, and whether it’s obscuring a voice message from your boss or a listening session with a Tchaikovsky waltz, it’s always annoying. Given urbanization trends, our environments will likely only get noisier—with more people living closer together, and the proliferation of mobile sound systems providing a whole new palette of noises and disturbances which were not present in the listening scenarios of yesteryear.
In a previous blog post, we talked about the need for dynamic range compression in mobile devices, and how we must raise the weak parts of an audio signal above the threshold of audibility so that they can be heard. That threshold is, in fact, frequency dependent, and further depends on other undesired sounds that might be present. Which leads us to the topic of today: the masking phenomenon. We’ll talk about what it is, and why it makes sense to take dynamics processing to the frequency domain.
The threshold of hearing in quiet surroundings has been experimentally determined, and is an average over many subjects. This threshold is what audiologists try to measure using tones at different levels and frequencies. The higher threshold levels at low frequencies indicate that the human ear is less sensitive there, whereas it is most sensitive in the speech region where the threshold is at its lowest.
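To put rough numbers on the shape of that curve, here is a minimal sketch using Terhardt's widely quoted closed-form approximation of the threshold in quiet. The formula is our choice for illustration, not something measured for this post, and the function name `threshold_in_quiet_db` is made up for the example. It gives an averaged curve in dB SPL and is not a substitute for an audiogram.

```python
import math

def threshold_in_quiet_db(f_hz):
    """Approximate threshold of hearing in quiet (dB SPL) at frequency f_hz.

    Terhardt's approximation, an assumption made for this sketch.
    Intended for roughly 20 Hz to 20 kHz.
    """
    k = f_hz / 1000.0  # frequency in kHz
    return (3.64 * k ** -0.8                          # rise towards low frequencies
            - 6.5 * math.exp(-0.6 * (k - 3.3) ** 2)   # dip in the most sensitive region
            + 1e-3 * k ** 4)                          # rise towards high frequencies

# Print the curve at a few frequencies to see the low-frequency rise
# and the dip in the most sensitive region.
for f in (50, 100, 500, 1000, 3500, 10000, 16000):
    print(f"{f:>6} Hz: {threshold_in_quiet_db(f):6.1f} dB SPL")
```

The middle term is what produces the dip of a few dB below 0 dB SPL in the low-kilohertz range, the region where the ear is most sensitive.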
Another important concept is that of equal loudness contours, which reveal how the level of a sine tone must vary with frequency for its perceived loudness to stay constant.
The general shape of the equal loudness contours is a hint that a music signal perceived as being tonally balanced will have a sloping spectrum with much more energy at lower frequencies. Problematically, small loudspeakers have great difficulties precisely reproducing those frequencies that the human ear needs more of.
Furthermore, it is interesting to note that the sensitivity is highest around 4 kHz, a frequency region where low-level, unvoiced components of human speech often reside. Human hearing is clearly optimized for detecting the voices of fellow humans.
But why and how can the presence of an undesired sound make a desired sound inaudible? The answer lies in the psychoacoustical phenomenon of masking. Inside the ear, behind the eardrum, is an amazingly elaborate mechanical system. The eardrum is connected via the middle ear to a liquid-filled, snail-shaped structure called the cochlea, which receives the vibrations picked up by the eardrum. The coiled canal of the cochlea is divided into two chambers separated by the so-called basilar membrane. When excited with a tonal signal of a certain frequency, the basilar membrane resonates maximally at a particular point along the membrane. That location of maximum resonance corresponds to the frequency of the tone, and the movement of the membrane triggers auditory nerves which, in turn, transmit signals to the brain, informing it of the presence of a tone at the frequency in question. The cochlea functions, in fact, as a spectrum analyzer with amazing precision and accuracy.
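The link between position on the membrane and frequency can be sketched with Greenwood's frequency-position function. This is a commonly used empirical fit that we bring in as an assumption; the constants below are the usual values quoted for the human cochlea, and the function names are our own.

```python
import math

# Greenwood's constants for the human cochlea (assumed for this sketch).
A, ALPHA, K = 165.4, 2.1, 0.88

def place_to_frequency_hz(x):
    """Frequency of maximum response at relative position x (0 = apex, 1 = base)."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie between 0 (apex) and 1 (base)")
    return A * (10.0 ** (ALPHA * x) - K)

def frequency_to_place(f_hz):
    """Inverse mapping: relative position of maximum response for f_hz."""
    return math.log10(f_hz / A + K) / ALPHA

# Low frequencies map to the apex, high frequencies to the base.
for f in (100, 1000, 4000, 10000):
    print(f"{f:>6} Hz -> {frequency_to_place(f):.2f} of the way from apex to base")
```

The mapping is roughly logarithmic over most of the audible range, so each octave occupies a similar stretch of membrane.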
Since a resonance on the basilar membrane is not perfectly localized to a single point (it is similarly impossible to make a guitar string oscillate only at a single point), a range of nerve fibers is triggered. The result is an increase in the threshold of audibility around the frequency of the tone. This means that a second tone, close in frequency to the first, would need to have a higher level in order to be heard, in comparison to a tone heard in isolation. If it were to have a level lower than the threshold, it would be masked by the first tone.
This basic concept can be extended to more complex signals, and understanding the masking principle can, for example, be used to “hide” distortion or determine the gain required to make one sound audible in the presence of another.
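As a toy illustration of tone-on-tone masking and of the gain calculation just mentioned, the sketch below spreads a masker's level over neighbouring frequencies on the Bark scale. Everything numeric here is an assumption chosen for illustration: the Bark conversion is Traunmüller's approximation, and the 10 dB offset, the 27 dB/Bark and 10 dB/Bark slopes, and the flat 0 dB quiet floor are rough placeholder values. Real masking curves depend on level and on the type of signal.

```python
def hz_to_bark(f_hz):
    """Traunmüller's approximation of the Bark scale (assumed here)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def masked_threshold_db(probe_hz, masker_hz, masker_db,
                        offset_db=10.0,       # assumed masker-to-threshold offset
                        lower_slope=27.0,     # assumed dB/Bark below the masker
                        upper_slope=10.0,     # assumed dB/Bark above the masker
                        quiet_floor_db=0.0):  # simplified flat threshold in quiet
    """Level a probe tone must exceed to be heard next to a single masker tone."""
    dz = hz_to_bark(probe_hz) - hz_to_bark(masker_hz)
    slope = upper_slope if dz >= 0 else lower_slope
    raised = masker_db - offset_db - slope * abs(dz)
    return max(raised, quiet_floor_db)

def required_gain_db(probe_hz, probe_db, masker_hz, masker_db):
    """Gain (dB) needed to lift the probe up to the masked threshold, 0 if audible."""
    return max(0.0, masked_threshold_db(probe_hz, masker_hz, masker_db) - probe_db)

# A 70 dB masker at 1 kHz and 40 dB probe tones at several frequencies.
for probe in (500, 900, 1100, 2000, 4000):
    gain = required_gain_db(probe, 40.0, 1000.0, 70.0)
    print(f"probe at {probe:>5} Hz needs {gain:5.1f} dB of gain")
```

With these placeholder slopes the masking spreads further upwards in frequency than downwards, so probes just above the masker need more gain than probes the same distance below it.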
Dynamic range processing can be taken to another level of flexibility and sophistication by employing a frequency-domain processing framework. This can be done by emulating the time/frequency analysis carried out by the human ear; for example, by using a sliding window FFT (Short-Time Fourier Transform) to analyze the signal and calculate an optimal frequency-dependent gain.
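A minimal sketch of such a framework, using NumPy, might look like the following. It is not Dirac's algorithm. We assume the background noise is available as a time-aligned array (in a real device it would have to be estimated, for example from a microphone), and the function name `unmask`, the 3 dB margin, and the 12 dB gain limit are all invented for the example. Each STFT bin of the signal is compared with the same bin of the noise and boosted just enough to clear it, with no spreading of masking across bins and no smoothing over time.

```python
import numpy as np

def unmask(signal, noise, frame=512, margin_db=3.0, max_gain_db=12.0):
    """Boost STFT bins of `signal` that lie below `noise` + margin (toy model)."""
    hop = frame // 2
    # Square-root periodic Hann window, used for analysis and synthesis.
    # The squared windows sum to 1 at 50 % overlap, so with 0 dB gain
    # the overlap-add reconstructs the input.
    win = np.sqrt(0.5 - 0.5 * np.cos(2.0 * np.pi * np.arange(frame) / frame))
    n = min(len(signal), len(noise))

    def pad(x):
        # Zero-pad so every sample is covered by two overlapping frames.
        return np.concatenate([np.zeros(hop), np.asarray(x[:n], dtype=float),
                               np.zeros(frame)])

    s, d = pad(signal), pad(noise)
    out = np.zeros_like(s)
    eps = 1e-12  # avoids log10(0)
    for start in range(0, len(s) - frame + 1, hop):
        S = np.fft.rfft(win * s[start:start + frame])
        D = np.fft.rfft(win * d[start:start + frame])
        level_s = 20.0 * np.log10(np.abs(S) + eps)
        level_d = 20.0 * np.log10(np.abs(D) + eps)
        # Frequency-dependent gain: only ever boost, and never beyond the limit.
        gain_db = np.clip(level_d + margin_db - level_s, 0.0, max_gain_db)
        S *= 10.0 ** (gain_db / 20.0)
        out[start:start + frame] += win * np.fft.irfft(S, frame)
    return out[hop:hop + n]

# Usage: a quiet 1 kHz tone in much louder broadband noise (fs = 16 kHz).
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
tone = 0.01 * np.sin(2.0 * np.pi * 1000.0 * t)
boosted = unmask(tone, 0.5 * rng.standard_normal(tone.size))
```

Because the gain is computed independently per bin and per frame, it can fluctuate quickly. A practical system would smooth it over time and frequency, and this sketch leaves that out on purpose.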
To apply just the right amount of gain at different frequencies, more or less sophisticated models of human hearing, loudspeakers, and background noise can be integrated. Fortunately for the users of small loudspeakers, there is plenty that can be done to optimize any signal for playback devices and the playback environment. Unmasking the music using frequency-dependent dynamic range compression is just one of many possibilities…
– Mattias Arlbrant, Senior Research Engineer at Dirac Research