Introduction and Background
The ability to locate sound sources through sonic cues is of critical importance to many animals. Even though humans do not hunt for food by sound alone as some animals do, we demonstrate a reasonable ability to localize sounds. The importance of this ability has long been recognized, and researchers began investigating the sonic cues that we use for localization hundreds of years ago. The topic has received increased attention over the last several years with the development of more advanced computers, sensitive measurement devices, and a growing interest in virtual environments.
One of the initial theories for explaining sound source localization has been labeled the "duplex theory". This duplex theory attributes our localization abilities to two different interaural cues: interaural time differences (ITD) and interaural level differences (ILD).
Figure 1: The distance from the sound source to each ear is different, so the sound takes longer to reach the far ear. This delay helps humans perceive the spatial location of the sound.
Because the two ears receive input from the same sound source at different positions, the sound usually has to travel farther to reach one ear than the other. Additionally, the ear that is farther from the sound source will often not have a direct path for the sound wave to follow because the head is in the way, as shown in Figure 1. As a result, the far ear receives a copy of the sound slightly later than the near ear, creating an interaural time difference (ITD). The sound wave reaching the far ear is also attenuated compared to the near ear, creating an interaural level difference (ILD).
We generally perceive a sound source to be originating closer to the ear where it arrives earliest and with the greatest amplitude. If we vary the azimuth of the sound source (its lateral angle measured relative to straight ahead of the listener), the ITDs and ILDs vary as well.
If the head were not in the way, it would be a trivial matter to calculate the additional distance the sound has to travel to reach the far ear, and from that the appropriate ITD and ILD. The presence of the head complicates the problem significantly, however, because there may be no direct path from the source to the far ear. Instead, one has to account for waves that reach the ear after propagating along the surface of an irregular object.
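The head-free calculation mentioned above reduces to plane geometry. The following is a minimal sketch under stated assumptions: two point ears 0.175 m apart, a speed of sound of 343 m/s, and level falling off as the inverse of distance. The function name `free_field_cues` and all constants are illustrative choices, not values from the original text.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s, assumed value for air at room temperature
EAR_SPACING = 0.175      # m, assumed distance between the two ears

def free_field_cues(azimuth_deg, distance_m):
    """Return (itd_seconds, ild_db) for two point ears with no head between them.

    Azimuth is measured from straight ahead, positive toward the right ear.
    """
    az = math.radians(azimuth_deg)
    # Source position in the horizontal plane: x to the right, y straight ahead.
    sx, sy = distance_m * math.sin(az), distance_m * math.cos(az)
    half = EAR_SPACING / 2.0
    d_left = math.hypot(sx + half, sy)    # left ear sits at (-half, 0)
    d_right = math.hypot(sx - half, sy)   # right ear sits at (+half, 0)
    # Positive ITD means the sound reaches the right ear first.
    itd = (d_left - d_right) / SPEED_OF_SOUND
    # Spherical spreading: pressure amplitude falls off as 1/distance,
    # so a positive ILD means the right ear receives the louder signal.
    ild_db = 20.0 * math.log10(d_left / d_right)
    return itd, ild_db

for az in (0, 30, 60, 90):
    itd, ild = free_field_cues(az, distance_m=2.0)
    print(f"azimuth {az:3d} deg: ITD = {itd * 1e6:7.1f} us, ILD = {ild:5.2f} dB")
```

With the source 2 m away, the level difference from distance alone stays below 1 dB, which suggests that most of the ILD heard in practice comes from the head itself rather than from the extra path length.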
Lord Rayleigh first proposed a solution to this problem by modeling the head as a sphere and solving the equations for wave propagation around this rigid sphere. His solution highlighted some interesting aspects of the ITDs and ILDs.
First, ILDs are strongly frequency dependent. Low frequencies are barely shadowed by the head because their wavelengths are greater than the diameter of the head. Frequencies above roughly 1500 Hz have wavelengths comparable to or smaller than the diameter of the head and can be attenuated a great deal. It follows that high-frequency components are much more important to us when evaluating ILDs, because they are the components most affected by head shadowing, which in turn depends on the location of the sound source.
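The comparison between wavelength and head size can be checked with a few lines of arithmetic. This sketch assumes a speed of sound of 343 m/s and a head diameter of 0.175 m; both values, and the function names, are illustrative assumptions rather than figures from the original text.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature
HEAD_DIAMETER = 0.175   # m, assumed typical adult head width

def wavelength(frequency_hz):
    """Wavelength in metres of a tone at the given frequency."""
    return SPEED_OF_SOUND / frequency_hz

def crossover_frequency(diameter_m=HEAD_DIAMETER):
    """Frequency whose wavelength equals the given diameter."""
    return SPEED_OF_SOUND / diameter_m

for f in (200, 500, 1500, 4000, 8000):
    lam = wavelength(f)
    print(f"{f:5d} Hz: wavelength = {lam:.3f} m "
          f"({lam / HEAD_DIAMETER:.2f} head diameters)")

print(f"wavelength equals head diameter near {crossover_frequency():.0f} Hz")
```

With these assumed values the wavelength matches the head diameter somewhat above 1500 Hz; the exact crossover shifts with the head size chosen, so the 1500 Hz figure is best read as an order-of-magnitude boundary.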
Secondly, for frequencies above roughly 1500 Hz, ITDs become ambiguous. When the wavelength of the sound wave is smaller than the diameter of the head, an ITD can be greater than one period of the wave. In this case, comparing the phase difference between the waves arriving at each ear does not provide a unique ITD because an aliasing effect occurs: delays that differ by a whole number of periods produce the same phase difference. It follows that low-frequency components of a sound are much more important to us in evaluating ITDs, because the phase difference between the ears can still provide a unique ITD. Figure 2 shows the frequency response of the Rayleigh head model.
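The aliasing effect can be illustrated by listing every ITD that is consistent with one observed phase difference. The sketch below is a hedged illustration: the 0.66 ms upper bound on physically possible ITDs, the 0.4 ms example delay, and the function names are all assumptions introduced here.

```python
import math

MAX_ITD = 0.00066  # s, assumed largest ITD a human head can produce

def wrapped_phase(itd_s, frequency_hz):
    """Interaural phase difference in radians, wrapped to [-pi, pi]."""
    phase = 2.0 * math.pi * frequency_hz * itd_s
    return math.atan2(math.sin(phase), math.cos(phase))

def candidate_itds(phase_rad, frequency_hz, max_itd=MAX_ITD):
    """All ITDs within +/- max_itd that would give the observed wrapped phase."""
    period = 1.0 / frequency_hz
    base = phase_rad / (2.0 * math.pi) * period
    k_max = int(max_itd / period) + 1
    # Adding or subtracting whole periods leaves the wrapped phase unchanged.
    candidates = [base + k * period for k in range(-k_max, k_max + 1)]
    return [t for t in candidates if abs(t) <= max_itd]

TRUE_ITD = 0.0004  # s, hypothetical delay used only for this example
for f in (500.0, 3000.0):
    matches = candidate_itds(wrapped_phase(TRUE_ITD, f), f)
    print(f"{f:6.0f} Hz:", [round(t * 1e6) for t in matches], "us")
```

At 500 Hz only the true delay fits the observed phase, whereas at 3000 Hz several delays inside the plausible range fit equally well, which is the ambiguity described above.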
Figure 2: The frequency response of the Rayleigh head model plotted against normalized frequency (w*a/c), where w is the angular frequency, a is the radius of the head, and c is the speed of sound. The model is plotted for several values of theta, where theta is the difference between the azimuth of the source and the location of the ear. Notice that for sounds originating on the far side of the head from the ear (theta > 100 degrees), the model acts as a lowpass filter (Brown and Duda 1998).
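The lowpass behaviour described in the Figure 2 caption can be imitated with a one-pole, one-zero filter. The sketch below assumes the first-order approximation to the Rayleigh solution that is usually attributed to Brown and Duda (1998). The constants `ALPHA_MIN` and `THETA_MIN_DEG`, the head radius, and the function name are assumptions here and should be checked against that paper before being relied on.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s, assumed
HEAD_RADIUS = 0.0875     # m, assumed sphere radius
ALPHA_MIN = 0.1          # assumed minimum high-frequency gain factor
THETA_MIN_DEG = 150.0    # assumed angle at which the shadow is deepest

def head_shadow_gain_db(frequency_hz, theta_deg):
    """Gain in dB of a one-pole, one-zero head-shadow approximation.

    theta_deg is the angle between the source direction and the ear.
    """
    w0 = SPEED_OF_SOUND / HEAD_RADIUS          # normalizing angular frequency c/a
    alpha = (1.0 + ALPHA_MIN / 2.0) + (1.0 - ALPHA_MIN / 2.0) * math.cos(
        math.radians(theta_deg / THETA_MIN_DEG * 180.0))
    w = 2.0 * math.pi * frequency_hz
    # H(w) = (1 + j*alpha*w/(2*w0)) / (1 + j*w/(2*w0))
    h = complex(1.0, alpha * w / (2.0 * w0)) / complex(1.0, w / (2.0 * w0))
    return 20.0 * math.log10(abs(h))

for theta in (0, 90, 150):
    gains = [head_shadow_gain_db(f, theta) for f in (100, 1000, 5000, 20000)]
    print(f"theta = {theta:3d} deg:", [f"{g:6.1f} dB" for g in gains])
```

In this approximation an ear facing the source sees a mild high-frequency boost, while an ear on the far side sees high frequencies attenuated, which is the lowpass trend noted in the caption.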
The Rayleigh head model is appealing because it is relatively simple and has proven successful in simple localization experiments. However, it has long been recognized that it is not a complete model of our sonic cues for localization. Even at a constant elevation of the sound source, there are multiple locations that would produce identical ITDs and ILDs. If the elevation of the sound source is allowed to vary, the problem is compounded and we get a whole cone of points in three-dimensional space that would produce the same ITDs and ILDs. This phenomenon has been labeled the "cone of confusion" because of the ambiguity of the ITDs and ILDs generated from these locations. Figure 3 illustrates the "cone of confusion", where different spatial locations would produce identical ITDs and ILDs.
Figure 3: In the "cone of confusion", different spatial locations produce identical ITDs and ILDs.
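The ambiguity can be demonstrated numerically. This is a minimal sketch that assumes two point ears on the x axis and ignores the head entirely; the ear spacing, the speed of sound, and the helper names `ear_distances` and `point_on_cone` are illustrative assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed
EAR_SPACING = 0.175     # m, assumed distance between the ears

def ear_distances(source):
    """Distances from a source (x, y, z) to the left and right ears.

    The ears lie on the x axis, the left at -EAR_SPACING/2 and the right at +EAR_SPACING/2.
    """
    x, y, z = source
    half = EAR_SPACING / 2.0
    left = math.sqrt((x + half) ** 2 + y * y + z * z)
    right = math.sqrt((x - half) ** 2 + y * y + z * z)
    return left, right

def point_on_cone(distance_m, cone_angle_deg, rotation_deg):
    """Point whose direction makes cone_angle_deg with the interaural (x) axis.

    rotation_deg turns the point about that axis: 0 is in front, 90 above, 180 behind.
    """
    a = math.radians(cone_angle_deg)
    r = math.radians(rotation_deg)
    return (distance_m * math.cos(a),
            distance_m * math.sin(a) * math.cos(r),
            distance_m * math.sin(a) * math.sin(r))

for rotation in (0, 45, 90, 135, 180):
    left, right = ear_distances(point_on_cone(2.0, 60.0, rotation))
    itd_us = (left - right) / SPEED_OF_SOUND * 1e6
    print(f"rotation {rotation:3d} deg: left = {left:.4f} m, "
          f"right = {right:.4f} m, ITD = {itd_us:.1f} us")
```

Every point on the cone is the same distance from each ear, so positions in front of, above, and behind the listener yield the same ITD and the same distance-based level difference in this simplified model.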
The common belief is that the shape of the head, torso, and especially the outer ear (pinna) affects the spectral content of a sound as it propagates from free space to the inner ear. Additionally, because of the irregular shape of the upper body and pinna, it is theorized that these filtering effects are spatially dependent, varying with the location of the sound source. This filtering effect is commonly referred to as a "Head-Related Transfer Function" (HRTF). Because the HRTF varies with elevation and azimuth, there would be a unique HRTF for each sound source location.
Each individual has unique physical shapes and characteristics, so each individual will also have a unique set of HRTFs. These HRTFs can be measured for an individual by placing small microphones deep in the ear canal and playing bursts of broadband noise at different spatial locations around the fixed head. A sample of these measurements is shown in Figures 4 and 5. The HRTF for each ear and each recorded location can then be computed from the signal that was presented over the speakers and the signal recorded inside the ear canal.
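One common way to carry out that last computation is to divide the spectrum of the recording by the spectrum of the played signal. The sketch below is a simulation under stated assumptions, not the procedure used for the measurements described here: the "ear" is a made-up six-tap impulse response, the recording is noiseless, circular convolution stands in for the real acoustic path, and a small direct DFT is used so that only the standard library is needed.

```python
import cmath
import random

def dft(signal):
    """Direct discrete Fourier transform (O(n^2), adequate for short signals)."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse of dft()."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def circular_convolve(x, h):
    """Circular convolution, standing in for the path from speaker to ear canal."""
    n = len(x)
    return [sum(h[m] * x[(t - m) % n] for m in range(len(h))) for t in range(n)]

def estimate_transfer_function(played, recorded):
    """Estimate H = Y / X bin by bin (assumes no bin of X is zero)."""
    return [y / x for x, y in zip(dft(played), dft(recorded))]

rng = random.Random(0)
n = 64
played = [rng.gauss(0.0, 1.0) for _ in range(n)]   # broadband noise burst
true_hrir = [0.0, 0.0, 0.9, -0.4, 0.2, 0.05]       # hypothetical impulse response
recorded = circular_convolve(played, true_hrir)

hrtf = estimate_transfer_function(played, recorded)
recovered = [v.real for v in idft(hrtf)]
print([round(v, 3) for v in recovered[:8]])
```

Broadband noise is a convenient test signal for this division because it puts energy in every frequency bin. Real recordings contain noise, so a practical version would typically average over repeated bursts or regularize the division.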
These computed HRTFs for a location would contain both the ITD and ILD cues (described in the duplex theory earlier) and the spectral filtering effects, primarily due to the pinna. Because the elevation of the sound source relative to both ears will be the same, the spectral filtering effects would also be expected to be very similar. With this spectral cue being roughly the same at both ears, we don't have a significant interaural frequency difference to compare between the ears. This explains the empirical observation that, lacking visual cues, we are much better at localizing sound in the azimuthal plane than in the elevation plane.
Most people describe sounds presented over headphones as originating from inside the head. If our measured HRTF contains all of the information we use to localize a sound, presumably we could spatialize a sound by filtering it with the HRTFs measured for an individual at a given location. When this filtered sound is played over headphones to that individual, it should sound as if it were coming from a specific spatial location outside of the head. Indeed, researchers have been able to produce this effect in laboratory experiments.
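In the time domain, filtering with an HRTF amounts to convolving the sound with the corresponding impulse response for each ear. The following is a minimal sketch of that step. The two impulse responses are hypothetical placeholders (a delayed, quieter left ear for a source on the right), not measured data, and the function names are illustrative.

```python
def convolve(signal, impulse_response):
    """Linear convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def spatialize(mono, hrir_left, hrir_right):
    """Filter one mono signal with a pair of impulse responses, one per ear."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)

# Hypothetical impulse responses for a source to the listener's right:
# the right ear hears the sound first and louder, the left ear 18 samples
# later (about 0.4 ms at an assumed 44.1 kHz sample rate) and quieter.
hrir_right = [1.0, 0.3, 0.1] + [0.0] * 18
hrir_left = [0.0] * 18 + [0.5, 0.25, 0.1]

click = [1.0, 0.0, 0.0, 0.0]
left, right = spatialize(click, hrir_left, hrir_right)

first_left = next(i for i, v in enumerate(left) if v != 0.0)
first_right = next(i for i, v in enumerate(right) if v != 0.0)
print("first nonzero sample: left =", first_left, "right =", first_right)
```

The two output lists form the left and right headphone channels. With measured impulse responses in place of the placeholders, the same two convolutions would carry the ITD, ILD, and spectral cues together.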
Some research groups have investigated the effects of synthesizing spatial audio using distorted versions of the measured HRTFs. One group of researchers repeated experiments using progressively "smoothed" versions of the individual's HRTFs. It was found that the HRTFs could be smoothed drastically, retaining only the gross features of the original measurements, and still be surprisingly effective at generating spatialized sound. A sample of a smoothed HRTF from that experiment is depicted in Figure 6. Other researchers have used HRTF measurements from one individual to synthesize spatial audio for another individual. While these subjects did not perform as well at localization tasks as when their own HRTF measurements were used, they were still able to perform reasonably well.
Figure 6: Spectral smoothing of an individual's HRTF. Even with a large amount of smoothing, the HRTF was effective at generating spatialized sound (Kulkarni and Colburn 1998).
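To give a feel for what smoothing a magnitude response means, the sketch below averages the dB values of neighboring frequency bins. This is a plain moving average chosen for illustration; the smoothing method of the cited study is not described in this text and may differ. The example magnitudes and the function name are hypothetical.

```python
import math

def smooth_log_magnitude(magnitudes, half_width):
    """Smooth a magnitude response by averaging its dB values over nearby bins.

    Each output bin is the mean of up to 2 * half_width + 1 input bins,
    with the window truncated at the edges of the spectrum.
    """
    db = [20.0 * math.log10(max(m, 1e-12)) for m in magnitudes]
    smoothed = []
    for k in range(len(db)):
        lo = max(0, k - half_width)
        hi = min(len(db), k + half_width + 1)
        smoothed.append(sum(db[lo:hi]) / (hi - lo))
    # Convert back from dB to linear magnitude.
    return [10.0 ** (v / 20.0) for v in smoothed]

# Hypothetical magnitude response with a broad peak and one narrow notch.
response = [1.0, 1.2, 1.6, 2.0, 1.6, 0.05, 1.4, 1.1, 1.0, 0.9]
for half_width in (0, 1, 3):
    out = smooth_log_magnitude(response, half_width)
    print(f"half_width = {half_width}:", [round(v, 2) for v in out])
```

Widening the window fills in the narrow notch while the broad peak remains visible, which mirrors the idea of keeping only the gross features of the measurement.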