Acoustic localization of a speaker |
|||||||
Application No. | EP07007817.5 | Filing date | 2007-04-17 | Publication No. | EP1983799B1 | Publication date | 2010-07-07 |
Applicant | Harman Becker Automotive Systems GmbH; | Inventors | Buck, Markus; Haulick, Tim; Schmidt, Gerhard; Wolff, Tobias; | ||||
Abstract | |||||||
Claims | |||||||
Description | The present invention relates to the localization of speakers, in particular speakers who communicate with remote parties by means of hands-free sets or who use a speech recognition means. More specifically, it relates to the localization of a person or a speaker by means of the transmission and reception of acoustic signals.
The localization of one or more speakers (communication parties) is important in many electronically mediated communication situations in which multiple microphones, e.g., microphone arrays or distributed microphones, are used. For example, the intelligibility of speech signals that represent utterances of users of hands-free sets and are transmitted to a remote party depends heavily on an accurate localization of the speaker. If accurate localization of a near-end speaker fails, the transmitted speech signal exhibits a low signal-to-noise ratio (SNR) and may even be dominated by an undesired perturbation caused by a noise source located in the vicinity of the speaker or in the same room in which the speaker uses the hands-free set.
Audio and video conferences are other examples in which accurate localization of the speaker(s) is mandatory for successful communication between near and remote parties. The quality of sound captured by an audio conferencing system, i.e., the ability to pick up voices and other relevant audio signals with great clarity while eliminating irrelevant background noise (e.g., an air conditioning system or localized perturbation sources), can be improved by a directionality of the voice pick-up means. In the context of speech recognition and speech control, the localization of a speaker is important in order to provide the speech recognition means with speech signals exhibiting a high signal-to-noise ratio, since otherwise the recognition results are not sufficiently reliable.
Acoustic localization of a speaker is based on the detection of transit time differences of sound waves representing the speaker's utterances and allows the direction of the speaking person to be determined. Determining the distance of the speaking person is more difficult, since the speaker may be in the far field of the array. A large spatial extent of the array or of the distributed microphones is therefore necessary to detect the distance.
In the art, microphone arrays are used in combination with beamforming means. The beamformer can be an adaptive weighted-sum beamformer that combines preprocessed, in particular time-delayed, microphone signals $x_{T,m}$ of $M$ microphones to obtain one output signal $Y_w$ with an improved SNR:
$$Y_w = \sum_{m=1}^{M} a_m \, x_{T,m}$$
Beamforming must be temporally adapted in the case of a moving speaker. In this case, the weights $a_m$ are not time-independent as in a conventional delay-and-sum beamformer but have to be recalculated repeatedly, e.g., to maintain sensitivity in the desired direction and to minimize sensitivity in the directions of noise sources.
However, this kind of localization can only be performed while the speaker is actually speaking. This implies that, in particular in the case of a quickly moving speaker, the beamforming means needs some time to target the speaker after each speech pause, which easily results in a distorted transmission of the speech signals representing the beginning of an utterance after a pause. Moreover, the above-mentioned method for localization of a speaker is error-prone in acoustic rooms that exhibit significant reverberation. Therefore, there is a need for a more reliable method for localization of a speaker that, in particular, does not depend on the speaker's actual utterances.
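The weighted sum of time-delayed microphone signals described above can be illustrated with a short Python sketch. It is only a toy model under stated assumptions: the delays are whole samples, the weights are fixed rather than adaptive, and the function name `weighted_sum_beamformer` is hypothetical and does not come from the patent.

```python
def weighted_sum_beamformer(signals, delays, weights):
    """Combine M microphone signals: y(n) = sum_m a_m * x_m(n - d_m).

    signals: list of M equal-length lists of samples
    delays:  list of M non-negative integer delays d_m (in samples), assumed
    weights: list of M weights a_m
    """
    length = len(signals[0])
    output = [0.0] * length
    for x_m, d_m, a_m in zip(signals, delays, weights):
        for n in range(d_m, length):
            output[n] += a_m * x_m[n - d_m]
    return output


# A unit pulse reaches microphone 0 at n=5, microphone 1 at n=4, microphone 2 at n=3.
pulse_at = [5, 4, 3]
signals = [[1.0 if n == p else 0.0 for n in range(10)] for p in pulse_at]

# Delays that align the three pulses at n=5; equal weights 1/M.
aligned = weighted_sum_beamformer(signals, delays=[0, 1, 2], weights=[1 / 3] * 3)
# Delays for a different look direction leave the pulses spread out.
misaligned = weighted_sum_beamformer(signals, delays=[2, 1, 0], weights=[1 / 3] * 3)
```

When the delays match the direction of arrival, the pulses add up coherently; for a mismatched direction each pulse contributes only its own weight, which is the effect the beamformer exploits to improve the SNR.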
In the following, a method for localizing a person (a speaker) in a room in which at least one loudspeaker and at least one microphone array are located is described; it is not part of the invention but is helpful for understanding it. According to this comparative example, the method comprises outputting sound by the at least one loudspeaker such that the sound is at least partly reflected by the person (the speaker); detecting, by the microphone array, the sound output by the at least one loudspeaker and at least partly reflected by the person (the speaker) to obtain microphone signals for each of the microphones constituting the microphone array; and determining the person's (speaker's) direction towards and/or distance from the microphone array on the basis of the microphone signals.
The position of the speaker is thus determined from the sound (acoustic signal) that is output by the at least one loudspeaker and reflected by the body of the speaker, and not from the speaker's actual utterance as in the art. The reflections arrive at the microphone array after different sound transit times and from different directions. From the reflections of the loudspeaker sound by the speaker's body, both the speaker's direction towards and distance from the microphone array can be determined.
Herein, the expression "speaker" means a person who may or may not be actually speaking (he is at least expected to speak at some time, since he is using some communication system). The localization of the speaker can be completed before he actually starts speaking, and thus the relevant parameters of the speech signal processing that depend on the speaker's position (e.g., the steering angle of a beamformer) can be adapted before the speaker starts speaking.
Thereby, it is ensured that the very beginning of the utterance can be transmitted to a remote party with a high SNR or that a speech recognition means is able to reliably recognize the very beginning of a detected verbal utterance. The method can easily be implemented without significant cost in communication systems that already include at least one loudspeaker and a microphone array, e.g., audio or video conference rooms or living rooms provided with an advanced voice control for a hi-fi device. Commonly available hands-free sets also include a loudspeaker and a microphone array and can easily be adapted for implementation of the inventive method.
In particular, more than one loudspeaker can be used, each emitting sound in the form of an audio signal that is uncorrelated with the audio signals emitted by the other loudspeakers. For each of the emitted audio signals, the direction and distance of the speaker can be determined from the reflections by the speaker's body, and an average value can be determined from the results for each of the uncorrelated audio signals emitted by the multiple loudspeakers. Thereby, a more robust and reliable localization of the speaker may be achieved. No special signal is required: the signal from the remote party or music playback can be used. Thus, the localization operates without being noticed by the user.
The method not being part of the invention further comprises beamforming of the microphone signals to obtain at least one beamformed signal, wherein the speaker's direction towards and/or distance from the microphone array is determined on the basis of the at least one beamformed signal.
The beamforming can be performed by means of a delay-and-sum beamformer or a filter-and-sum beamformer. By beamforming the microphone signals for different directions, the direction from which sound reflected by the speaker arrives at the microphone array can readily be detected, since sound coming from this direction shows a higher energy level (and thus a higher SNR) than sound coming from other directions, apart from the direction towards the loudspeaker itself.
The method not being part of the invention may comprise estimating the impulse responses of the loudspeaker-room-microphone system (or the transfer functions for processing in Fourier space) for at least some of the beamformed signals, in which case the speaker's direction towards and/or distance from the microphone array can be determined on the basis of the estimated impulse responses. As known in the art, the impulse responses for different directions of the employed beamforming means represent a reliable measure of the energy levels of sound coming from the different directions and thus allow for a reliable localization of the speaker. The estimation of the impulse responses can be performed, e.g., by an echo compensation filtering means as known in the art.
In particular, the energy responses can be determined from the estimated impulse responses by calculating the squared magnitude of the impulse responses. By means of these energy responses, a direction-distance diagram can be generated and used for the localization of the speaker (see also the detailed discussion below). For example, the localization of the speaker can be achieved by simply determining the local maxima of the generated direction-distance diagram and assigning one of the local maxima to the speaker's position.
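The search for local maxima in a direction-distance diagram might look as follows in Python. This is a minimal sketch, not the patented procedure: the diagram is assumed to be a small grid `diagram[direction][k]` of energy values, the 8-neighbour comparison and the `threshold` parameter are illustrative choices, and the angle grid `angles_deg` is hypothetical.

```python
def local_maxima(diagram, threshold):
    """Return (direction, k, value) for cells above threshold that exceed all neighbours."""
    peaks = []
    num_dirs = len(diagram)
    for l in range(num_dirs):
        num_taps = len(diagram[l])
        for k in range(num_taps):
            value = diagram[l][k]
            if value <= threshold:
                continue
            neighbours = [
                diagram[i][j]
                for i in range(max(0, l - 1), min(num_dirs, l + 2))
                for j in range(max(0, k - 1), min(num_taps, k + 2))
                if (i, j) != (l, k)
            ]
            if all(value > v for v in neighbours):
                peaks.append((l, k, value))
    return peaks


# Toy diagram: 4 directions x 6 time indices of energy values (made-up numbers).
diagram = [
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.1, 0.2, 0.1, 0.0, 0.3, 0.0],
    [0.0, 0.1, 0.0, 0.4, 2.0, 0.5],
    [0.0, 0.0, 0.0, 0.2, 0.6, 0.1],
]
angles_deg = [-45, -15, 15, 45]  # hypothetical look directions

peaks = local_maxima(diagram, threshold=0.5)
candidates = [(angles_deg[l], k) for l, k, _ in peaks]
```

Each candidate pairs a look direction with a time index; which of the candidates belongs to the speaker still has to be decided, for instance by excluding the known position of the loudspeaker.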
In principle, different local maxima (representing sound sources) are present in the direction-distance diagram, due to the loudspeaker itself, the reflecting walls of the acoustic room in which the speaker, the loudspeaker and the microphone array are located, and the speaker. The stationary maxima can be determined beforehand (without any speaker present in the acoustic room). In particular, a reference direction-distance diagram representing the energy responses for the acoustic room without any person can be generated and stored. In this case, the method not being part of the invention comprises generating a reference direction-distance diagram and subtracting the direction-distance diagram and the reference direction-distance diagram from each other to obtain a differential direction-distance diagram; the speaker's direction towards and/or distance from the microphone array is then determined on the basis of the differential direction-distance diagram, e.g., by determining the local maxima of the energy responses. The energy responses are smoothed over k (i.e., within the impulse response interval, where k denotes the time index of the impulse response h(k)) in order to eliminate fine structure that is of no interest and would only impair the determination of local maxima.
The above-described comparative examples of the herein disclosed method can readily be implemented in existing hands-free sets. However, the signal processing has to be performed largely in real time, in particular both the beamforming and the estimation of the impulse responses. The latter is due to the fact that the estimation of the impulse responses of the loudspeaker-room-microphone system (e.g., by echo compensation filtering means) is based on the audio signals input to the at least one loudspeaker.
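The smoothing over k and the subtraction of the empty-room reference can be sketched in a few lines of Python. The moving-average window, the zero padding at the edges and the function names are assumptions made for illustration; the text does not prescribe a particular smoothing filter.

```python
def smooth(values, half_width=1):
    """Moving average over the time index k (zero-padded at both ends)."""
    size = 2 * half_width + 1
    padded = [0.0] * half_width + list(values) + [0.0] * half_width
    return [sum(padded[k:k + size]) / size for k in range(len(values))]


def differential_diagram(current, reference, half_width=1):
    """Subtract the smoothed empty-room diagram from the smoothed current one.

    Both diagrams are indexed as diagram[direction][k] and hold energy responses.
    """
    result = []
    for cur_row, ref_row in zip(current, reference):
        cur_s = smooth(cur_row, half_width)
        ref_s = smooth(ref_row, half_width)
        result.append([c - r for c, r in zip(cur_s, ref_s)])
    return result


# Two directions, eight time indices. The empty room shows only the loudspeaker
# (direction 0, k=1).
reference = [[0.0, 9.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
             [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
# With a person present, an extra reflection appears in direction 1 around k=4.
current = [[0.0, 9.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0]]

diff = differential_diagram(current, reference)
peak = max(
    ((l, k) for l in range(len(diff)) for k in range(len(diff[l]))),
    key=lambda lk: diff[lk[0]][lk[1]],
)
```

The stationary loudspeaker peak cancels in the difference, so the remaining maximum points at the direction and time index of the reflection caused by the person.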
The actual localization of the speaker can be performed at an arbitrary time by reading the estimated impulse responses (which are determined in real time and may be buffered) and, e.g., generating a direction-distance diagram or a differential direction-distance diagram based on the read impulse responses. Thus, the present invention provides a method with a lower demand for computational resources.
According to claim 1, the method for localization of a speaker in a room in which at least one loudspeaker and at least one microphone array are located comprises the steps of: outputting sound by the at least one loudspeaker such that the sound is at least partly reflected by the person; detecting, by the microphone array, the sound output by the at least one loudspeaker and at least partly reflected by the person to obtain microphone signals for each of the M microphones constituting the microphone array; estimating the impulse responses of the loudspeaker-room-microphone system for at least some of the M microphone signals and beamforming the estimated impulse responses for a number of predetermined directions to obtain L > M beamformed impulse responses; determining the energy response from the L beamformed impulse responses; and determining the speaker's direction towards and/or distance from the microphone array on the basis of the determined energy response.
For M microphone signals, only M impulse responses have to be estimated; the L beamformed signals (M < L) are obtained by applying the beamforming means to these impulse responses (see the detailed description below) and need not be determined in real time but can rather be computed off-line, e.g., only every few seconds. Moreover, the beamforming can be restricted to some time interval of the entire impulse response h(k), where k denotes the time index of the impulse response.
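The estimation step for one microphone signal can be illustrated with a normalized least-mean-squares (NLMS) adaptive filter. Note that NLMS is an assumption here: the text only refers to echo compensation filtering means as known in the art and does not name a particular algorithm. The function names, the step size and the synthetic test signal are likewise hypothetical.

```python
import random


def nlms_estimate(reference, microphone, num_taps, step=0.5, eps=1e-8):
    """Estimate h(k) such that the microphone signal is approximately h * reference."""
    h = [0.0] * num_taps
    buffer = [0.0] * num_taps  # most recent loudspeaker samples, newest first
    for x_n, y_n in zip(reference, microphone):
        buffer = [x_n] + buffer[:-1]
        predicted = sum(h_k * b_k for h_k, b_k in zip(h, buffer))
        error = y_n - predicted
        norm = sum(b_k * b_k for b_k in buffer) + eps
        h = [h_k + step * error * b_k / norm for h_k, b_k in zip(h, buffer)]
    return h


def energy_response(h):
    """Squared magnitude of each impulse response coefficient."""
    return [abs(h_k) ** 2 for h_k in h]


# Synthetic check: a known response with a direct path (k=1) and a weaker
# reflection (k=4), excited by white noise played through the loudspeaker.
random.seed(0)
true_h = [0.0, 1.0, 0.0, 0.0, 0.5, 0.0]
reference = [random.uniform(-1.0, 1.0) for _ in range(2000)]
microphone = [
    sum(true_h[k] * reference[n - k] for k in range(len(true_h)) if n - k >= 0)
    for n in range(len(reference))
]

estimated_h = nlms_estimate(reference, microphone, num_taps=6)
energy = energy_response(estimated_h)
```

Because the filter adapts continuously to the loudspeaker signal, the current estimate can be buffered and read out later, which fits the off-line evaluation described above.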
For example, the beamforming of the impulse responses may be restricted to $k \in [k_{\min}, k_{\max}]$, where, e.g., $k_{\min}$ is determined from the signal transit time of a sound wave that is radiated by the loudspeaker directly (without reflections) to the microphone array, and $k_{\max}$ is determined from the signal transit time of a sound wave that is radiated by the loudspeaker to a distant wall of the acoustic room and reflected by this wall to the microphone array (maximum signal transit time).
As in the comparative examples described above, in which beamforming is performed before the estimation of the impulse responses, the energy responses can be determined from the beamformed impulse responses and used for generating a direction-distance diagram (or a differential direction-distance diagram). Subsequently, the speaker's direction towards and/or distance from the microphone array is determined on the basis of the diagram thus generated, e.g., simply by determining the local maxima.
It may be preferred to filter the estimated or beamformed impulse responses, in particular by a bandpass filtering means, to obtain filtered impulse responses that are used for determining the respective energy responses. For instance, a frequency range for which the employed beamforming means shows a high directionality is extracted from the impulse responses by bandpass filtering. Thereby, the directional resolution can be increased and, in addition, the computational load can be reduced.
In another embodiment, a loudspeaker array is used to output sound that, after reflection by a speaker who is to be localized, is detected by the microphone array. Thus, according to this embodiment, the sound is consecutively output by a loudspeaker array in a respective one of a number of predetermined directions, and the microphone array is consecutively steered to the respective one of the number of predetermined directions by a beamforming means, i.e.
beamforming of the microphone signals is performed to obtain a beamformed signal for the respective one of the predetermined directions; the impulse responses of the loudspeaker-room-microphone system are estimated for at least some of the beamformed signals; and the speaker's direction towards and/or distance from the microphone array is determined on the basis of the estimated impulse responses. At each discrete time n (where n is the discrete time index of the microphone signals) one single direction is examined, and spatial scanning is performed by simultaneously steering both the loudspeaker array and the microphone array by respective beamforming means. By means of horizontal linear loudspeaker and microphone arrays whose centers are arranged on the same vertical axis, the acoustic room can efficiently be scanned in the horizontal direction. According to this embodiment, the beamformed microphone signals mainly contain directly arriving or reflected sound from the respective direction of origin of the sound and, compared to the previous examples, a smaller contribution of sound coming from other directions. Again, energy responses can be determined and a (differential) direction-distance diagram can be generated on the basis of the estimated impulse responses in order to localize the speaker.
The sound (audio signal) output by the at least one loudspeaker according to one of the preceding examples may be in an inaudible range, e.g., with a frequency above 20 kHz. The use of inaudible sound is particularly preferred for the embodiments comprising a steerable loudspeaker array, since steering the output sound in different directions might give rise to an artificial listening experience in the case of music or voice output by the loudspeakers.
The above-mentioned problem is also solved by a communication system according to claim 7.
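Steering a horizontal linear array through a set of predetermined directions, as in the scanning embodiment above, amounts to choosing one delay per array element and direction. The Python sketch below assumes a uniform linear array, a far-field plane wave, an angle measured from broadside and a speed of sound of 343 m/s; none of these values, nor the function name `steering_delays`, are taken from the patent.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def steering_delays(num_elements, spacing_m, angle_deg, sample_rate_hz):
    """Per-element delays (in samples) steering a uniform linear array to angle_deg.

    The angle is measured from broadside (perpendicular to the array axis).
    Delays are shifted so that the smallest one is zero.
    """
    theta = math.radians(angle_deg)
    raw = [m * spacing_m * math.sin(theta) / SPEED_OF_SOUND * sample_rate_hz
           for m in range(num_elements)]
    offset = min(raw)
    return [d - offset for d in raw]


# Scan a set of predetermined directions, one direction per step.
scan_angles = [-60, -30, 0, 30, 60]
delays_per_direction = {
    angle: steering_delays(num_elements=4, spacing_m=0.05,
                           angle_deg=angle, sample_rate_hz=16000)
    for angle in scan_angles
}
```

The resulting delays are generally fractional, so a practical implementation would need interpolation or oversampling; the same scan schedule could be used for the loudspeaker array and the microphone array when both are steered together.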
The present invention also provides a hands-free set, an audio or video conference system, a speech control means or a speech recognition means comprising or being identical with a communication system according to one of the above-mentioned examples. The inventive communication system can advantageously be incorporated in these devices and systems. By accurate localization of a speaker, the overall operation can be significantly improved. For instance, the microphone sensitivity and output volume of a hands-free set may be adjusted depending on the determined speaker's position. Moreover, the reliability of a speech control means or a speech recognition means can be significantly improved by steering a microphone array towards the determined speaker's position, thereby enhancing the quality of the detected speech signal representing a speaker's utterance.
Furthermore, the present invention provides a computer program product comprising one or more computer-readable media having computer-executable instructions for controlling and/or performing the steps of the examples of the herein disclosed method.
Additional features and advantages of the present invention will be described with reference to the drawings. In the description, reference is made to the accompanying figures, which are meant to illustrate comparative examples not being part of the present invention and preferred embodiments of the invention. It is understood that such embodiments do not represent the full scope of the invention.
The individual microphones of the microphone array 3 output microphone signals representing the detected sound to a signal processing means 4. The signal processing means 4 comprises a beamforming means, e.g., a delay-and-sum beamformer or a filter-and-sum beamformer, for beamforming the microphone signals $x_m(n)$, where n is the discrete time index of the microphone signals.
For instance, beamforming in L directions may be performed, and the impulse responses of the loudspeaker-room-microphone system can be compared for each of the L directions. The impulse responses can be determined by echo compensation filtering means as known in the art. When the beamforming means is directed towards the loudspeaker 2, the acoustic signal output by the loudspeaker 2 is detected directly and the impulse response is high. When the beamforming means is directed towards the speaker 1, the impulse response represents the sound reflected by the speaker 1 towards the microphone array 3, thereby indicating the directional angle of the speaker 1.
Moreover, if the position of the loudspeaker 2 is known, the distance of the speaker 1 from the microphone array 3 can be derived by detecting the time lag of the impulse response for the direction towards the speaker 1 with respect to the impulse response for the direction towards the loudspeaker 2. The time lag (difference in sound transit times) corresponds to the transit time of sound from the loudspeaker 2 to the speaker 1 and then to the microphone array 3. If, e.g., the loudspeaker 2 is located close to the microphone array 3, the distance of the speaker is obtained as half of the transit time multiplied by the speed of sound.
In particular, the localization of the speaker based on the impulse responses $h_l(k, n)$, where k is the discrete time index within the time interval of the impulse response, may be carried out as follows. For each direction $l = 1, \dots, L$ the energy responses $p_l(k, n) = |h_l(k, n)|^2$ are calculated and then preferably smoothed over k. Subsequently, the smoothed energy responses can be combined into a direction-distance diagram.
In an example not being part of the present invention, the beamforming means 5 performs beamforming for L directions, thereby scanning the acoustic room in which the loudspeaker 2, the speaker to be localized and the microphone array 3 are present.
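The distance estimate from the time lag (half of the transit time multiplied by the speed of sound, for a loudspeaker located close to the microphone array) is simple arithmetic and can be checked with a small Python example. The sampling rate, the time indices and the speed of sound of 343 m/s are assumed values chosen for illustration only.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed


def distance_from_lag(k_direct, k_reflection, sample_rate_hz):
    """Distance of the reflecting person, assuming the loudspeaker sits next to the array.

    k_direct:     time index of the peak for the direction towards the loudspeaker
    k_reflection: time index of the peak for the direction towards the person
    """
    lag_seconds = (k_reflection - k_direct) / sample_rate_hz
    return 0.5 * lag_seconds * SPEED_OF_SOUND


# A reflection peak 187 samples after the direct sound at 16 kHz:
# lag = 187 / 16000 s = 11.6875 ms, round trip = 4.0088 m, distance = 2.0044 m.
distance_m = distance_from_lag(k_direct=0, k_reflection=187, sample_rate_hz=16000)
```

If the loudspeaker is not close to the array, the lag corresponds to the path loudspeaker to person to array, and the known loudspeaker position would have to enter the geometry instead of the simple factor of one half.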
In this example, the beamforming means 5 is a delay-and-sum beamformer that delays the individual microphone signals $x_m(n)$ (where n is the discrete time index) from the M = 3 microphones constituting the microphone array 3 such that phase balance is achieved for the one of the L directions that is currently considered. The beamformed signal is the weighted sum of the delayed microphone signals, $\sum_{m=1}^{M} a_m \, x_m(n - d_m)$, with weight factors $a_m$ and delay parameters $d_m$. For each of the L directions, the impulse responses of the loudspeaker-room-microphone system are determined by an echo compensation filtering means 6. It is noted that such an echo compensation filtering means [...] The beamforming means 5 thus outputs spatially filtered impulse responses. If necessary, some oversampling of the microphone signals may be carried out in order to achieve a higher angular resolution. Since usually L > M, fewer impulse responses have to be estimated than in the example described above, in which beamforming is performed before the estimation of the impulse responses.
In particular, the beamforming can be restricted to a relevant time interval $[k_{\min}, k_{\max}]$ of the impulse response, where, e.g., $k_{\min}$ is determined from the signal transit time of a sound wave that is radiated by the loudspeaker 2 directly (without reflections) to the microphone array 3, and $k_{\max}$ is determined from the signal transit time of a sound wave that is radiated by the loudspeaker 2 to a distant wall of the acoustic room and reflected by the wall to the microphone array 3 (maximum signal transit time).
By means of the calculated smoothed energy responses, a two-dimensional direction-distance diagram is generated 150. In the differential direction-distance diagram, local maxima of the smoothed energy responses are determined 170 in order to localize one or more speakers in terms of the distance from and the angular direction towards the microphone array.
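Beamforming the M estimated impulse responses into L directions, restricted to a time interval of the impulse response, can be sketched as follows. The sketch assumes a delay-and-sum structure with integer delays and a speed of sound of 343 m/s; the helper names `tap_index` and `beamform_impulse_responses`, the path lengths and the delay table are hypothetical.

```python
def tap_index(path_length_m, sample_rate_hz, speed_of_sound=343.0):
    """Time index k corresponding to a sound path of the given length."""
    return round(path_length_m / speed_of_sound * sample_rate_hz)


def beamform_impulse_responses(h, steering_delays, weights, k_min, k_max):
    """Return one beamformed impulse response per direction for k_min..k_max.

    h:               M estimated impulse responses, h[m][k]
    steering_delays: steering_delays[l][m], integer delay for direction l, microphone m
    weights:         M weights a_m
    """
    beamformed = []
    for delays in steering_delays:
        row = []
        for k in range(k_min, k_max + 1):
            total = 0.0
            for h_m, d_m, a_m in zip(h, delays, weights):
                if 0 <= k - d_m < len(h_m):
                    total += a_m * h_m[k - d_m]
            row.append(total)
        beamformed.append(row)
    return beamformed


# Interval bounds from assumed path lengths: 0.5 m direct path, 10 m via a far wall.
k_min_room = tap_index(0.5, 16000)
k_max_room = tap_index(10.0, 16000)

# Toy data: M = 2 impulse responses; the same reflection arrives at k=6 and k=5.
h = [[0.0] * 10, [0.0] * 10]
h[0][6] = 1.0
h[1][5] = 1.0

# L = 3 candidate directions (L > M), each described by one delay per microphone.
steering_delays = [[1, 0], [0, 0], [0, 1]]
beams = beamform_impulse_responses(h, steering_delays, weights=[0.5, 0.5], k_min=4, k_max=8)
energies = [[v * v for v in row] for row in beams]
best_direction = max(range(len(energies)), key=lambda l: max(energies[l]))
```

Only the M impulse responses need to be tracked continuously; the loop over the L directions operates on buffered coefficients and can therefore run at a much lower rate than the audio sampling rate.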
All previously discussed embodiments are not intended as limitations but serve as examples illustrating features and advantages of the invention. It is to be understood that some or all of the above-described features can also be combined in different ways. |