Regular Articles, Vol. 23, No. 1, pp. 50–55, Jan. 2025. https://doi.org/10.53829/ntr202501ra1

AI-powered Beamforming for Listening to Moving Talkers

Abstract

Speech enhancement extracts the voice we want to listen to from background noise and is essential for speech applications, such as automatic speech recognition, to work effectively in noisy environments. This article introduces a novel beamforming technique that tracks the speaker's movement and keeps extracting the target speaker's voice, even when the speaker is moving while talking. Beamforming requires spatial information about the target source and interfering signals, such as the direction of arrival. We discuss our previously proposed method for estimating time-varying spatial information by incorporating powerful artificial intelligence technology. This method enables high-performance beamforming even when the target speaker is moving.

Keywords: beamforming, artificial intelligence, moving sources

1. Introduction

Speech processing technology has greatly progressed, and speech interfaces, such as smart speakers, have become widely used in our daily lives. We are constantly immersed in many types of sounds, including ambient noise and interfering speakers. Such interfering signals can degrade the performance of speech processing applications such as automatic speech recognition (ASR). To mitigate such degradation, speech interfaces often use multiple microphones (a microphone array) and apply array signal processing techniques to suppress the interfering signals and enhance the target sources' signals.

Beamforming [1] (also called spatial filtering) has been an active research field for several decades and has been used extensively to design speech enhancement systems for hearing aids [2] and speech interfaces [3]. Beamforming exploits spatial information about the target and interfering sources and emphasizes the signals coming from the target source's direction while suppressing the interfering signals coming from other directions. It plays an important role in developing far-field speech processing applications and has been used as the de facto standard front-end in meeting analysis (i.e., far-field speaker diarization and ASR) challenges [4].

In realistic situations, such as applications to hearing aids or smart speakers, the target and interfering sources may move, e.g., the talkers walk around the room while speaking (see Fig. 1). However, most conventional studies assume a static situation in which the sources do not move within an utterance. Applying such conventional beamformers to dynamic situations in which the sources move results in sub-optimal performance.
To construct effective beamforming filters, accurate estimation of the spatial information (i.e., information on the direction of the target and interfering sources) is essential. However, in dynamic situations the spatial information changes at every moment; thus, estimating such time-varying spatial information is more difficult than in static situations. We introduce our recent study [5], in which we proposed a method for estimating the spatial information of moving sources by incorporating powerful artificial intelligence (AI) technology. By using developments in the AI field, i.e., the attention mechanism [6], our method allows the beamformer to steer its directivity at each time frame toward the position of the moving source, i.e., enabling source tracking.

2. Overview of beamforming

In speech processing, beamforming filters are often designed in the short-time Fourier transform (STFT) domain. Let Yt,ƒ ∈ ℂC be a vector comprising the C-channel STFT coefficients of the observed signal at a time-frequency bin (t, ƒ), which is recorded using a C-channel microphone array and potentially contaminated with ambient noise and interfering speakers. The objective of beamforming is to recover (enhance) the target source signal from such a multichannel noisy observation. Given the observed noisy signal Yt,ƒ, beamforming estimates the enhanced signal Ŝt,ƒ ∈ ℂ by linearly filtering the observed signal Yt,ƒ with a beamforming filter wt,ƒ ∈ ℂC as

    Ŝt,ƒ = wt,ƒ† Yt,ƒ,  with  wt,ƒ = FBF(ΦSt,ƒ, ΦIt,ƒ),    (1)

where † denotes the conjugate transpose. Here, FBF(⋅) denotes the beamforming function [1], which indicates that the beamforming filter wt,ƒ is constructed on the basis of the spatial information (i.e., spatial covariance matrices) ΦSt,ƒ and ΦIt,ƒ of the target and interfering signals, respectively. From Eq. (1), the problem of constructing effective beamforming filters can be considered the problem of estimating accurate spatial information.

3. AI-powered neural beamforming for moving sources

Beamforming relies on spatial statistics of the target source and interfering signals, which are typically computed over the entire observed signal, e.g., 10 seconds long. However, if the source moves, considering the entire signal results in a beamforming filter that does not steer in the correct source direction. We instead propose to estimate the spatial statistics using only the observations that are relevant at a given time, which allows designing a beamforming filter that can track the source movement. Finding which part of the observation to use at a given time is challenging because it depends on the way the source moves (e.g., velocity and trajectory). Thus, we designed a powerful AI model to estimate which observations to use to obtain optimal beamforming filters.

In our previous study [5], we found that conventional estimation methods for spatial information can be expressed using the following general formulation:

    Φνt,ƒ = Στ aνt,τ ψντ,ƒ  (summing over τ = 1, …, T),    (2)

where ν ∈ {S, I}, T denotes the number of time frames, and ψντ,ƒ denotes the instantaneous spatial information, i.e., the spatial information computed at each time frame τ. By accumulating the instantaneous statistics with weight coefficients aνt,τ, we can obtain an estimate of the spatial information that is statistically more reliable. The weight coefficients indicate the importance of each frame τ in estimating the spatial information at a given time frame t. Equation (2) is analogous to the attention mechanism [6], which is the key component of the success of AI technology. Thus, we refer to these weight coefficients as attention weights.
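To make Eqs. (1) and (2) concrete, the following is a minimal NumPy sketch of the computation. The article does not specify which beamforming function FBF is used, so a time-varying minimum variance distortionless response (MVDR) beamformer in a common formulation is assumed here; all function names and array shapes are illustrative, not the authors' implementation.

```python
import numpy as np

def instantaneous_scm(Y):
    """Per-frame spatial covariance psi[t, f] = Y Y^H; Y has shape (T, F, C)."""
    return np.einsum("tfc,tfd->tfcd", Y, Y.conj())

def accumulate_scm(psi, A):
    """Eq. (2): Phi[t, f] = sum_tau A[t, tau] * psi[tau, f]; A has shape (T, T)."""
    return np.einsum("ts,sfcd->tfcd", A, psi)

def mvdr_filter(Phi_S, Phi_I, ref_mic=0, eps=1e-6):
    """Assumed F_BF: time-varying MVDR filter w[t, f] of shape (T, F, C)."""
    C = Phi_S.shape[-1]
    Phi_I = Phi_I + eps * np.eye(C)           # regularize before inversion
    num = np.linalg.solve(Phi_I, Phi_S)       # Phi_I^-1 Phi_S, batched over (t, f)
    trace = np.trace(num, axis1=-2, axis2=-1)[..., None]
    return num[..., ref_mic] / (trace + eps)  # column of the reference microphone

def apply_beamformer(w, Y):
    """Eq. (1): S_hat[t, f] = w[t, f]^dagger Y[t, f], shape (T, F)."""
    return np.einsum("tfc,tfc->tf", w.conj(), Y)
```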
In previous studies, the attention weights for beamforming were determined by heuristic and deterministic rules, such as in an online [7] or blockwise [8] manner. However, such simple rules are not necessarily optimal for handling a variety of moving patterns, such as different trajectories/velocities of sources and different ambient noise conditions. To estimate optimal attention weights in various situations, we thus proposed a neural-network-based attention weight estimation model Aν = NN(Y), where Aν ∈ ℝT×T collects the attention weights aνt,τ, Y ∈ ℂC×T×F denotes the observed multichannel signal, F denotes the number of frequency bins, and NN(⋅) denotes a neural-network-based function, i.e., a self-attention model [6]. On the basis of the supervised learning (data-driven) framework, we train the self-attention model to estimate attention weights that extract the target source signals while removing the interfering signals. By incorporating a variety of moving patterns in the training set, the training procedure enables the model to estimate attention weights that are optimal for tracking the positions of moving sources in various situations. Figure 2 illustrates the behavior of the attention weights as the target speaker moves. The green and red boxes show examples of attention weights when the target speaker is talking from directions corresponding to about 45 and 135 degrees, respectively. The attention weights take high values for time frames where the target speaker is close to the current direction. As seen in the figure, the attention weights change when the target speaker moves, which enables us to track the speaker and estimate reliable spatial information at each time frame.
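The article does not detail the architecture of NN(⋅). The sketch below shows only the core self-attention computation that produces a T × T weight matrix from per-frame features; the feature choice and projections are untrained random placeholders, whereas the actual model is trained on simulated moving-source data.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(feats, d_k=64, seed=0):
    """feats: (T, D) real per-frame features (e.g., log-magnitude spectra)."""
    T, D = feats.shape
    rng = np.random.default_rng(seed)
    Wq = rng.standard_normal((D, d_k)) / np.sqrt(D)  # untrained placeholder
    Wk = rng.standard_normal((D, d_k)) / np.sqrt(D)  # projections (learned in practice)
    Q, K = feats @ Wq, feats @ Wk
    scores = Q @ K.T / np.sqrt(d_k)  # (T, T) frame-to-frame similarities
    return softmax(scores, axis=-1)  # rows sum to one: weights a[t, tau] of Eq. (2)
```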
Figure 3 shows the processing flow of our attention-based beamforming method. First, given the observed noisy signal Yt,ƒ, the instantaneous spatial information ψντ,ƒ is computed. Next, the attention weight estimation model computes the attention weights Aν. Then, the spatial information Φνt,ƒ is computed from the instantaneous spatial information and the attention weights using Eq. (2). Finally, the beamforming filters wt,ƒ are constructed from this spatial information and applied to the observed signal, generating the enhanced signal Ŝt,ƒ via Eq. (1).
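A hypothetical end-to-end driver chaining the sketches above illustrates this flow; the helper names are the illustrative functions defined earlier, random data stands in for a real recording, and distinct seeds stand in for the trained model's separate target/interference weights.

```python
import numpy as np

T, F, C = 100, 257, 5  # frames, frequency bins, microphones
Y = (np.random.randn(T, F, C) + 1j * np.random.randn(T, F, C)) / np.sqrt(2)

psi = instantaneous_scm(Y)                    # step 1: instantaneous statistics
feats = np.log1p(np.abs(Y[..., 0]))           # reference-mic log-magnitude features
A_S = attention_weights(feats, seed=0)        # step 2: attention weights (target)
A_I = attention_weights(feats, seed=1)        # ... and for the interference
Phi_S = accumulate_scm(psi, A_S)              # step 3: Eq. (2) accumulation
Phi_I = accumulate_scm(psi, A_I)
w = mvdr_filter(Phi_S, Phi_I)                 # per-frame beamforming filters
S_hat = apply_beamformer(w, Y)                # step 4: Eq. (1), shape (T, F)
```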
4. Performance evaluation

We conducted experiments to evaluate the speech enhancement performance of our method. We created an evaluation dataset consisting of simulated moving sources under noisy conditions. In each recording, a single talker speaks while moving, and the microphone observations are contaminated with ambient noise from public environments such as cafés, street junctions, public transportation, and pedestrian areas. We used a rectangular five-channel microphone array (mounted on a tablet device). Figure 4 shows an objective speech enhancement measure, i.e., the signal-to-distortion ratio (SDR) [9], for the evaluated signals: 1) unprocessed signals, 2) beamformed signals (conventional), and 3) beamformed signals (ours). We observed that our method significantly improved speech enhancement performance compared with the conventional method.
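For reference, a simplified SDR can be computed as follows. This is only the basic energy ratio in decibels, not the full BSS-Eval metric of [9]; higher values mean the enhanced signal is closer to the clean target.

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """10 log10(||s||^2 / ||s - s_hat||^2) for time-domain signals."""
    distortion = reference - estimate
    return 10 * np.log10(
        (np.sum(reference**2) + eps) / (np.sum(distortion**2) + eps)
    )
```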
Figure 5 shows an example of the evaluated signals recorded in a real-world environment such as that in Fig. 1. In the latter part of the waveform produced by the conventional method, the amplitude of the beamformed signal becomes smaller, which suggests a failure of source tracking (i.e., the beamformer's directivity is not steered toward the position of the moving source). Our method preserves the amplitude of the target source signal over the entire utterance, which demonstrates successful source tracking.
5. Future perspectives

Our attention-based beamforming method can automatically track a speaker's movement, enabling the beamformer to keep listening to a talker while they move. Our method opens new possibilities for speech interfaces, hearing aids, and robots, which could lead to a future in which people and machines interact more naturally in any situation, such as when multiple speakers talk while freely walking around. We built a demo system using our method and confirmed that it works on real-world recordings. However, several issues need to be addressed before the method can be widely used. For example, our current method incurs high computational costs in the attention weight estimation model; thus, low-latency processing needs to be investigated to allow quick responses. In addition, the training stage of our method requires simulating microphone observations of moving sources. Improving the realism of these simulations may be important to further improve performance and make the method more robust to various recording conditions.

References