Toward Understanding Mechanisms of Sensorimotor Processing in Speech Production
We report on the online mechanisms of sensory feedback-based speech motor control. The neural mechanisms that govern human speech production must be understood in order to design user-friendly remote speech communication systems. We examined the contributions of somatosensory and auditory feedback by observing the involuntary responses induced by perturbed sensory feedback during speech production tasks. The experimental results provide evidence that both the somatosensory and the auditory feedback associated with self-produced speech strongly affect the temporal dynamics of articulatory movements.
In humans, speech production is a highly skilled form of muscular control that develops gradually as a child matures. During development, several kinds of sensory feedback play important roles in monitoring how well articulation is organized as various phonemes are produced. The sources for monitoring articulatory movements consist of cutaneous/somatosensory information on the status of the respiratory, laryngeal, velopharyngeal, and articulatory subsystems, and auditory information on the characteristics of the resulting acoustic output.
1.1 Cutaneous/somatosensory feedback in speech production
The role of cutaneous and/or somatosensory feedback in speech motor control has been investigated in a series of studies on the compensatory articulatory movements of the upper lip induced by perturbing the jaw or lower lip during production of the bilabial plosive consonants /p/ and /b/. These compensatory movements effectively achieve the intended acoustic output despite unpredictable perturbations. Because the corresponding electromyographic (EMG) activity of the primary upper-lip muscle, the orbicularis oris superior (OOS), increased, one might speculate that an active compensatory mechanism driven by somatosensory feedback contributes to generating these movements. However, the time delays due to nerve conduction and mechanochemical dynamics pose a problem for explaining how sensorimotor coordination could rapidly regulate such fast speech movements.
We have found that during production of the bilabial fricative consonant /F/, which requires precise control of the aperture between the upper and lower lips, the upper lip shifts downward rapidly in response to a sudden jaw-lowering perturbation, thereby maintaining the labial aperture. Although the initial phase of the upper-lip shift is generated by the mechanical linkage of perioral dynamics, the later phase is likely regulated in part by a reactive muscle response. Indeed, upper-lip muscle activity began to increase 48.25±1.2 ms after the jaw perturbation. This latency is longer than that of the perioral reflex (14–17 ms), which is mediated within the brainstem alone, and shorter than the voluntary reaction time of the jaw after a stimulus is perceived (150±13 ms). These findings suggest that cortical processing is involved in this reflexive compensatory adjustment of speech articulation, as has been shown for the long-latency stretch reflex. We therefore assessed the effect of transcranial magnetic stimulation (TMS) over the motor cortex on the reflexive compensatory adjustment of speech articulation. Note that TMS can, in some cases, transiently disrupt or suspend cortical neural processing, and in other cases enhance cortical neural excitability. Our experimental results are presented in Section 2.1.
1.2 Auditory feedback in speech production
With regard to the role of auditory feedback in speech motor control, it is well known that speaking under delayed auditory feedback (DAF) leads to various types of speech disfluency, e.g., increased articulatory errors, lengthened duration, increased volume, and raised fundamental frequency. Such disfluencies may arise from several different types of voluntary and involuntary responses to DAF. The Lombard effect (or Lombard reflex) is a frequently cited example of an auditory-induced automatic motor response to a change in background noise level: speakers involuntarily raise their vocal intensity as the noise level increases. Although such reflexive mechanisms are potential sources of DAF-induced speech disfluencies, precisely how delayed auditory input of self-produced speech disrupts speech motor control has not yet been fully elucidated.
Various studies using altered auditory feedback have suggested that acoustic information is critical for learning and maintaining vowel production and voice pitch control. Evidence from humans and non-human primates also shows that neural activity in the auditory cortex is modulated by self-produced vocalization. However, it remains debated whether such neural mechanisms also help stabilize rapid and complex speech motor control. Auditory feedback may serve as an immediate source for the dynamic control of speech articulation, analogous to the rapid adjustment of labial constriction based on cutaneous and/or somatosensory information. We examined the online control mechanism for articulatory lip movement during repetition of the bilabial plosive syllable /pa/ by suddenly shifting the auditory feedback timing ahead of time or with a delay and/or by replacing the feedback syllable with other syllables. Our experimental results are presented in Section 2.2.
2. Mechanisms of sensory feedback control: perturbation studies
2.1 Involvement of the motor cortex in reflexive speech motor coordination
We examined the facilitatory effect of TMS on the reflexive compensatory response to jaw perturbation during production of the bilabial fricative consonant /F/. The subject's jaw was held in a jaw perturbation system by clamping it between a chin plate and a custom-built splint attached to the teeth (Fig. 1). This apparatus caused little disruption of normal speech and could apply a slight force in the jaw opening/closing direction. To prevent anticipation, a step-wise jaw-opening perturbation (3.0 N) was applied in only 20% of the trials. Each session (100 trials) comprised four conditions: PT (perturbation with TMS, 10 trials), PN (perturbation alone, 10 trials), NT (TMS alone, 10 trials), and control (70 trials). Three subjects (A–C) participated. The jaw perturbation elicited a quick downward shift of the upper lip, accompanied by an EMG response (black line in Fig. 2(a)) that served to maintain the labial aperture for producing /F/. The question is which neural mechanism generates this EMG response. To address it, we compared the EMG response latency caused by the jaw perturbation with the latency of a voluntary reaction. A typical OOS EMG response in the reaction task is shown in Fig. 2(b); the voluntary response started at around 300 ms after the perturbation. The mean reaction time for the three subjects (315.7±98.4 ms, mean ± standard deviation (SD)) was clearly longer than the latency of the reflexive compensatory response (48.25±1.2 ms), suggesting that the short-latency (< 100 ms) compensatory response was generated involuntarily.
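The session design can be made concrete with a short sketch. The Python snippet below assembles one 100-trial session with the stated condition counts and shuffles it, so that the jaw is perturbed (PT or PN) on 20% of trials. The uniform shuffle and the helper names `make_session` and `perturbation_rate` are assumptions for illustration; the text does not specify how trial order was randomized.

```python
import random
from collections import Counter

# Trial counts per 100-trial session, as stated in the text.
SESSION_COUNTS = {"PT": 10, "PN": 10, "NT": 10, "control": 70}

def make_session(seed=None):
    """Return a shuffled list of condition labels for one session.

    Hypothetical helper: a plain uniform shuffle is assumed; the original
    study may have constrained the order differently.
    """
    rng = random.Random(seed)
    trials = [cond for cond, n in SESSION_COUNTS.items() for _ in range(n)]
    rng.shuffle(trials)
    return trials

def perturbation_rate(trials):
    """Fraction of trials in which the jaw is perturbed (PT or PN)."""
    counts = Counter(trials)
    return (counts["PT"] + counts["PN"]) / len(trials)

if __name__ == "__main__":
    session = make_session(seed=1)
    print(len(session), perturbation_rate(session))  # 100 0.2
```

Because PT and PN are interleaved among 80 trials without perturbation, the subject cannot predict on which trial the jaw will be pulled.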
We applied TMS to examine the involvement of the motor cortex in generating the reflexive compensatory response. If the lip region of the motor cortex is involved in the reflexive compensatory response of the upper lip, TMS over the motor cortex should enhance the EMG activity of that response. The red line in Fig. 2(a) shows a typical EMG pattern observed when TMS was applied during jaw perturbation (PT). The first sharp peak, 75 ms after perturbation onset, was an artifact caused by current spread from the TMS. Compared with the response without TMS (PN), EMG activity began to increase 10 ms after TMS onset and remained elevated for roughly 10 ms (shaded area). To quantify the amplitude of the muscle response, the rectified EMG signal in a 10-ms window (10–20 ms after TMS onset) was temporally averaged and pooled for each condition (PT, PN, and NT). The background muscle activity level (BK) was quantified by temporally averaging the rectified EMG signal over 10–20 ms before perturbation (or stimulus) onset. Fig. 2(c) summarizes the response amplitudes in all cases (PT, PN, BK, and NT). TMS consistently enhanced the reflexive compensatory response in all subjects, as shown by the statistically significant difference between PT and PN, whereas muscle activity in NT was not significantly enhanced relative to BK for subjects A and B. For subject C, muscle activity in NT was slightly enhanced. However, the increase in PT (PT minus PN) was considerably larger than that in NT (NT minus BK), suggesting that the facilitatory effect was the primary determinant of the enhanced EMG activity in PT. In summary, these facilitatory effects suggest that a cortical pathway contributes significantly to producing the reflexive compensatory response.
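The amplitude measures described above amount to averaging the rectified EMG over a fixed window relative to an event. The sketch below assumes a uniformly sampled EMG trace, a known sampling rate `fs`, and event times in seconds. All function names are hypothetical, and for PN trials (no TMS) the 10–20 ms window is assumed to be aligned to the corresponding time point; the text does not state how this was done.

```python
from statistics import mean

def window_mean_rectified(emg, fs, event_s, start_ms, end_ms):
    """Mean rectified EMG over [event + start_ms, event + end_ms).

    emg: uniformly sampled EMG trace; fs: sampling rate in Hz;
    event_s: event time (TMS, perturbation, or stimulus onset) in seconds.
    Negative start_ms/end_ms select a window before the event.
    """
    i0 = round((event_s + start_ms / 1000.0) * fs)
    i1 = round((event_s + end_ms / 1000.0) * fs)
    if i0 < 0 or i1 > len(emg) or i1 <= i0:
        raise ValueError("window falls outside the recording")
    return mean(abs(v) for v in emg[i0:i1])

def response_amplitude(emg, fs, tms_s):
    """Response amplitude: mean rectified EMG 10-20 ms after TMS onset."""
    return window_mean_rectified(emg, fs, tms_s, 10, 20)

def background_level(emg, fs, onset_s):
    """BK: mean rectified EMG 10-20 ms before perturbation (or stimulus) onset."""
    return window_mean_rectified(emg, fs, onset_s, -20, -10)

def tms_related_increase(pt, pn, nt, bk):
    """Return (PT - PN, NT - BK) from pooled per-trial amplitudes.

    The first value is the TMS-related increase during perturbation, the
    second the increase produced by TMS alone. Informally, facilitation of
    the reflex is indicated when the first clearly exceeds the second.
    """
    return mean(pt) - mean(pn), mean(nt) - mean(bk)
```

Comparing the two differences, rather than PT alone against BK, separates facilitation of the reflexive response from EMG activity that TMS would evoke anyway.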
2.2 Auditory-induced rapid change in articulatory lip movement
We evaluated changes in articulatory lip movement that occurred when the timing and context of auditory feedback were suddenly altered while a subject repeatedly spoke the plosive-initial syllable /pa/. Fig. 3 shows a schematic diagram of the auditory feedback alteration system. The subject's speech was processed by a custom computer program that altered the input speech signal. The altered signal was mixed with background noise and fed back to the subject's ears via earphones; the background noise served to prevent subjects from hearing their own speech sounds while speaking. Subjects were asked to produce the isolated syllable /pa/ seven times at a constant speech rate. In each trial, the auditory feedback for the third repetition of /pa/ was altered by shifting its timing and/or replacing the syllable, and feedback for the subsequent repetitions was omitted. Pre-recorded tokens of /pa/, /Fa/, and /pi/ spoken by the same subject were used for syllable replacement. The timing shift was -150, -100, -50, 0, +50, +100, or +150 ms relative to the third repetition onset, which was predicted in each trial from the interval between the onsets of the first and second syllables. The three-dimensional motion of markers placed on the upper and lower lips was measured with an optical motion capture system, from which the aperture between the upper and lower lips (labial distance, LD) was obtained. Fig. 4 shows typical LD trajectories obtained during production of /pa/ at a speech rate of 300 ms per syllable. The red curve in each panel shows the mean LD trajectory for five trials over the test blocks, and the black curve shows the mean trajectory for ten trials in the control (normal feedback) block. The LD trajectories in all conditions were temporally aligned at the predicted third syllable onset with reference to the simultaneously recorded acoustic signals.
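The feedback timing follows from a simple extrapolation. Assuming a constant inter-syllable interval, the predicted third onset is t3 = t2 + (t2 - t1), where t1 and t2 are the first two syllable onsets, and the altered syllable is presented at t3 plus the chosen shift. The sketch below encodes this together with one possible LD definition. Treating LD as the three-dimensional Euclidean distance between the two lip markers is an assumption (the text states only that LD is the aperture between the lips), and all names are hypothetical.

```python
import math

# Timing shifts (ms) of the altered feedback relative to the predicted onset.
SHIFTS_MS = (-150, -100, -50, 0, 50, 100, 150)

def predicted_third_onset(t1, t2):
    """Extrapolate the third syllable onset (s) from the first two onsets,
    assuming a constant inter-syllable interval: t3 = t2 + (t2 - t1)."""
    return t2 + (t2 - t1)

def stimulus_onsets(t1, t2, shifts_ms=SHIFTS_MS):
    """Map each timing shift (ms) to the feedback stimulus onset (s)."""
    t3 = predicted_third_onset(t1, t2)
    return {s: t3 + s / 1000.0 for s in shifts_ms}

def labial_distance(upper, lower):
    """LD as the 3-D Euclidean distance between upper- and lower-lip
    marker positions (x, y, z); this definition is an assumption."""
    return math.dist(upper, lower)

if __name__ == "__main__":
    # Onsets 300 ms apart, matching the 300 ms/syllable example rate.
    onsets = stimulus_onsets(0.5, 0.8)
    print(round(onsets[-50], 3))                              # 1.05
    print(labial_distance((0.0, 3.0, 0.0), (0.0, 0.0, 4.0)))  # 5.0
```

Negative shifts require presenting the syllable before the subject actually produces it, which is why pre-recorded tokens and a predicted onset are needed rather than a real-time delay line.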
From top to bottom, the panels correspond to the following auditory feedback conditions: pre-recorded /pa/ presented at -150, -100, -50, 0, +50, +100, and +150 ms from the predicted third syllable onset. In each panel, the solid vertical line indicates the onset of the auditory stimulus, and the dotted vertical line indicates the predicted third syllable onset. Comparing the two trajectories in each panel shows that the mouth-opening movement following stimulus onset was hastened when the stimulus was presented at -50 ms. A similar hasty movement was also observed under the -150- and -100-ms conditions, but the effect appeared weaker. Under the delayed feedback conditions (+50, +100, and +150 ms), the deviation from the control trajectory was much smaller. Similar results were obtained for all ten subjects. For each subject (N = 10), we computed the lag that maximized the cross-correlation between the LD trajectories under the altered and control conditions, both within the post-stimulus period (dark shaded areas in Fig. 4) and within the pre-stimulus period (light shaded areas in Fig. 4); Fig. 5 shows the post-stimulus lag minus the pre-stimulus lag. Each pre- and post-stimulus period corresponds to a single cycle of lip closing/opening movement. A negative lag indicates an ahead-of-time shift of the movement relative to the control. The condition labeled normal compares the normal feedback trials in the test blocks with those in the control block and thus reflects the variability of each subject's baseline speech rate over the experiment. Differences from the normal condition were tested with two-sided paired t-tests (df = 9 for all comparisons, Bonferroni-corrected, p < 0.05). An ahead-of-time shift of the movement was found only when the auditory feedback of /pa/ preceded the actual syllable production by 50 ms.
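The lag measure and the statistics can be sketched as follows, assuming LD trajectories sampled at a common rate and mean-removed within each period. The peak of the raw cross-covariance is used here; normalizing by each segment's standard deviation would rescale all lags equally and leave the peak location unchanged. The sign convention matches the text: a negative lag means the altered-feedback trajectory leads the control. The paired t statistic and the Bonferroni rule are standard; p-values additionally require the t distribution (df = 9 here), which the Python standard library lacks, so in practice a library routine such as SciPy's `scipy.stats.ttest_rel` would supply them. Function names are hypothetical.

```python
import math
from statistics import mean, stdev

def xcorr_peak_lag(x, y, max_lag):
    """Return the lag k (samples), |k| <= max_lag, maximizing
    sum_n xc[n + k] * yc[n], where xc and yc are mean-removed x and y.

    A negative lag means x leads y, i.e. the altered-feedback movement
    occurs ahead of the control movement.
    """
    if len(x) != len(y):
        raise ValueError("segments must have equal length")
    n = len(x)
    if not 0 <= max_lag < n:
        raise ValueError("max_lag must be in [0, len(x))")
    mx, my = mean(x), mean(y)
    xc = [v - mx for v in x]
    yc = [v - my for v in y]
    best_lag, best_r = 0, -math.inf
    for k in range(-max_lag, max_lag + 1):
        # Indices i for which both xc[i + k] and yc[i] exist.
        lo, hi = max(0, -k), min(n, n - k)
        r = sum(xc[i + k] * yc[i] for i in range(lo, hi))
        if r > best_r:
            best_lag, best_r = k, r
    return best_lag

def lag_shift(altered, control, pre, post, max_lag):
    """Post-stimulus peak lag minus pre-stimulus peak lag (samples).

    pre and post are (start, stop) index pairs, each spanning one lip
    closing/opening cycle; subtracting the pre-stimulus lag removes any
    offset already present before the stimulus.
    """
    def seg(sig, span):
        return sig[span[0]:span[1]]
    lag_post = xcorr_peak_lag(seg(altered, post), seg(control, post), max_lag)
    lag_pre = xcorr_peak_lag(seg(altered, pre), seg(control, pre), max_lag)
    return lag_post - lag_pre

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched samples a, b."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    return mean(d) / (stdev(d) / math.sqrt(n)), n - 1

def bonferroni_significant(p_values, alpha=0.05):
    """Flag p-values below alpha divided by the number of comparisons."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]
```

With ten subjects, `paired_t` returns df = 9, matching the tests reported above; the number of comparisons passed to `bonferroni_significant` depends on how many conditions are compared against normal.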
Feedback presented much earlier (-150 and -100 ms) did not significantly affect the movement, nor did delayed feedback (+50, +100, and +150 ms). Syllables differing from that of the speech task (/Fa/ and /pi/) had no significant effect, even when fed back 50 ms before the actual syllable production. These results indicate that ahead-of-time and delayed auditory feedback affected articulatory lip movement in a time-asymmetric and context-specific manner during repetitive syllable production. They suggest a compensatory mechanism that maintains a constant speech rate by detecting errors between the internally predicted and the actually received auditory information associated with self-generated movement. The timing- and context-dependent effects of feedback alteration further suggest that this sensory error detection operates within a temporally asymmetric window in which the acoustic features of the syllable to be produced may be coded.
We investigated the contribution of sensory feedback to speech motor control by observing the involuntary responses induced by perturbed somatosensory and auditory information. TMS over the motor cortex facilitated the reflexive compensatory response of the lip muscles during labial speech production, which led us to suggest that the primary motor cortex is involved in generating this response. High-level cortical computation may contribute substantially to organizing the complex sensorimotor coordination among articulatory organs needed for robust speech. The articulatory lip movement quickened immediately when the auditory feedback preceded the expected timing by 50 ms. No such change was observed when the feedback was presented 100 ms or more ahead of the actual timing, when it was delayed, or when the feedback syllable was replaced by another syllable. These results suggest that errors between the internally predicted and actually received auditory information, detected within a temporally asymmetric window, contribute to compensation of inter-articulatory timing in the syllable repetition task. Our studies provide evidence that both somatosensory and auditory feedback arising from self-produced speech contribute to maintaining the correct temporal dynamics of articulatory jaw and lip movements.