The following retrospective summaries are presented by Martin Rothenberg to fill in some contextual factors that may not be evident in reading the publications themselves. To emphasize that the viewpoints are those of the writer, the summaries are written in the first person and not in the third person usually used for technical papers. References can be found in the papers themselves, which are available at www.rothenberg.org.
Note: These contextual summaries may be augmented or revised from time to time, as indicated by the current revision date in the title.
I. The Breath-Stream Dynamics of Simple-Released-Plosive Production.
Bibliotheca Phonetica VI, S. Karger AG, Basel, Switzerland (1968).
The monograph The Breath-Stream Dynamics of Simple-Released-Plosive Production
(BSD) is a slightly edited version of my doctoral dissertation from the University
of Michigan (Ann Arbor) with the same title, published in 1966. BSD presents
a physiologically based dynamic model for a broad class of stop consonants occurring
in spoken natural languages, in particular simple released plosives. This model
was proposed as an alternative to acoustic models that were widely discussed
in the preceding 15 or 20 years, after the introduction of the sound spectrogram
(frequency vs. time plot, with intensity indicated by darkness or other visual
feature at each frequency-time coordinate) greatly simplified measuring acoustic
parameters. The trend toward acoustic modeling was also encouraged by research
in speech synthesis based on acoustic parameters. The model described in BSD
may be considered an extension of the physiological modeling proposed by Stetson
and others in the late 1930s and early 40s.
The acoustic models current in the early 1960s described stop consonants largely in terms of features visible on a sound spectrogram, such as presence or absence of a "voice bar" during a presumed occlusion interval and a "voice onset time" (duration of the interval from the acoustic impulse presumed to mark the release instant to the onset of quasi-periodic acoustic energy, if present).
The model presented in BSD, on the other hand, describes stop consonants in terms of physiological gestures. Mathematical models are described that relate these gestures to the aerodynamic patterns that determine the acoustic energy produced. It is assumed that for two stops to be phonemically distinct in a particular language dialect, they must consistently result in perceptually distinct acoustic patterns. However, it is also postulated that the most natural and explanatory characterization is in terms of the underlying gestures and not the resulting acoustic parameters. Thus, a physiological model should identify just those degrees of freedom of the speech production apparatus that are consistently producible and result in acoustic patterns perceptually distinct enough to be used linguistically. This also implies that a production differentiation allowed by the model should be learnable by almost all speakers of a language dialect.
Under the model, gestures are differentiated as being either unidirectional (from one state to another different state, as closed to open) or cyclic (returning to the approximate original state), since the aerodynamic consequences could be different for each class. According to the model, the primary gestures determining the classification of a particular production may occur at (1) the thoracic/abdominal level (to produce or modify the subglottal air pressure), (2) the laryngeal level (resulting primarily in an abduction or adduction of the vocal folds), (3) the velar level (to seal/unseal the velopharyngeal port in order to allow or prevent a buildup of air pressure above the vocal folds), and (4) the articulatory level (resulting in a momentary occlusion of the vocal tract at some point).
In the model presented, such gestures can be either ballistic (maximally fast, as limited by both tissue mass and muscle contraction times) or controlled. Wherever possible, the dynamic constraints determining the speed of ballistic unidirectional and cyclic gestures at each of the above levels were estimated from the physiological literature at that time or measured experimentally.
Importantly, the various gestures can vary in their timing with respect to one another to determine the phonetic category of the consonant produced. For example, an unvoiced stop produced in an intervocalic position is invariably produced with a cyclic opening (abductory) laryngeal gesture, from voiced to open (or breathy-voiced) to voiced. In English phonology, this laryngeal gesture is normally produced with a timing that synchronizes the initial phase of the glottal opening movement with the onset of the contact phase of the closing articulatory gesture. Because of the dynamic constraints inherent in the cyclic laryngeal gesture, there is generally a period following the onset of the release during which the articulators are open but the glottis has not yet closed - the period of aspiration. However, if the laryngeal gesture is made earlier with respect to the period of articulatory closure, the result is a "pre-aspirated" stop, such as those reported in dialects of Icelandic. If the laryngeal gesture is made later with respect to the articulatory closure, the result is a "voiced-aspirated" stop, such as found in some languages spoken in India.
In addition to the four primary gestures described above, there are a number of other gestures that can modify the consonant produced sufficiently to be phonetically significant in some language dialects. Those additional gestures mentioned in BSD are a gesture of medial compression of the vocal folds and a gesture of downward vertical laryngeal movement. Some evidence is presented that a gesture of medial vocal fold compression, which would suppress vocal fold vibration in the period of occlusion, is used in certain stop consonants in Korean termed "unvoiced-tense" in the literature. Computations are also presented to show that a purposeful downward movement of the larynx could modify the supraglottal air pressure during the period of occlusion by a piston action, so as to significantly and perceptibly augment the strength and duration of voicing during the occlusion of a voiced stop. (A gesture raising the larynx is known to be used in certain languages to increase the supraglottal pressure in a class of stop consonants termed "ejectives". However, ejectives were not among the class of stop consonants considered in BSD.)
In summary, BSD explores the hypothesis that the range of simple released plosives
to be found in spoken natural languages can be explained and understood by characterizing
such consonants in terms of a set of closely synchronized physiological gestures.
II. The Glottal Volume Velocity Waveform during Loose and Tight Voiced Glottal Adjustments.
Proceedings of the Seventh International Congress of Phonetic Sciences, held at the University of Montreal and McGill University, 22-28 August 1971; edited by André Rigault and René Charbonneau, Mouton, The Hague - Paris. (1972)
This paper presented some of the first experimental results using the circumferentially vented (CV) pneumotachograph mask that I developed for low distortion speech airflow measurements and the inverse filtering of oral airflow. In the monograph BSD (see above), the significance of laryngeal abductory gestures in speech production was explored in some detail. However, the role of laryngeal adductory gestures was treated only cursorily. In this paper, using the new CV mask, I derive estimates for the dynamic constraints in both abductory and adductory gestures and present examples illustrating how such gestures can be linguistically significant (independent of the stop consonant context assumed in BSD). In effect, the purpose of this paper was to use some of the new tools developed at the Syracuse University Speech Research Laboratory to fill some of the gaps left open in BSD. [A more complete description of the theory and practice of the inverse filtering of oral airflow and measurements of its validity were submitted the same year to the Journal of the Acoustical Society of America and published in 1973. See III below.]
An important theoretical result that can be derived from the examples given in the paper is the conclusion that an adductory gesture can lead to at least three types of vocal fold vibratory behavior and their acoustic correlates, namely a cessation of vibration, a sharp reduction in the frequency of the vibration with possibly irregular intervals, or a bistable vibratory pattern, depending on factors not well under the speaker's control. In addition, an adductory gesture of minimal extent, as in rapid speech, may be evidenced only by a reduction of the amplitude of spectral components at or near the fundamental frequency and not by any of the above behaviors.
The often-noted occurrence of a cessation of vocal fold vibration (the first possible vocal fold behavior noted above) has led to the classification of an adductory gesture during voiced speech with the vocal tract not occluded as a "glottal stop", even though no actual cessation may occur in many instances. Likewise, though an abductory gesture, as in the English /h/, can often result in a cessation of voicing and acoustic turbulence, it is shown in the paper that a brief abductory gesture may be voiced throughout, with the identifying acoustic characteristic being a change in the spectral characteristics of the glottal airflow waveform. The invariant is not so much the acoustic result as the presence of the adductory or abductory gesture. These results can be taken as further support for the hypothesis presented in BSD that the natural linguistic classification of consonants should depend on the underlying gestures and not on the particular acoustic manifestation.
III. A New Inverse-Filtering Technique for Deriving the Glottal Airflow Waveform during Voicing.
J. Acoustical Soc. of Amer. 53, 1632-1645, June 1973.
This paper represented a radical change in the direction of my research from
the direction in BSD and the above paper. The change was from linguistic models
at the level of phonetics and phonology to a study of the human voice source.
The change resulted directly from my success in almost inadvertently developing
a new tool for voice research, the circumferentially vented (CV) mask (and a
little from my curiosity as to why I was such a poor singer).
The CV mask was originally developed for research on consonants, especially for recording the gross changes in airflow resulting from vocal fold abductory and adductory movements (see II above), including the post-release aspiration airflow. For these purposes, a response time of 2 or 3 ms would have been adequate, since the movements in question took place in at least 10 times that duration. After testing a number of approaches, it was decided to adapt the wire-screen pneumotachograph mask used for respiratory measurements. In addition to the response time requirements, the primary design criteria were low speech distortion and acceptably low back-pressure caused by the mask's flow resistance.
The standard respiratory masks having hard walls and flow measured at a single centrally located outlet, by acting as an acoustic extension of the vocal tract, strongly distorted the formant structure when used during speech or singing. (Though at the time this is written they are still marketed by at least one manufacturer for speech measurements. See the Glottal Enterprises website for sound samples from various masks.) However, after a series of improvements in mask design and careful measurements of speech distortion, response time and response linearity, it was found that a response time of as little as ½ ms could be attained, with a distortion and muffling of the speech that was acceptable for most speech research. (Later improvements brought the response time down closer to ¼ ms.)
The response time requirement was met by finding a differential pressure measurement system with a faster response time than the transducers used for respiratory work that would still be sensitive enough to measure pressures less than a tenth of the subglottal pressure (to keep mask back-pressure low). In addition, to keep the response time low, the sensors had to be coupled to the mask without the tubing used in respiratory applications. The reduction in speech distortion was accomplished by reconfiguring the mask so as to vent the mask via wire screen distributed around its circumference, much closer to the mouth, instead of at one centrally located outlet.
In making speech airflow measurements with these new masks, it became apparent that the masks were able to resolve at least coarsely the airflow variation within each glottal cycle, since the response time attained was usually less than 1/10 of the vocal fold vibratory period for both male and lower pitched female speaking voices. Thus the temptation arose to turn to looking at the variation of airflow within the glottal cycle, and from there it was only a short step to attempt inverse filtering of the oral airflow to obtain the waveform of the airflow through the glottis.
Inverse filtering of a microphone signal had been performed previously and showed the approximate glottal waveform, but now we were able to add a flow scale and track how the waveform varied with the average airflow during an ab/adductory gesture. These measurements showed that the waveform and resulting spectrum varied greatly during abduction. These results were used later during a stay at the Royal Institute of Technology, where, working with Rolf Carlson, Bjorn Granström and Jan Lindqvist-Gauffin, the first speech synthesizer to vary its source spectrum in a natural manner during an abductory or adductory gesture was implemented. This resulted in more natural unvoiced consonants and glottal stops. (Previous synthesizers included formant transitions and fading in of noise to simulate aspiration, but did not include the important transitions in source spectrum.) One other important (to us) outcome was that, because an ab/adductory tremor is such an important part of laughing, we developed the first speech synthesizer that could laugh! (The method we used is described in our paper A Three Parameter Model for the Glottal Source, in Speech Communication 2, Almqvist and Wiksell, Stockholm, 235-243, 1975.)
Extensive experience in inverse filtering also made clear that there was a characteristic shape to the glottal airflow waveform during non-breathy voicing that was independent of the shape of the glottal area waveform. This shape of a typical glottal airflow waveform was characterized by a gradual flow onset and a sharp cessation of flow that generated much more second and third formant energy than predicted by the then-current glottal flow models published by Flanagan. This shaping of the glottal flow pulse was discussed in the paper, and the possibility of source-tract acoustic interaction was included as a causative factor (in addition to adding first formant oscillations to the waveform as some of the formant energy was absorbed by the glottis during the open phase of the cycle).
Another novel feature of this paper is the method used to estimate subglottal pressure, developed after exploring the difficulties in implementing more intrusive methods. The aerodynamic model presented in BSD makes clear that if the glottis is not sealed and there is a complete supraglottal articulatory closure, the pressure behind the closure, the intraoral pressure, will rise to approximate the subglottal pressure in just a few milliseconds, at most. Thus if the consonant /p/ is pronounced between two vowels, to assure an open glottis, the intraoral pressure during the closure for the /p/ could be taken as a fairly accurate measure of subglottal pressure during the adjoining vowels. A sequence of repeated syllables /b/ vowel /p/ was used for the subglottal pressure measurements in the paper and the speaking rate kept high enough to prevent a purposeful variation of subglottal pressure with each syllable. (Though other researchers have subsequently used syllables /p/ vowel /p/, the syllable /b/ vowel /p/ is preferable because there is no decrease in subglottal pressure during the release, as would be caused by aspiration in the initial /p/.) This method is now commonly used for voice measurements.
A subsequent letter published in the Journal of Speech and Hearing Disorders
(No. 47, 218-224, 1982) cautions that, for the method to be effective, a syllable
rate of at least 2 per second should be used. This keeps the subglottal pressure
relatively constant during each syllable by discouraging the use of a pulse
of subglottal pressure for each syllable. In addition, the system recording intraoral
pressure should have a response time of no more than 30 ms. There are also a
number of contexts that can be used to discourage the use of a pulse of subglottal
pressure for each syllable. One good method is to use a sequence of four or
five syllables, with a stress on the last syllable. The first and last syllables
are then eliminated from the measurements.
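The measurement procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the idea of taking one intraoral pressure reading per /p/ closure, and the numerical readings are all hypothetical and are not taken from the paper or the letter.

```python
from statistics import mean

MIN_SYLLABLE_RATE_HZ = 2.0  # minimum syllable rate recommended in the text


def estimate_subglottal_pressure(closure_pressures, sequence_duration_s):
    """Average the intraoral pressure over the /p/ closures of the inner syllables.

    closure_pressures: one intraoral pressure reading per syllable, taken
        during the /p/ closure, in any consistent unit (hypothetical input).
    sequence_duration_s: duration of the whole syllable sequence in seconds.
    """
    n_syllables = len(closure_pressures)
    if n_syllables < 3:
        raise ValueError("need at least three syllables to drop the first and last")
    rate = n_syllables / sequence_duration_s
    if rate < MIN_SYLLABLE_RATE_HZ:
        raise ValueError(
            f"syllable rate {rate:.2f}/s is below {MIN_SYLLABLE_RATE_HZ}/s"
        )
    # The first and last syllables are eliminated from the measurement.
    return mean(closure_pressures[1:-1])


# Hypothetical readings (cm H2O) for five repetitions of /b/ vowel /p/ in 2 s.
readings = [7.6, 7.1, 7.0, 6.9, 8.2]
estimate = estimate_subglottal_pressure(readings, sequence_duration_s=2.0)
print(f"estimated subglottal pressure: {estimate:.1f} cm H2O")
```

The rate check simply rejects sequences spoken too slowly. It does not test the 30 ms response time of the pressure recording system, which is a property of the hardware.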
IV. (with S. Zahorian) Nonlinear Inverse Filtering Technique for Estimating the Glottal Area Waveform.
J. Acoust. Soc. of Amer., Vol. 61, pp. 1063-1071 (1977).
V. Acoustic Interaction Between the Glottal Source and the Vocal Tract. In Vocal Fold Physiology.
K.N. Stevens and M. Hirano, Eds., Univ. of Tokyo Press, 305-328 (1980).
The paper Nonlinear Inverse Filtering Technique for Estimating the Glottal
Area Waveform (NIF) was essentially an effort to bridge the gap between the
glottal airflow waveform, as obtained by standard inverse filter techniques,
and the waveform of the glottal area. This gap can consist of at least two factors,
namely, oscillations in the waveform at the frequency of the first formant and
a tilt to the right of the pulse of glottal airflow that results in a slowing
of the increase in airflow and a more abrupt closing phase. Though this asymmetry
in the pulse of glottal airflow had been long noted in the literature, it had
never been reported in glottal area waveforms. It follows from the principles
of Fourier analysis that this abrupt termination of the airflow pulse creates
most of the acoustic energy in the voice at the frequencies of the second and
higher formants. Thus it is most pronounced in acoustically strong voices. (A
related contributory factor is that in non-breathy voice the resulting pressure
pulse generated by the glottal closing is followed by a period of glottal closure,
or near closure, so that little of the energy produced is absorbed by the glottis,
especially if the closed quotient is high.)
An inverse filter has a transfer characteristic the inverse of that of the vocal tract with the glottis closed. This means that when the glottis is not closed the inverse filter is not tuned to the actual resonances of the vocal tract, and so there will not be a complete cancellation of the instantaneous formant energy. Thus, for example, during the open phase of the glottal cycle there can be some formant oscillation visible on the properly inverse filtered waveform. This makes sense acoustically, since this energy can be considered the supraglottal energy generated during the glottal closed phase being absorbed by the open glottis, and would undoubtedly be found at the glottis if one could actually measure the airflow at that point.
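The idea of an inverse filter can be illustrated with a discrete-time sketch for a single formant. The digital formulation, the sampling rate, the formant frequency and bandwidth, and the half-sine source pulses are all assumptions made for this example; they are not a description of the filters actually used in the work discussed here. A two-pole resonance stands in for the vocal tract with the glottis closed, and the matching two-zero anti-resonance cancels it.

```python
import numpy as np


def formant_coefficients(freq_hz, bandwidth_hz, fs_hz):
    """Return (a1, a2) for a two-pole digital resonance at the given formant."""
    r = np.exp(-np.pi * bandwidth_hz / fs_hz)  # pole radius from bandwidth
    theta = 2.0 * np.pi * freq_hz / fs_hz      # pole angle from frequency
    return 2.0 * r * np.cos(theta), r * r


def resonator(x, a1, a2):
    """All-pole formant: y[n] = x[n] + a1*y[n-1] - a2*y[n-2]."""
    y = np.zeros(len(x))
    for n in range(len(x)):
        y1 = y[n - 1] if n >= 1 else 0.0
        y2 = y[n - 2] if n >= 2 else 0.0
        y[n] = x[n] + a1 * y1 - a2 * y2
    return y


def inverse_filter(y, a1, a2):
    """All-zero anti-resonance: x[n] = y[n] - a1*y[n-1] + a2*y[n-2]."""
    x = y.copy()
    x[1:] -= a1 * y[:-1]
    x[2:] += a2 * y[:-2]
    return x


fs = 10_000.0                      # assumed sampling rate in Hz
n = np.arange(400)
phase = n % 100                    # 100 samples per period, i.e. 100 Hz
source = np.where(phase < 50, np.sin(np.pi * phase / 50.0), 0.0)

a1, a2 = formant_coefficients(500.0, 60.0, fs)   # assumed formant values
oral = resonator(source, a1, a2)

recovered = inverse_filter(oral, a1, a2)
mistuned = inverse_filter(oral, *formant_coefficients(600.0, 60.0, fs))

print("max error, tuned filter:   ", np.max(np.abs(recovered - source)))
print("max error, mistuned filter:", np.max(np.abs(mistuned - source)))
```

When the anti-resonance matches the resonance, the source pulses are recovered. When it is mistuned, residual formant oscillation remains on the output, which is the discrete-time counterpart of the incomplete cancellation described above for the open phase.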
In NIF, an inverse filter is implemented in which the formant frequencies and damping coefficients are varied dynamically during the glottal cycle to roughly track the actual instantaneous vocal tract resonances. In this way it was hoped to obtain a waveform that more closely represents the glottal area. Some success was reported, but the resulting waveform, though less asymmetrical than the normal inverse filtered airflow, still showed an asymmetry, with a rapid closing phase that we considered not likely to be present in glottal area.
In what, in retrospect, may have been the most important contribution of NIF, the possibility was explored that the asymmetry of the glottal airflow pulse is caused by an acoustic interaction between the variation in glottal area and the inertance of the column of air in the vocal tract, especially that part of the tract closest to the glottis. This would mean that previously proposed differential equations relating area and airflow at the glottis were greatly in error, since this potential factor was not modeled. To support this hypothesis, a simple electrical analog was implemented consisting of a time-varying resistance (simulating the time-varying glottal flow resistance related to the instantaneous glottal area) and an inductor (simulating the vocal tract inertance). It was observed that the resulting electrical current (the glottal airflow) had a waveshape much more like glottal airflow pulses obtained by inverse filtering that we had observed over the years for many speakers in modal register.
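A numerical counterpart of such a resistance-plus-inductor analog can be sketched as follows. The equation used, L dU/dt + U/g(t) = P, with U the flow, g(t) a time-varying conductance standing in for glottal area, L the inertance and P a constant driving pressure, is one simple way to write a series circuit of this kind. It is assumed here for illustration and is not quoted from NIF or AI. All quantities are in normalized units, the triangular conductance pulse is hypothetical, and a backward Euler step is used so that the closed glottis (g = 0) is handled without division by zero.

```python
import numpy as np


def glottal_flow(conductance, pressure, inertance, dt):
    """Integrate L*dU/dt + U/g(t) = P with a backward Euler step.

    The update U[n] = g*(L*U[n-1] + dt*P) / (L*g + dt) is the backward
    Euler step rearranged so that g = 0 (closed glottis) gives U = 0.
    """
    flow = np.zeros(len(conductance))
    for n in range(1, len(conductance)):
        g = conductance[n]
        flow[n] = g * (inertance * flow[n - 1] + dt * pressure) / (inertance * g + dt)
    return flow


dt = 0.01                               # time step (normalized, think of ms)
t = np.arange(0.0, 8.0, dt)
t_open = 4.0                            # duration of the open phase
# Symmetric triangular conductance pulse, zero while the glottis is closed.
g = np.clip(1.0 - np.abs(t - t_open / 2.0) / (t_open / 2.0), 0.0, None)

flow_plain = glottal_flow(g, pressure=1.0, inertance=0.0, dt=dt)
flow_inert = glottal_flow(g, pressure=1.0, inertance=1.0, dt=dt)

print("conductance peak at t =", t[np.argmax(g)])
print("flow peak with inertance at t =", t[np.argmax(flow_inert)])
print("steepest rise:", np.max(np.diff(flow_inert)) / dt)
print("steepest fall:", -np.min(np.diff(flow_inert)) / dt)
```

With zero inertance the flow simply follows the conductance. With inertance added, the flow peak falls later than the conductance peak and the closing slope is steeper than the opening slope, so a symmetric conductance pulse produces a flow pulse tilted to the right.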
However, the results of an electrical simulation would not be as satisfactory as a mathematical expression representing the solution to the differential equation characterizing the posited source-tract interaction, if such an expression can be found. The solution to a differential equation representing the source-tract acoustic interaction was presented at a conference in January of 1980 in a paper entitled Acoustic Interaction Between the Glottal Source and the Vocal Tract (AI) and published in the proceedings of the conference. The solution to the differential equation, when plotted for various values of vocal tract inertance, pulse duty cycle and simulated breathiness (incomplete vocal fold closure), exhibited even more closely than did the electrical simulation the characteristics observed in inverse filtered airflow waveforms for numerous speakers in the modal or normal speaking voice register.
In both the NIF and AI papers, attempts were made to estimate the crucial parameter representing vocal tract inertance in the differential equation from approximate glottal and vocal tract dimensions. However, these efforts yielded values of inertance too low to fully explain the interaction seen in the waveforms of strong voices. In addition, this parameter was not easily estimated from acoustic measurements made at the lips, such as vocal tract formants. This conclusion is not difficult to visualize. The effective inertia of the flow in a channel stems from the velocity of the flow and therefore increases with a decrease in the diameter (and increase in length) of the channel. Thus the most obvious source of the additional inertance required to produce source-tract interaction to the degree seen in strong voices would be a constriction immediately above (or theoretically within or below) the glottis. A vocal tract constriction immediately above the glottis (or a component coming from a constricted air jet as airflow exits the glottis; see the Note below) would thus increase the flow inertance seen by the glottis and therefore the source-tract interaction. However, a constriction immediately above a closed or almost closed glottis would have little effect on parameters of vocal tract acoustics that can be measured at the lips, such as the formant frequencies, since there is little oscillatory airflow near the high glottal impedance. (Constrictions closer to the lips, on the other hand, have a great effect on the formant frequencies.) Though there have been some simulation studies of the aerodynamics near the glottis, this important factor in voice quality, and how it varies between speakers, has not yet been adequately tied down.
Note: The dynamic properties of a jet of air near the glottis, that is, the reaction of the jet to a rapid change of airflow or pressure, are not considered in linear acoustics. Mathematical analysis of jets, eddies, turbulence and other nonlinear phenomena is much more complex than the analysis of linear acoustic waves used, for example, to compute formants.
It may be interesting to those familiar with automobile ignition systems that the principle that a rapid change of airflow through an inertance can generate a strong pressure peak has an analog in ignition systems. The differential equation is similar to that proposed for the voice. To produce the high voltage peaks needed for the engine spark plugs, the flow of electrical current through an inductor (sometimes referred to as the "spark coil" or just the coil) is interrupted each time a high voltage pulse is needed by a spark plug. Until electronic ignition was introduced, this interruption was by means of mechanical switching (separating the "points" in the distributor), though now the high voltage pulse generation is done with semiconductor electronics. In the vocal tract, the analogy to separating the points to stop the electrical current is the closing of the glottis to stop the airflow. Thus, a strong voice in the modal register can be thought of as creating 100 to 200 acoustic 'sparks' per second.
I consider the discovery of the principle of glottal source interaction with the inertive component of the vocal tract impedance, as detailed in NIF (with Steven Zahorian) and AI, to be my most important single contribution to voice research. It contradicted previous mathematical models of source-tract interaction and added an important component to concepts in linear modeling, which partially explained the acoustically strong modal or chest voice only in terms of resonances in the vocal tract, such as the so-called singer's formant, and the duty cycle of the vibratory pattern, sometimes referred to in terms of a closed quotient or open quotient.
Though the tools developed in the process of discovery, such as the CV mask and airflow inverse filtering, may have a value in their own right, as for clinical or linguistic measurements, for me it was the theoretical advance they brought about that was of greatest value. After using these tools for many years, I came to recognize that there was a pattern in the inverse filtered waveforms that was relatively independent of the shape of the glottal area waveform. I looked for a system characteristic that explained this pattern. I found this patterning analogous to the well known exponentially damped sinusoidal response of a linear system to a perturbation of rather arbitrary waveform. In linear system theory or the study of linear differential equations, the exponentially damped sinusoids are referred to as comprising the natural response of the system. This analogy suggests that the theoretical waveforms shown in AI should be considered the natural response of the glottal-supraglottal system, when there is a glottal open period followed by a significant period of glottal closure and a high inertive component to the acoustic vocal tract impedance at the glottis.
VI. Measurement of Airflow in Speech.
J. of Speech and Hear. Res. 20, 155-166 (1977).
The paper Measurement of Airflow in Speech (MAS) documents certain advances
in the development of CV mask technology for measuring airflow in speech and
in applications to voice and speech research. Most significantly, it documents
that the response time of the CV mask had been reduced from the ½ ms
reported previously to approximately ¼ ms. When a CV mask is used to
estimate the glottal airflow waveform by inverse filtering, this reduction of
response time approximately doubles the range of F0 (voice fundamental frequency)
over which it can be used. Assuming that a good representation of the glottal
waveform requires a system response time of no more than 1/20 of the glottal
period, and a reasonable representation requires a response time of no more
than 1/10 of the period, a mask with a response time of ¼ ms could be
used for a good representation at values of F0 up to 200 Hz and at values of
F0 up to 400 Hz for a reasonable representation. [In later work with the soprano
singing voice, a smaller mask covering only the mouth was used to reduce the
response time even further, to increase the F0 range.]
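The arithmetic behind these limits can be checked with a short sketch. If the response time must be no more than a given fraction of the glottal period, the highest usable F0 is that fraction divided by the response time. The function name is hypothetical and is used only for this illustration.

```python
def max_f0_hz(response_time_s, fraction_of_period):
    """Highest F0 at which the response time is at most the given fraction of the period.

    Requiring response_time <= fraction / F0 gives F0 <= fraction / response_time.
    """
    return fraction_of_period / response_time_s


for label, response_time_s in (("1/2 ms", 0.5e-3), ("1/4 ms", 0.25e-3)):
    good = max_f0_hz(response_time_s, 1 / 20)        # "good" representation
    reasonable = max_f0_hz(response_time_s, 1 / 10)  # "reasonable" representation
    print(f"{label}: good up to {good:.0f} Hz, reasonable up to {reasonable:.0f} Hz")
```

Halving the response time from ½ ms to ¼ ms doubles both limits, which is the doubling of the usable F0 range mentioned above.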
One difficulty with CV mask design at higher values of F0 was the necessity of recording the differential pressure across the wire screen. This is the same as saying that the sound pressure immediately outside the screen must be subtracted from the pressure inside the mask. It is shown in MAS that in the absence of a true differential pressure signal, a relatively simple correction can be made to the inside pressure to emulate the subtraction of outside pressure.
It was also shown theoretically and experimentally that in applications in which only a rough approximation of the shape of the glottal airflow waveform is required, and an open vowel is used, an appropriately designed low-pass filter can often substitute for an inverse filter tuned to the specific vocal tract resonances.
Among other applications explored in MAS, it was shown that the CV mask could
portray accurately the airflow pattern in the period of aspiration of an unvoiced
released stop, and, in conjunction with a simultaneous measurement of the intraoral
pressure waveform, the variation of the conductance of the articulatory constriction
as the constriction opens. This 'conductance' (using a linear system term advisedly)
describes quite well the patterning of separation during the aspiration interval.
[Note that this was the type of application that the CV mask was designed for
many years previously, before I moved to voice research.]
Also shown in the paper is that the mask signal yields a voice representation very resistant to ambient noise and from which F0 traces could be derived more accurately and reliably than from the radiated acoustic pressure (microphone) signal.
VII. The Voice Source in Singing.
in Research Aspects of Singing, publications issued by the Royal Swedish Academy of Music, no. 13, 13-33 (1981).
The Voice Source in Singing (VSS) is one of a number of papers published during and just after a year I spent at the Speech Transmission Laboratory at the Swedish Royal Institute of Technology, where, with a number of colleagues there, I explored the ramifications of the new view of the voice source in speech and singing afforded by the interactive model of the voice presented in previous papers. VSS was a tutorial presentation of the new theory in which the implications for the professional singing voice were explored.
VIII. An Interactive Model for the Voice Source.
In Vocal Fold Physiology: Contemporary Research and Clinical Issues, D.M. Bless and J.H. Abbs, eds., College Hill Press, San Diego, 155-165 (1983). (Proceedings of the Vocal Fold Physiology Conference, Univ. of Wisconsin -Madison, May 31-June 4, 1981.)
This paper (IMVS) explores a number of issues concerning the previously proposed interactive source-tract model, such as where in the vocal tract the inertive component of the vocal tract impedance must be to explain the glottal flow waveforms obtained by inverse filtering, the effect on the glottal flow waveform of the glottal conductance being flow dependent (and not dependent only on glottal dimensions) over some portion of the vocal fold vibratory cycle, and certain variations in the vibratory pattern of the vocal folds. To provide a basis for investigating the important question of why some individuals are blessed with an acoustically strong voice, and others (including me) are not, a parametric model is proposed that relates the vocal fold vibratory pattern to an inertive component of the vocal tract impedance at the larynx. As for the answer to the question of the source of the strong voice, my best guess was, and still is, that the stronger than average source-tract interaction that is associated with a strong voice stems from a higher than average acoustic inertance component immediately above the glottis, combined with a more abrupt glottal closing phase. The more abrupt glottal closing phase appears to be associated with a parallel vocal fold geometry during the closing phase, as compared to a more gradual or 'zipper-like' closing. This conclusion has been reinforced for me by a comparison of electroglottograph waveforms for speakers with weaker and stronger voices. (I began to use the electroglottograph shortly before my stay in Sweden, using the version designed by Adrian Fourcin, which reduced the noise enough compared to previous units to allow reliable recording from most subjects. After my stay in Sweden, the use of electroglottography to augment airflow inverse filtering became a regular part of our research protocol.)
IX. Source-Tract Acoustic Interaction and Voice Quality.
In the Transcripts of the Twelfth Symposium: Care of the Professional Voice, The Juilliard School, New York City, June 6-10, 1983.
In view of the potential importance of the postulated interaction of the pattern of variation in glottal conductance in voicing with an inertive component of the vocal tract impedance, it was decided to attempt to validate the theory by varying the vocal tract inertance and observing the inverse filtered airflow to look for a resultant change in the flow waveform. A helium-oxygen mixture was used to reduce the inertive component of the vocal tract impedance. The resulting waveform changes were those that would be predicted by the theory.
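As an illustrative aside, not taken from the paper: the lumped acoustic inertance of a short tube is commonly approximated as gas density times length divided by cross-sectional area, so replacing air with a lighter gas lowers the inertance in proportion to the density. The sketch below assumes illustrative densities and tube dimensions; the mixture and dimensions used in the study are not given here, and the function name and constants are hypothetical.

```python
def acoustic_inertance(density_kg_m3, length_m, area_m2):
    """Lumped inertance of a short tube: density * length / area (kg/m^4)."""
    return density_kg_m3 * length_m / area_m2

# Illustrative values only; none of these numbers come from the paper.
RHO_AIR = 1.2      # kg/m^3, approximate density of air near room temperature
RHO_HELIOX = 0.4   # kg/m^3, roughly an 80/20 helium-oxygen mix (assumed)
LENGTH = 0.02      # m, hypothetical length of the constricted section
AREA = 1.0e-4      # m^2, hypothetical cross-sectional area

inertance_air = acoustic_inertance(RHO_AIR, LENGTH, AREA)
inertance_heliox = acoustic_inertance(RHO_HELIOX, LENGTH, AREA)

# With fixed geometry the ratio depends only on the gas densities.
ratio = inertance_heliox / inertance_air
print(f"inertance with helium-oxygen / inertance with air = {ratio:.2f}")
```

Under these assumed densities the inertance drops to about one third of its value in air, whatever the geometry, which is why a gas substitution is a convenient way to vary inertance without changing articulation.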
X. Source-Tract Acoustic Interaction in Breathy Voice.
In Vocal Fold Physiology: Biomechanics, Acoustics and Phonatory Control, I.R. Titze and R.C. Scherer, eds., The Denver Center for the Performing Arts, Denver, CO, 465-481 (1984).
This paper (BV) would turn out to be my last in the series exploring the acoustic interaction between the glottal valving of the breath-stream and the inertance of the vocal tract airflow in or near the glottis. The research reported set out to study how this interaction affects breathy voice (vocalization with partially abducted vocal folds), but led me to unexpected results concerning models for source-tract interaction and ways of measuring the parameters of such models. The research stemmed from theoretical and experimental work performed in Stockholm in 1980, especially a chart recording showing simultaneous traces of inverse filtered glottal airflow, EGG, and a photoglottograph during a sentence containing a segment of breathy voice and vowels easy to inverse filter. (My recollection is that this recording was made in cooperation with Peter Kitzing of the University of Lund, who was kind enough to introduce me to the art of photoglottography.) The photoglottograph provides a trace roughly proportional to glottal area from light passed through the glottis.
The general conclusion for breathy voicing is that the acoustic inertance near the glottis acts as a low pass filter for the glottal source function, reducing the proportion of energy at the second and higher formant frequencies, as compared to the increase in the proportion of higher formant energy in non-breathy voicing. This effect would be perceptually important, since it would emphasize the transitions in source spectra in voiced-to-breathy-voiced (or breathy-voiced-to-voiced) transitions, and should be considered a potential factor in speech recognition and high quality speech synthesis. Such transitions occur whenever unvoiced consonants adjoin a voiced phoneme.
A second conclusion in BV is that simultaneous inverse filtered airflow and EGG waveforms during breathy voice can be used to work backward from measures of the low pass filter effect of the source-tract interaction to an estimate of the value of acoustic inertance that would cause that interaction. This procedure could be a potentially important tool for exploring the sources of the strong difference between an acoustically strong voice and an acoustically weak voice. Its importance stems from the fact that acoustic inertance appears to originate in a location in the pharynx difficult to access for direct measurements. However, to my knowledge voice research has not exploited this tool subsequent to the publication of BV.
Finally, establishing a quantitative relationship between the low-pass effect
described above and the value of vocal tract inertance required a rather elaborate
mathematical exploration of the various proposed relationships between glottal
airflow and transglottal pressure. A simple linear pressure-flow relationship,
resulting in a well-defined glottal conductance that varied with glottal area,
was used in my original papers to establish the principle of inertance-related
source-tract interaction. However, life in the glottis is not that simple. Fant
and van den Berg have both proposed flow-dependent (or pressure-dependent) glottal
conductance models that yield somewhat different degrees of inertance-related
source-tract interaction than the simple flow-independent model. These various
models and their implications for the study of breathy voice are discussed at
length in the paper.
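To make the idea of working backward to an inertance value concrete, here is a first-order sketch. It is not the procedure in BV. It assumes the simple linear, flow-independent model mentioned above, with a constant glottal resistance R (the partially abducted folds never close) in series with an inertance L, so that the flow response is a first-order low pass with corner frequency f_c = R / (2*pi*L). The function names and all numerical values are hypothetical.

```python
import math

def inertance_from_corner(resistance, corner_hz):
    """Invert f_c = R / (2*pi*L) to get L for a series R-L low pass."""
    return resistance / (2.0 * math.pi * corner_hz)

def flow_gain(resistance, inertance, freq_hz):
    """Flow magnitude relative to its low-frequency value for series R-L."""
    x = 2.0 * math.pi * freq_hz * inertance / resistance
    return 1.0 / math.sqrt(1.0 + x * x)

# Hypothetical inputs: a glottal resistance estimated from pressure and mean
# flow, and a corner frequency read off the measured low-pass effect.
R_GLOTTAL = 1.0e6   # Pa*s/m^3, assumed
F_CORNER = 800.0    # Hz, assumed

L_EST = inertance_from_corner(R_GLOTTAL, F_CORNER)
print(f"estimated inertance: {L_EST:.1f} Pa*s^2/m^3")
print(f"gain at the corner:  {flow_gain(R_GLOTTAL, L_EST, F_CORNER):.3f}")
```

The flow-dependent conductance models discussed in the paper would change this relationship, which is part of why the full treatment is more elaborate than a single formula.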
XI. Cosi' Fan Tutte.
In Vocal Fold Physiology: Laryngeal Function in Phonation and Respiration, T. Baer, C. Sasaki, and K.S. Harris, eds., College Hill Press, San Diego, 254-263 (1986).
The title Cosi' Fan Tutte (CFT) was an attempt to humorously convey the meaning "the way the women do it". (If you don't like the title, you can put the blame on Donald Miller, who suggested it.) In the early 1980s, there were a number of professional singers who were regular visitors to the Speech Research Laboratory at Syracuse University. They were trying, with me, to unlock some of the secrets of the good singing voice. We worked primarily on the male voice because of the F0 limitations of the CV mask/inverse filtering system developed at the laboratory.
The sopranos working at the laboratory came to take this as a sign of sexism, albeit inadvertent, and not as an insuperable technical limitation. I was pressed hard by them to balance our research with some work on the female voice. Eventually, I gave in and thought of a straightforward but significant problem that might be approachable. How can a soprano with a very good voice and lots of training produce the ear-piercing and lengthy notes, rich in harmonic structure, near the top of her range, and do this without damaging her vocal cords or running out of breath? If it were simply a matter of tuning the vocal tract to one or another harmonic, the note would tend to be sinusoidal and not rich in harmonics (and thus not clearly convey a vowel quality), and the breath conservation and vocal cord damage issues would remain.
Fortunately we had among us, in Dolores Leffingwell, a talented soprano who could readily produce notes of this type and was willing to be a subject. To extend the range of the CV mask to the top of Dolores' range, I designed a small mask fitting over only her mouth. When using this mask at lower F0 values at which inverse filtering is relatively easy, the improvement in the details in the resulting waveform indicated that the response time of the mask and transducer was down significantly from the ¼ ms that was verified for a larger mask, though no formal measurements of response time were made. However, the problem remained of how to adjust the parameters of the inverse filter at higher pitches, namely those at which the value of F0 approached the expected value of the first formant. At these pitch levels, even a slight variation in the first formant frequency or damping setting would produce a great change in the waveform, and the normal procedure of eliminating or minimizing formant frequency oscillations occurring during the glottal closed phase did not work, since the closed phase would not contain even a single oscillation at the frequency of the first formant.
To solve this problem, we employed an electroglottograph. The mask signal and simultaneously recorded EGG signal were stored on a two-channel recording system and replayed repetitively, properly time-synchronized. The EGG signal was able to identify unambiguously the glottal closed period. After considerable adjustment of the inverse filter parameters, it was determined that there was only one setting for F1 frequency and damping at which there was a flat period in the waveform near zero airflow that closely coincided with the closed period indicated by the EGG. So with the help of the EGG signal we were able to see the glottal flow waveform!
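The adjustment described above was done by hand on replayed recordings. The following is only a digital sketch of the same idea, under strong simplifying assumptions: a single formant modeled as a two-pole resonator, a noise-free synthetic signal, and a closed-phase marker assumed to come from thresholding the EGG. All function names, the sampling rate, the pulse shape, and the parameter grids are hypothetical. It searches for the F1 frequency and bandwidth setting whose inverse-filtered output is flattest during the EGG-indicated closed period.

```python
import math

def _coeffs(f_hz, bw_hz, fs):
    """Two-pole resonator coefficients; the third value gives unity DC gain."""
    r = math.exp(-math.pi * bw_hz / fs)
    a1 = 2.0 * r * math.cos(2.0 * math.pi * f_hz / fs)
    a2 = r * r
    return a1, a2, 1.0 - a1 + a2

def resonator(x, f_hz, bw_hz, fs):
    """Stand-in for one vocal tract formant (used only to build test data)."""
    a1, a2, g = _coeffs(f_hz, bw_hz, fs)
    y, y1, y2 = [], 0.0, 0.0
    for xn in x:
        yn = g * xn + a1 * y1 - a2 * y2
        y.append(yn)
        y2, y1 = y1, yn
    return y

def inverse_filter(y, f_hz, bw_hz, fs):
    """Anti-resonance (two-zero) filter that cancels the resonator above."""
    a1, a2, g = _coeffs(f_hz, bw_hz, fs)
    x, y1, y2 = [], 0.0, 0.0
    for yn in y:
        x.append((yn - a1 * y1 + a2 * y2) / g)
        y2, y1 = y1, yn
    return x

def closed_phase_ripple(flow, closed):
    """Variance of the flow over samples marked closed (0 means flat)."""
    vals = [v for v, c in zip(flow, closed) if c]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)

def best_setting(mask_flow, closed, fs, f_grid, bw_grid):
    """Grid search for the (F1, bandwidth) pair with the flattest closed phase."""
    candidates = ((f, b) for f in f_grid for b in bw_grid)
    return min(candidates, key=lambda fb: closed_phase_ripple(
        inverse_filter(mask_flow, fb[0], fb[1], fs), closed))

# Synthetic test signal: F0 near 714 Hz with a half-period closed phase.
FS = 20000
PERIOD, OPEN = 28, 14
pulse = [math.sin(math.pi * n / OPEN) if n < OPEN else 0.0 for n in range(PERIOD)]
glottal = pulse * 20
closed = [n % PERIOD >= OPEN for n in range(len(glottal))]  # assumed EGG marker
mask_flow = resonator(glottal, 750.0, 100.0, FS)            # F1 close to F0

found = best_setting(mask_flow, closed, FS, range(600, 901, 50), (50.0, 100.0, 150.0))
print("selected F1 and bandwidth:", found)
```

In this idealized case exactly one grid point cancels the resonance and leaves the closed phase flat. A real recording also contains higher formants, mask and transducer response, and noise, so the criterion would be a minimum to look for and not an exact zero.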
The resulting waveform was surprising to us at first. It had two peaks and a null near the center of the glottal open period. The interpretation I made was that when the first formant was closely tuned to the F0 and there was a period of glottal closure that was about half the total period, the pressure wave generated by the previous glottal pulse and returning from the lips strongly suppressed the glottal airflow, creating the dip in airflow that we observed.
The result of this pattern was (1) a greatly reduced average airflow, with a resultant reduction in drying of the mucous membranes of the vocal folds and better conservation of lung volume, and (2) a glottal waveform with a rich overtone content that would add energy at higher formants and support judgments of vowel quality.
Both of these factors are potentially very important in singing. Consider point number 2. This result says that with the proper vocal fold vibratory pattern, the tuning of F1 to F0 can greatly increase energy at harmonics well above F0 and reduce the F0 component of the glottal wave. This result is not predicted by mathematical models of the singing voice in which source and vocal tract are considered to function separately.
I found point number 1 above easy to accept, since it agreed with another aspect of my experience in technology, similar to the way in which the male type of source-tract interaction in the strong voice was analogous to an automobile ignition system. The potential soprano source-tract interaction at high pitches is apparently analogous to the final amplification stage of a radio transmitter. Just as Dolores' vocal folds were open for only part of the glottal cycle, the final amplifying stage supplies current to the antenna circuit for only part of each cycle at the transmitter's carrier frequency. It is well known in communications technology that under these conditions, when the antenna circuit is tuned accurately to the transmission frequency, there is a drop in the power (electrical current) drawn from the power supply. In fact, when helping maintain a radio relay station on a mountaintop during my military service, I would periodically check for the proper tuning of each transmitter's antenna circuit by observing a meter measuring the electrical current taken by the final amplification stage, and adjusting the antenna circuit to minimize that current. I couldn't help wondering whether the highly trained soprano does the same thing, that is, develop a feel for the amount of breath flow used and adjust the articulators to minimize that flow. The paper also speculates that performance by a soprano at times in which the vocal fold physiology is not providing the correct conditions for this airflow reduction to occur may be harmful to the vocal fold tissues.
XII. The Control Of Airflow During Loud Soprano Singing.
Journal of Voice, Vol. 1 No. 3, 338-351 (1988).
The paper The Control Of Airflow During Loud Soprano Singing (COA), written with colleagues in the Speech Research Laboratory, Donald Miller, Richard Molitor and Dolores Leffingwell, can be seen as returning to the line of research of my doctoral dissertation (BSD, see I. above) after over 20 years in other areas, primarily voice research. (In the research for this paper, the CV mask was used for the purpose for which it was originally intended, the measurement of airflow during consonants.) The problem we considered stemmed from the fact that the subglottal air pressure used by a professional singer (we considered sopranos in this study) was known to reach values four or five times the pressure used in normal speaking. The previous paper (CFT) explored one mechanism for conserving breath volume during the vowel segments with air pressures this high. However, were there also mechanisms for conserving the breath volume during unvoiced consonants occurring between the vowels in the piece being sung? In most such consonants, as pronounced during speech, there is a period during which the vocal tract and glottis are open and the airflow increases.
It was affirmed in my dissertation (BSD) that the response times in the postural muscles controlling subglottal pressure were among the slowest in the body, and therefore it is not likely that the pressure could be reduced abruptly for the consonant and increased abruptly for a succeeding vowel. There are likely to be other compensatory mechanisms that must be learned by the professional singer. In COA we indeed found such mechanisms, and I will leave it to the reader to go to the paper itself to explore what we found.
To my knowledge, there have been no follow-ups to this study nor research on the implications for voice pedagogy. Perhaps this summary will help stimulate such activity.