WOMEN’S VOCAL AGING: A LONGITUDINAL APPROACH
Markus Brückl; Technische Universität Berlin; brueckl@kgw.tu-berlin.de
Figure 1: Histogram of the speakers' chronological age (at the time of their first recording). The speakers' ages range from 26 to 87 years; the arithmetic mean (AM) is 44.89 years, the standard deviation (SD) 21.79 years.
Figure 2: Relative frequencies of chronologically correct (blue) vs. false (green) judgments, grouped by utterance type: two out of three speech samples are judged correctly. Sustained /a/ vowels are judged falsely significantly more often than chance would predict. The frequencies in the utterance types "(all) vowels", "stat" and "offset" are not significantly different from chance level (50%). The much better judgments on (all) speech samples compared to those on (all) vowels indicate that speech utterances carry much more (reliable) information on age.
Figure 3: Relative frequencies of listener-internally equal (blue) vs. different (green) judgments of sample pairs consisting of the same utterances, grouped by utterance: This indicates a relatively high internal consistency of the listeners, overall as well as for sustained vowels.
Figure 4: Plot of the acoustic differences in nasal duration of read speech according to the chronological age grouping. The x-axis gives the speakers' age (at the time of their first recording) so that inter-individual trends (which are not evaluated by the Wilcoxon test) can be recognised: Nasal duration proves to be a rather good indicator of chronological age (p=0.021). [green=later; blue=earlier]
Figure 5: Plot of the acoustic differences in articulation rate (AR) of read speech according to the chronological age grouping. AR has previously been found to be an indicator of age; this result is validated here (p=0.028). Note, however, that this good result depends heavily on the speakers reading the same text.
Figure 6: Plot of the acoustic differences in total duration of read speech samples according to the chronological age grouping. The finding of Brückl & Sendlmeier (2003) that total duration is a similarly good indicator as AR is validated (p=0.038). Note, however, that this good result depends heavily on the speakers reading the same text.
Figure 7: Plot of the acoustic differences in the "center of formant concentration" of read speech according to the chronological age grouping. Endres et al. (1972) found this parameter to decrease over five-year steps in all examined phonemes and speakers. This result cannot be validated: the lines cross several times, no intra-speaker trend is recognisable, and the Wilcoxon test is not significant.
Abstract
A quasi-experimental longitudinal study was carried out to explore human auditory age perception abilities, to validate previously proposed acoustic indicators of a female speaker's age, and to detect further ones.

Motivation
Although research on acoustic cues to a speaker's age has increased considerably, several questions remain:
(I) Which acoustic speech features enable human listeners to estimate a speaker's age?
(II) Which acoustic speech features allow a speaker's age to be predicted automatically?
(III) How big is the "just audibly noticeable age difference"?
(IV) Which factors influence the amount of information on age?

Objectives
The main objectives of this study are to explore whether aging for 5 years can
(1) audibly and
(2) measurably change women's vocalisations,
and if so,
(3) which acoustic information the listeners’ performance could rely on and
(4) which parameters can contribute to detecting the chronological difference.
A further objective is to explore
(5) whether the (linguistic) complexity of an utterance influences the accuracy of age perception.

Speech Data
Nine adult female speakers (cf. Fig. 1) provided vocal utterances of three types (spontaneous speech, read speech and sustained vowels) on two occasions, separated by 5 years. The complexity of these utterance types is assumed to decrease in the order given.
The sustained vowels (/a/, /i/ and /u/) were cut into three parts of equal length, containing either the vowel onset, a quasi-stationary middle part, or the vowel offset.
This results in 11 different speech samples per speaker (2 speech samples plus 3 vowels × 3 parts).

Human Perception
These utterances were presented in pairs consisting of samples of the same type and speaker, but differing in the speaker's age. Each pair was presented twice, once in each of the two possible orders, and in a new random order for each listener. The listener's task was to decide which of the two samples sounds older.
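The pairing scheme of the listening tests can be sketched as follows. This is only an illustrative sketch: the speaker labels, the function name `build_trials`, the part label "onset" and the use of a per-listener seed are assumptions, and the actual presentation software is not described in the poster.

```python
import random

# Hypothetical labels; the poster only states that there were 9 speakers.
SPEAKERS = [f"S{i}" for i in range(1, 10)]
SPEECH_TYPES = ["spontaneous", "read"]
VOWEL_TYPES = [f"{v}_{part}" for v in ("a", "i", "u")
               for part in ("onset", "stat", "offset")]


def build_trials(sample_types, listener_seed):
    """Return one listener's trial list for one test.

    Every pair (same speaker, same sample type, earlier vs. later recording)
    occurs twice, once in each presentation order, and the whole list is
    shuffled anew for each listener.
    """
    trials = []
    for speaker in SPEAKERS:
        for sample_type in sample_types:
            earlier = (speaker, sample_type, "earlier")
            later = (speaker, sample_type, "later")
            trials.append((earlier, later))  # earlier recording played first
            trials.append((later, earlier))  # later recording played first
    random.Random(listener_seed).shuffle(trials)
    return trials


speech_trials = build_trials(SPEECH_TYPES, listener_seed=1)
vowel_trials = build_trials(VOWEL_TYPES, listener_seed=1)
print(len(speech_trials), len(vowel_trials))
```

Under these assumptions a listener hears 9 × 2 × 2 trials in the speech test and 9 × 9 × 2 trials in the vowel test.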
34 listeners (14 male, 20 female; age: AM=31.35, SD=8.79) participated in the speech test, 31 of them also in the vowel test.

Do the listeners' judgments agree with the chronological difference?
To answer this question, the following hypothesis was tested with a chi² test:
H0: The frequency of chronologically correct vs. false judgments is determined by pure chance (the chance level here is 50%).
This analysis was carried out separately for different sample type groups in order to evaluate differences in the amount of encoded age information between the sample types (cf. Fig. 2).
Results:
Spontaneous and read speech samples were judged (chronologically) correctly in about 2 out of 3 cases. The probability of this empirical result, if H0 is valid, is less than 0.0005.
For the sample type group of all vowels this probability is 0.214, i.e. the result is not significant. The analysis of sub-groups of vowels indicates that vowel quality strongly interacts with chronological agreement: /a/ parts are judged falsely, whereas /i/ and /u/ are judged correctly, all highly significantly.

Are the listeners able to judge consistently?
To check the listener-internal consistency of the judgments, each listener's two judgments on one sample pair were compared. The frequencies of equally vs. differently judged sample pairs were also evaluated with a chi² test:
H0: The frequency of equal vs. different judgments of same sample pairs is determined by pure chance.
Fig. 3 summarizes these frequency relations for the previously used sample type groups.
Results:
The listeners judged about 2 out of 3 sample pairs consisting of the same samples equally, irrespective of the sample type. H0 must be rejected.
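Both chi² tests above compare two observed frequencies with a 50% chance level. A minimal sketch of such a test is given below. The function name and the counts are illustrative assumptions, not the study's data: the total of 1224 assumes that all 34 listeners judged all 9 speakers × 2 speech types × 2 presentations, and the split simply mirrors "about 2 out of 3".

```python
import math


def chi2_against_chance(n_first, n_second):
    """Chi-square goodness-of-fit test of two counts against a 50/50 split.

    Returns the test statistic and the p-value for one degree of freedom,
    using the identity P(X > x) = erfc(sqrt(x / 2)) for a chi-square
    distribution with df = 1.
    """
    total = n_first + n_second
    expected = total / 2
    chi2 = ((n_first - expected) ** 2 + (n_second - expected) ** 2) / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Illustrative counts only (assumption): 34 * 9 * 2 * 2 = 1224 judgments,
# two thirds of them chronologically correct.
n_correct, n_false = 816, 408
chi2, p_value = chi2_against_chance(n_correct, n_false)
print(f"chi2 = {chi2:.1f}, p = {p_value:.3g}")
```

The same function applies to the consistency analysis when the two counts are the numbers of equally and differently judged sample pairs.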
Acoustic Analysis
75 parameters intended to capture the concepts pitch (24), hoarseness/roughness/breathiness (25), vocal tremor (9) and vowel quality (17) were extracted from each speech sample with PRAAT or the Multi-Dimensional Voice Program (KAY-Pentax).
The read speech samples were additionally found suitable for a reliable measurement of duration and speech tempo parameters. Furthermore, they seem comparable enough to extract certain parameters of the above-mentioned concepts from certain segments (or segment groups such as "all vowels", "all nasals", ...). The manual segmentation follows linguistic categories in order to reduce variability due to linguistic modulation. This procedure yielded another 315 parameters.

Which parameters vary significantly between the samples differing in their speaker's age?
To evaluate differences between the parameter values from the chronologically younger samples and those from their paired older samples, Wilcoxon tests were applied.
H0: The parameter value differences between younger and older samples can be explained by chance.
Results:
This H0 can be rejected, given a 5% alpha error level,
- in the sample type group of all sustained vowels for 32 out of the 75 applied parameters, mostly pitch and amplitude perturbation parameters; but perturbation measures are increased in the younger samples(!);
- in the sample type group of read and spontaneous speech for 9 out of the 75 applied parameters, roughly reflecting the results obtained on the sustained vowels in a weakened form;
- in the segment-wise analysis of read speech for 44 out of the 315 applied parameters, indicating in the older samples longer durations, slower articulation rate, less breathing-out, lowered F0 (in vowel segments), but again decreased (amplitude) perturbations (in vowel segments).

Which parameters vary according to the listeners' age perception?
Again, Wilcoxon tests and the same H0 are applied, but the age grouping is not determined by the "chronological facts" but
by the listeners' perception (per speech sample, not per speaker).
Results:
H0 can be rejected, given a 5% alpha error level,
- in the sample type group of all sustained vowels for 44 out of the 75 applied parameters, including pitch, perturbation, noise and tremor parameters and F2; but values of pitch and amplitude perturbations are here raised in the older samples;
- in the sample type group of read and spontaneous speech for 11 out of the 75 applied parameters; here amplitude perturbations are higher in the perceptually younger samples;
- in the segment-wise analysis of read speech for 46 out of the 315 applied parameters, indicating in the older samples overall longer durations, but a smaller duration contribution of plosives and fricatives, slower syllable-based articulation rate, longer pauses, and again decreased (amplitude) perturbations (in vowel segments).

Interpretation
(1) Humans are able to correctly judge whether a speaker has aged by 5 years, based on read or spontaneous speech samples, in about 2 out of 3 cases. Based on sustained /i/ and /u/ vowels there are still significantly more correct judgments, but sustained /a/ yields significantly more false judgments.
(2) There are measurable differences between the utterances differing in their speaker's chronological age.
(3) The listeners’ performance on sustained vowels seems to rely on concepts that correlate with (especially amplitude) perturbation parameters, although these applied age concepts seem to be misleading, at least for the 5-year age difference. Although amplitude perturbations are present to a comparable degree in the speech samples, there seems to be enough additional information (speech tempo, durations, pitch, ...) to achieve more reliable results.
(4) The most promising acoustic speech parameters according to this study are duration and tempo measures and possibly, if their modulation due to various factors could be determined, also pitch, tremor and formant measures.
(5) The perception results indicate that more (linguistically) complex utterances yield a chronologically more accurate perception, but a reproducible "age evaluation scheme" exists even for sustained vowels (cf. Fig. 2).
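The paired comparisons in the acoustic analysis rest on Wilcoxon tests of earlier vs. later values per speaker. The sketch below shows one way such a test can be computed for nine speakers. It is an assumption-laden illustration: the poster does not state whether exact or asymptotic p-values were used, the function name is hypothetical, zero differences are simply dropped, and the duration values are invented for demonstration only.

```python
import itertools


def wilcoxon_signed_rank_exact(earlier, later):
    """Two-sided exact Wilcoxon signed-rank test for paired samples.

    Returns (W, p), where W is the smaller of the positive and negative
    rank sums. Zero differences are dropped; tied absolute differences
    receive average ranks. The exact p-value is obtained by enumerating
    all sign assignments, which is feasible for small samples.
    """
    diffs = [b - a for a, b in zip(earlier, later) if b != a]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_stat = min(w_plus, total - w_plus)
    # Count sign assignments that are at least as extreme as the observed one.
    extreme = 0
    for signs in itertools.product((0, 1), repeat=n):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= w_stat:
            extreme += 1
    return w_stat, extreme / 2 ** n


# Invented nasal durations in ms for nine speakers (illustration only).
earlier_ms = [62, 70, 58, 66, 75, 61, 68, 72, 64]
later_ms = [66, 73, 65, 64, 81, 70, 69, 80, 74]
w, p = wilcoxon_signed_rank_exact(earlier_ms, later_ms)
print(f"W = {w}, p = {p:.4f}")
```

Integer-valued inputs are used here so that ties between absolute differences are detected reliably; with floating-point measurements, rounding before ranking may be advisable.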
Conclusions
(I) The type and weight of the acoustic cues to a speaker's age that are exploited by human listeners seem to vary according to the listening task and according to the type/amount of information in the utterance.
(II) Since human perception abilities still outperform technical approaches, it seems useful to try to adopt them and to develop a task-dependent focussing/weighting mechanism for different acoustic cues. There seem to exist many more age-indicating cues than are known, possibly linked to concepts like prosody. The precise prediction of a speaker's chronological age seems to be limited by the fact that we age differentially. [Theoretical considerations suggest that the process of aging affects nearly all conceivable subsystems involved in speech production, and all differentially.]
(III) The just noticeable difference seems to vary according to the amount of information provided by an utterance: for sustained vowels it is possibly around 5 years. For speech it seems to be smaller.
(IV) More (linguistic) complexity of an utterance seems to provide more potential age information.