A central topic in
spoken-language-systems research is what’s called speaker diarization, or
computationally determining how many speakers feature in a recording and which
of them speaks when. Speaker diarization would be an essential function of any
program that automatically annotated audio or video recordings. To date, the
best diarization systems have relied on supervised machine learning: They're
trained on sample recordings that a human has indexed, indicating which
speaker enters when. However, MIT researchers describe a new
speaker-diarization system that achieves comparable results without
supervision: No prior indexing is necessary.
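To give a rough sense of what "unsupervised" means here, the sketch below clusters per-segment feature vectors into speakers without any labeled training data, and picks the number of speakers with a generic clustering-quality score. It is a hypothetical baseline for illustration only, not the MIT system's algorithm; the segment features, the 60-dimensional feature space, and the use of k-means and the silhouette score are all assumptions made for the example.

```python
# Hypothetical sketch: unsupervised diarization as clustering of per-segment
# feature vectors, with the number of speakers chosen by silhouette score.
# Generic illustration only -- not the MIT researchers' actual method.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def diarize(segment_features, max_speakers=10):
    """segment_features: (n_segments, n_dims) array, one vector per audio segment.
    Returns (estimated number of speakers, speaker label for each segment)."""
    best_k = 1
    best_score = -1.0
    best_labels = np.zeros(len(segment_features), dtype=int)
    for k in range(2, min(max_speakers, len(segment_features)) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(segment_features)
        score = silhouette_score(segment_features, labels)   # higher = cleaner clusters
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels

# Toy usage: three synthetic "speakers", 20 segments each, in a 60-dim feature space.
rng = np.random.default_rng(0)
segments = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 60)) for c in (0.0, 3.0, 6.0)])
n_speakers, who_speaks_when = diarize(segments)
print(n_speakers)            # expected: 3
print(who_speaks_when[:10])  # cluster label for the first ten segments
```

No human-annotated examples are involved at any point: the structure comes entirely from how the segments group together in feature space, which is the essential contrast with the supervised systems described above.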
Moreover, one of the MIT
researchers’ innovations was a new, compact way to represent the differences
between individual speakers’ voices, which could be of use in other
spoken-language computational tasks. To create a sonic portrait of a single
speaker, explains Glass, one of the researchers, a computer system will
generally have to analyze more than 2,000 different speech sounds; many of
those may correspond to familiar consonants and vowels, but many may not.
To characterize each of those sounds,
the system might need about 60 variables, which describe properties such as the
strength of the acoustic signal in different frequency bands.
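The kind of description mentioned here, roughly 60 numbers per short stretch of audio capturing signal strength across frequency bands, resembles standard frame-level spectral features. Below is a minimal, hypothetical sketch of such features, built from a short-time Fourier transform pooled into 60 bands; the frame length, hop size, and band layout are assumptions for illustration, not the researchers' exact representation.

```python
# Hypothetical sketch: ~60 "band energy" variables per short audio frame,
# i.e., the strength of the acoustic signal in different frequency bands.
# Standard STFT + band pooling; not the researchers' actual feature set.
import numpy as np

def band_energy_features(signal, sample_rate, n_bands=60,
                         frame_ms=25.0, hop_ms=10.0):
    """Return one n_bands-dimensional vector per 25 ms frame (10 ms hop)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hanning(frame_len)

    features = []
    for start in range(0, len(signal) - frame_len + 1, hop_len):
        frame = signal[start:start + frame_len] * window
        spectrum = np.abs(np.fft.rfft(frame)) ** 2            # power spectrum
        bands = np.array_split(spectrum, n_bands)              # 60 frequency bands
        energies = np.array([band.sum() for band in bands])    # strength per band
        features.append(np.log(energies + 1e-10))              # log band energies
    return np.array(features)                                  # (n_frames, n_bands)

# Example: one second of synthetic audio at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 220 * t) + 0.1 * np.random.randn(sr)
feats = band_energy_features(audio, sr)
print(feats.shape)   # (98, 60): 98 frames, 60 band energies each
```

Each frame's 60 numbers play the role of the "variables" described above; a collection of such vectors is the raw material from which a compact portrait of a speaker's voice can then be distilled.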