21 October 2013

Automatic Speaker Tracking

A central topic in spoken-language-systems research is what's called speaker diarization, or computationally determining how many speakers feature in a recording and which of them speaks when. Speaker diarization would be an essential function of any program that automatically annotated audio or video recordings. To date, the best diarization systems have used supervised machine learning: They're trained on sample recordings that a human has indexed, indicating which speaker enters when. However, MIT researchers have developed a new speaker-diarization system that achieves comparable results without supervision: No prior indexing is necessary.
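
The task lends itself to a simple illustration. Below is a minimal, hypothetical sketch of unsupervised diarization, not the MIT researchers' system: per-frame acoustic features (here replaced by synthetic Gaussian clouds standing in for something like MFCCs) are clustered with scikit-learn's BayesianGaussianMixture, which can leave unneeded mixture components empty and thereby estimate the number of speakers without any labeled training recordings. All names and parameter values are illustrative.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for per-frame acoustic features (e.g., 20-dim MFCCs):
# three "speakers", each a separate Gaussian cloud, concatenated in time
# as if they spoke one after another in a single recording.
speakers = [rng.normal(loc=c, scale=1.0, size=(300, 20)) for c in (-4.0, 0.0, 4.0)]
frames = np.vstack(speakers)

# A Dirichlet-process-style mixture with more components than we expect
# speakers; components that aren't needed receive negligible weight, so
# the number of speakers is inferred rather than supplied.
gmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    max_iter=500,
    random_state=0,
).fit(frames)

labels = gmm.predict(frames)
active = np.unique(labels)
print(f"estimated speakers: {len(active)}")  # ideally 3 for this synthetic data

# "Who speaks when": map each frame index back to its cluster label.
for spk in active:
    idx = np.flatnonzero(labels == spk)
    print(f"cluster {spk}: frames {idx.min()}-{idx.max()}")
```

On real audio the frames would be feature vectors extracted from the waveform, and adjacent frames would typically be smoothed into contiguous speaker turns; the clustering step itself is the unsupervised core of the idea.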

Moreover, one of the MIT researchers' innovations was a new, compact way to represent the differences between individual speakers' voices, which could be of use in other spoken-language computational tasks. To create a sonic portrait of a single speaker, explains Jim Glass, a senior research scientist at MIT's Computer Science and Artificial Intelligence Laboratory, a computer system will generally have to analyze more than 2,000 different speech sounds; many of those may correspond to familiar consonants and vowels, but many may not. To characterize each of those sounds, the system might need about 60 variables, which describe properties such as the strength of the acoustic signal in different frequency bands. A full characterization of a single voice would thus run to roughly 120,000 numbers (2,000 sounds times 60 variables apiece), which the researchers' compact representation reduces to a far smaller set.
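
The scale of that reduction can be sketched numerically. The toy example below stacks the per-sound characterizations into one long "supervector" per speaker and projects a pool of them down to a few dozen numbers each; it uses PCA purely as a stand-in, since the article does not specify the researchers' compression method, and every size in it is illustrative rather than taken from their system.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_speakers, n_sounds, n_vars = 50, 2000, 60

# One supervector per speaker: ~2,000 speech sounds, each characterized by
# ~60 variables, concatenated into a single 120,000-dimensional vector.
# Random data here stands in for real acoustic measurements.
supervectors = rng.normal(size=(n_speakers, n_sounds * n_vars))

# Learn a compact basis from a pool of speakers, then express each voice
# in that basis (rank is limited by the number of training speakers).
pca = PCA(n_components=40)
compact = pca.fit_transform(supervectors)

print(supervectors.shape)  # (50, 120000): full per-speaker characterization
print(compact.shape)       # (50, 40): compact speaker representation
```

The design point is that differences between voices occupy a much lower-dimensional subspace than the raw description suggests, so a learned projection can preserve speaker identity while discarding most of the 120,000 coordinates.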
