Okko Räsänen

Computational modeling of early language acquisition

Children learn their native language relatively effortlessly and without explicit human supervision, simply by interacting with their language community. This is an extraordinary achievement once we consider the challenges involved in early language acquisition. For instance, starting without any a priori linguistic knowledge, infants have to learn to segment words out of continuous speech, learn how these words map to their referential meanings, how multiple words combine into grammatical constructs whose meaning differs from that of the individual words alone, and how words are built from smaller units, such as phones and syllables, that carry no meaning in isolation. The task of speech segmentation is greatly complicated by the fact that there are no obvious cues to word or phone boundaries in natural speech, and that there is an enormous amount of acoustic variability that makes the same words sound completely different when spoken in different contexts or by different talkers.

Much of the existing work on human word learning assumes that the learner is capable of representing speech in terms of discrete categorical units such as phonemes. However, a central challenge for a language learner is to understand what types of structural units make up the language and what kind of acoustic variation is relevant for categorizing these units. Although 6-month-old infants already show adaptation to the properties of their native-language speech sounds, they remain highly sensitive to phonologically irrelevant acoustic variation in speech far into their second year of life. By definition, phonemic structure emerges from the minimal changes in word form that lead to a change in the meaning of the word (e.g., changing the [p] in "pat" to [b] creates a word with a different meaning: "bat"), suggesting that a phonemic representation of language cannot be learned before knowing any words of the language. Therefore, any comprehensive account of early language learning should also explain how the learner overcomes the acoustic variability of speech.

In my research, I study how a learner can bootstrap the language acquisition process from scratch, making only minimal assumptions about pre-existing language knowledge. This includes investigating what types of computational learning mechanisms, constraints, and information (data) sources are required for successful learning. I am especially interested in how infants learn to segment words and subword units such as phones and syllables despite the acoustic variability of natural speech, with and without the help of contextual cues. My primary research method is to build computational systems that can perform human-like unsupervised learning on real-world data. My approach roughly follows the so-called statistical learning paradigm in developmental psychology, which broadly refers to the idea of (language) learning as a process of discovering intra- and cross-modal regularities in the sensory input. In short, I try to create algorithms that learn from real speech and also match the behavioral data on language learning.
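To make the statistical learning idea concrete, the sketch below shows perhaps its simplest instantiation: computing transitional probabilities between adjacent syllables and positing word boundaries where the probability dips. This is only a toy, symbolic illustration of the general principle (the syllable lexicon, stream, and threshold are invented for the example); models that learn from real speech must operate on continuous acoustic input rather than pre-segmented syllables.

    import random
    from collections import Counter

    def transitional_probabilities(syllables):
        # Estimate P(next | current) for every adjacent syllable pair; in
        # statistical-learning accounts, dips in transitional probability
        # mark candidate word boundaries.
        pair_counts = Counter(zip(syllables, syllables[1:]))
        first_counts = Counter(syllables[:-1])
        return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

    def segment(syllables, tp, threshold=0.8):
        # Posit a word boundary wherever the pairwise transitional
        # probability dips below the threshold.
        words, current = [], [syllables[0]]
        for a, b in zip(syllables, syllables[1:]):
            if tp[(a, b)] < threshold:
                words.append("".join(current))
                current = []
            current.append(b)
        words.append("".join(current))
        return words

    # Toy input: a continuous stream built from three invented "words",
    # concatenated in random order with no pauses between them.
    random.seed(0)
    lexicon = [["pre", "ty"], ["ba", "by"], ["do", "ggy"]]
    stream = [syl for _ in range(50) for syl in random.choice(lexicon)]

    # Prints the stream re-segmented into "pretty", "baby", and "doggy":
    # within-word pairs are fully predictable, cross-word pairs are not.
    print(segment(stream, transitional_probabilities(stream)))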

In existing work together with my colleagues, we have shown how recurring word forms can be learned from acoustic speech in a purely unsupervised manner using a statistical learning algorithm (Räsänen, 2011), and how this learning is greatly facilitated by the presence of contextual referential cues such as visible objects or actions, allowing the learner to jointly solve the segmentation and meaning acquisition problems (Räsänen & Laine, 2012; Räsänen & Rasilo, 2015, submitted). Moreover, we have shown how the basic time-frequency processing characteristics of the human auditory system can be learned from ecologically relevant auditory stimuli using statistical learning (Räsänen & Laine, 2013), how the perception of sentence stress can also be explained in terms of statistical learning of prosodic contours (Kakouros & Räsänen, 2014, submitted), and how statistical learning of short-term acoustic structure can provide cues to phone boundaries in speech (Räsänen, 2014). We have also investigated how a learner can acquire the mapping between caregiver speech and the learner's own articulatory gestures and speech sounds using cross-situational learning (Rasilo, Räsänen & Laine, 2013); a toy version of the cross-situational mechanism is sketched below.
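To illustrate the cross-situational idea, the following sketch simply accumulates word-referent co-occurrence counts across referentially ambiguous learning episodes. The episodes, word forms, and referent labels are all invented for the example, and the symbolic inputs stand in for the acoustic and articulatory representations used in the studies cited above.

    from collections import Counter, defaultdict

    def cross_situational_learner(episodes):
        # Accumulate word-referent co-occurrence counts; each episode
        # pairs the heard word forms with the set of referents visible
        # in that scene.
        cooc = defaultdict(Counter)
        for words, referents in episodes:
            for w in words:
                for r in referents:
                    cooc[w][r] += 1
        # Map each word to its most frequently co-occurring referent.
        return {w: max(counts, key=counts.get) for w, counts in cooc.items()}

    # Invented episodes: no single scene identifies the referent of
    # "dog" or "ball", but the correct pairing dominates across scenes.
    episodes = [
        (["look", "dog"],  {"DOG", "BALL"}),
        (["the", "dog"],   {"DOG", "CUP"}),
        (["nice", "ball"], {"BALL", "DOG"}),
        (["a", "ball"],    {"BALL", "CUP"}),
    ]
    # Yields 'dog' -> 'DOG' and 'ball' -> 'BALL'; words without a
    # consistent referent (e.g., 'the') end up with only weak, tied counts.
    print(cross_situational_learner(episodes))

Although every individual episode is ambiguous, the consistent pairings accumulate the strongest counts over time, which is the core intuition behind cross-situational learning.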

More information can be found in the respective publications.


Contact: firstname.surname@aalto.fi