LabPhon 7
Themes
Once speakers have decided what words to use, they will have to
retrieve the corresponding phonological forms from their lexicons,
build phonological representations of utterances, and pass this
information on to the phonetic implementation module. In some cases,
the phonological form of the word will have to be pulled together from
different morphemes, as in the case of regular inflections in English,
while for simplex words and irregular forms like "went" (GO-Past) no
such assembly is needed. Where does the cut-off point lie here? For instance,
are null-inflections assembled? Can
speakers retrieve phonological forms "surgically", without interference
from phonological forms of semantically related words? And what is the
form of this phonological representation: are segments syllabified
and syllables metrified, or is the full representation still to be
assembled? Are the segments fully specified or will redundant features
have to be supplied, and if so at what stage? And once a full surface
representation has been constructed, complete with higher prosodic
structure, what sort of chunks are passed on to the phonetic
implementation? Are articulatory programmes built from scratch,
starting from the feature, or can ready-made higher-level programmes
be called on? And what happens when such ready-made programmes cross
word-boundaries and disturb the word-based prosodic structures
computed or retrieved earlier?
The amount of information available to hearers in the acoustic speech
signal is very large indeed. Which parts of it do listeners use, and when
and how, in order to decide what they hear? A considerable research effort has been devoted
to the question of how listeners decide where words begin, that is,
when it is useful to entertain the hypothesis that a certain part of
the speech stream constitutes the beginning of a word. Listener
strategies appear to be language-specific: only if one's language has
word-based vowel harmony does it make sense to take sequential vowel
qualities into consideration when deciding where to begin a lexical
search. Likewise, the extent to which listeners may provide strings of
perceived segments with syllable structure may depend on the role
which syllable structure plays in the phonology of the language. How
do listeners cope with context-dependent segmental insertions,
deletions and assimilations? What role is played by the prosodic and
tonal structure? What kind of representation is available in the
lexicon with which perceived phonetic or phonological features can be
matched during a search? And when and how are various types of
non-phonological information brought to bear on the recognition
process?
Phonological theory aims to account for the shapes of the sound
systems of the world's languages. What segments, what metrical, tonal
and prosodic structures do languages have, how do they combine
linearly and hierarchically, and why are these segments and structures
statistically distributed the way they are, within and across
languages? A prerequisite to serving this aim is the availability of
reliable data. Although it may go too far to say that every new
language provides at least one aspect which overturns our unspoken
conceptions of what can exist, it is fair to say that new data
continue to throw unexpected light on current conceptions of
phonological structure, and we are far from feeling confident that we
know what we need to know. The current threat of extinction that looms
over the stock of spoken languages makes the crucial role of field
work all the more conspicuous. Greater mobility, together with the
availability of high-quality recording and analysis techniques, has
widened our notion of "field" in the sense that the field may be the
laboratory at the other end of the corridor from our offices, and that
the crucial element in this theme is theoretical advances built on
"primary data".
More so than in speech synthesis, advances in speech recognition have
been possible without any appreciable contribution from phonological
theory. This is because researchers have worked with pattern
recognition techniques which are independent of the medium within
which those patterns exist. Indeed, success in optical recognition has
likewise been possible in the absence of a theory of visual
perception. In many ways, however, the success of current speech
recognition systems is limited. Personalised dictation systems as well
as systems with unlimited numbers of speakers and limited sets of
expressions to be recognised fall far short of the performance
achieved by humans, who recognise unlimited sets of expressions spoken
by unlimited sets of speakers. A breakthrough can perhaps be forced by
a consideration of the way humans identify linguistic units in a
situation where their acoustic properties are highly variable, due to
interactions with speaking style, speaker, and other nearby
units. This would allow the current knowledge-shy recognition strategy
to be replaced with one that makes non-trivial use of phonological
representations. Possibly, too, progress in speech synthesis can be
based on a careful implementation of phonological accounts, particularly
in the area of prosody.
Phonological representations are conglomerates of discrete features
and structures selected from a finite set. This is how we assume
humans solve the onerous task of remembering the sound forms of the
words they know. What speakers produce and hearers receive, however,
are continuously varying acoustic patterns, whose shapes are
determined by the ergonomics of vocal sound production and perception.
It is evident that the nature of the set of features and structures is
historically indebted to the ergonomics of speaking and perceiving,
and that many changes that occur over time are at least in part
determined by these ergonomics. But how direct are these relations?
Does a speaker's phonology change with every change in the phonetics,
and if not, how much leeway are speakers allowed in the phonetic
implementation? Or is the notion that phonological representations and phonetic
implementation are separate modules perhaps misguided? How phonetic are
phonological representations, and to what extent does phonetic implementation
aid and abet the signalling of phonological contrasts? Do features
refer to articulatory states, to articulatory gestures, or to auditory
effects, or perhaps to all of these at the same time?