de Haan, P. & K. van Esch.
Measuring and assessing the development of foreign language writing competence,
in Porta Linguarum, 9: 7-21.
This paper discusses the development of writing skills of Dutch students of English
and Spanish as foreign languages. Essays written in three consecutive years were
analyzed for essay length, word length, and type/token ratio - reflecting linguistic
competence. A selection of essays was analyzed for argument structure and the use
of cohesive devices. These same essays were ranked holistically by experienced lecturers.
Students develop linguistic and discourse competences, but differ according
to language, proficiency level, and year of study. Assessors' arguments for
ranking are related mainly to the students' linguistic competence. The implications
of our findings for research and teaching are discussed.
Key words: foreign language writing, discourse competence, writing development, assessment.
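The three surface measures named in this abstract (essay length, word length, and type/token ratio) can be illustrated with a minimal sketch. The tokenisation rule and the function name `basic_measures` are assumptions made for illustration; the abstract does not say how the authors computed these figures. Note that type/token ratio is sensitive to text length, so raw values are only comparable across essays of similar size.

```python
import re


def basic_measures(essay: str) -> dict:
    """Return essay length (in tokens), mean word length, and type/token ratio.

    Hypothetical helper: tokens are lower-cased runs of letters (accented
    letters included), optionally with an internal apostrophe.
    """
    tokens = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", essay.lower())
    if not tokens:
        return {"essay_length": 0, "mean_word_length": 0.0, "type_token_ratio": 0.0}
    types = set(tokens)  # distinct word forms
    return {
        "essay_length": len(tokens),
        "mean_word_length": sum(len(t) for t in tokens) / len(tokens),
        "type_token_ratio": len(types) / len(tokens),
    }


measures = basic_measures("The cat saw the dog")
```

Here `measures` describes a five-token text with four distinct word forms.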
de Haan, P. & K. van Esch.
Measuring and assessing the development of foreign language writing competence,
in Porta Linguarum, 9: 7-21.
En este artículo se describe el desarrollo de la competencia de escribir de
alumnos universitarios de inglés y de español como lenguas extranjeras en los
Países Bajos. Ensayos escritos en tres años consecutivos fueron analizados en
cuanto al tamaño de los ensayos, el tamaño de las palabras y el type/token ratio,
aspectos que reflejan la competencia lingüística. Una selección de estos ensayos
fue analizada en cuanto a la competencia discursiva y el uso de mecanismos de cohesión.
Los mismos ensayos fueron evaluados y ordenados holísticamente por profesores expertos.
Resulta que los alumnos desarrollan sus competencias lingüística y discursiva
pero que se diferencian según la lengua que estudian, su nivel de competencia
y su año de estudio. Los argumentos de los evaluadores están relacionados sobre
todo con el nivel de competencia lingüística de los alumnos. Se discuten las
implicaciones de estos resultados para la investigación y la enseñanza.
Palabras clave: escritura en lengua extranjera, competencia discursiva, desarrollo
de la escritura, evaluación
de Haan, P. English writing by
Dutch-speaking students, in Teubert, W. & R. Krishnamurthy (eds.) Corpus Linguistics (Critical Concepts in Linguistics). Vol VI. London - New York: Routledge. 92-101.
In two studies (de Haan, 1997;
de Haan, 1998) I have reported on experimental work
involving the comparison of advanced learners' writing with native students' writing,
in an attempt to gain more insight into the syntactic behaviour of advanced learners of
English. Both studies were based on the investigation of tagged versions of four subsets
of the ICLE corpus (cf. Granger 1993). Sequences of two, three or four tags were taken
to be indicative of syntactic structures. A quantitative study of these sequences revealed
certain differences, but also similarities between the native students and the learners on
the one hand, and among the three groups of learners on the other.
In the current article I make a comparison between the Dutch-speaking students and the
native students, with a view to assessing the quality of the Dutch-speaking students'
writing. Dutch and English are closely related languages and Dutch-speaking students
of English have traditionally been expected to attain a "near-native" level of oral
and written proficiency of English at universities both in The Netherlands and in Belgium.
The study of the ICLE data on both the microscopic and the macroscopic level suggests
that differences in syntactic behaviour between Dutch and native students are due to a
less extensive command of vocabulary by the Dutch learners; a less extensive command of
vocabulary will inevitably lead to a less informative text, giving it a less sophisticated
appearance. It is this aspect of writing that, more than the mechanics of the production
of grammatically correct sentences, needs to be given more attention.
Reference
Granger, Sylviane. 1993. International corpus of learner English, in Aarts, J.,
P. de Haan and N. Oostdijk (eds.): English language corpora: Design, analysis and
exploitation. Amsterdam: Rodopi. 57-69.
de Haan, P. & K. van Esch.
Assessing the development of foreign language writing skills: Syntactic and lexical
features,
in Fitzpatrick, E. (ed.) Corpus Linguistics Beyond the Word: Corpus Research
from Phrase to Discourse.
185-202. Amsterdam - New York, NY: Rodopi.
In de Haan & van Esch (2004; 2005) we
outline a research project designed to
study the development of writing skills in English and Spanish as foreign
languages, based on theories developed, for instance, in Shaw & Liu (1998) and
Connor & Mbaye (2002). This project entails collecting essays written by
Dutch-speaking students of English (EFL writing) and Dutch-speaking students of
Spanish (SFL writing) at one-year intervals, in order to study the development
of their writing skills, both quantitatively and qualitatively. The essays are
written on a single prompt, taken from Grant & Ginther (2000), asking the students
to select their preferred source of news and give specific reasons to support
their preference. Students' proficiency level is established on the basis of
holistic teacher ratings.
A first general analysis of the essays has been carried out with WordSmith
Tools. Moreover, the texts have been computer-tagged with Biber's tagger (Biber,
1988; 1995). An initial analysis of relevant text features (Polio, 2001) has
provided overwhelming evidence of the relationship between a number of basic
linguistic features and proficiency level (de Haan & van Esch,
2004; 2005).
In the current article we present the results of more detailed analyses of
the EFL material collected from the first cohort of students in two consecutive
years, 2002 and 2003, and discuss a number of salient linguistic features of
students' writing skills development. We first discuss the development of general
features such as essay length, word length and type/token ratio. Then we move on
to discuss how the use of specific lexical features (cf. Biber, 1995; Grant &
Ginther, 2000) has developed over one year in the three proficiency level groups
that we have distinguished. While the development of the general features over
one year is shown to correspond logically to what can be assumed to be increased
proficiency, the figures for the specific lexical features studied do not all
point unambiguously in the same direction.
References
Biber, D. 1988. Variation across speech and writing. Cambridge: Cambridge University Press.
Biber, D. 1995. Dimensions of register variation: A cross-linguistic comparison. Cambridge: Cambridge University Press.
Connor, U. & Mbaye, A. 2002. Discourse approaches to writing assessment, Annual Review of Applied Linguistics, 22: 263-278.
Grant, L. & Ginther, A. 2000. Using computer-tagged linguistic features to describe L2 writing differences, Journal of Second Language Writing, 9: 123-145.
Polio, C. 2001. Research methodology in L2 writing assessment, in T. Silva & P. K. Matsuda (eds.), On Second Language Writing. Mahwah, NJ: Lawrence Erlbaum Associates. 91-115.
Shaw, P. & Liu, E. 1998. What develops in the development of second-language writing?, Applied Linguistics, 19: 225-254.
van Esch, K., P. de Haan,
L. Frissen, I. González Santero & A. de la Torre Miranda. Evolución en la
competencia escrita de estudiantes de español como lengua extranjera, in
Estudios de Lingüística Aplicada, 24, 43: 55-76.
Este estudio continúa las investigaciones de van Esch, de Haan & Nas y de Haan & van Esch sobre el desarrollo de la escritura en inglés y español como lenguas extranjeras y de Teijeira Rodríguez, van Esch & de Haan sobre la estructura, la coherencia y la cohesión en ensayos escritos en español por estudiantes holandeses y estudiantes nativos de español. La pregunta central de este estudio es si se produce en la calidad de los ensayos escritos por no nativos una evolución significativa hacia características que se consideran propias de textos escritos por nativos de español. Se termina esbozando sugerencias para futuros campos de estudio y apuntando algunas aplicaciones prácticas para la enseñanza del español como lengua extranjera.
van Esch, K., P. de Haan,
L. Frissen, I. González Santero & A. de la Torre Miranda. Evolución en la
competencia escrita de estudiantes de español como lengua extranjera, in
Estudios de Lingüística Aplicada, 24, 43: 55-76.
This study is a continuation of the studies by van Esch, de Haan & Nas and de Haan & van Esch on the development of English and Spanish foreign language writing, and by Teijeira Rodríguez, van Esch & de Haan on the structure, coherence and cohesion in Spanish essays written by Dutch university students and by native Spanish students. The central question in the current paper is whether there is a significant development in the writing produced by non-native students in the direction of what is considered typical of native Spanish texts. The paper is concluded with suggestions for further research in this area and practical applications for the teaching of Spanish as a foreign language.
Teijeira Rodríguez,
M., K. van Esch & P. de Haan. La coherencia y la cohesión en textos escritos por
estudiantes neerlandeses de español como LE, in
Estudios de Lingüística Aplicada, 23, 41: 67-100.
En este artículo se presentan los resultados del análisis de la coherencia y la cohesión
de un corpus de textos de hablantes nativos de español y otro corpus de textos escritos
por holandeses en español como lengua extranjera (LE).
Los textos analizados forman parte del conjunto de textos recopilados en el marco de
una investigación más amplia que se realiza en la Universidad de Nimega entre 2002 y
2008 sobre la escritura de ensayos argumentativos en inglés LE y en español como LE
y lengua materna (L1), cuyos primeros resultados fueron publicados en 2004 en Estudios
de Lingüística Aplicada por van Esch, de Haan y Nas y que
conciernen principalmente los aspectos cuantitativos de los textos.
En este artículo ofrecemos los resultados de los análisis desde la perspectiva cualitativa
en relación a las variables coherencia y cohesión. En la primera parte se tratan aspectos
discursivos de la escritura en L2 y LE. En la segunda parte se presentan la metodología
utilizada y los resultados de los estudios que hemos realizado hasta el momento.
Teijeira Rodríguez,
M., K. van Esch & P. de Haan. La coherencia y la cohesión en textos escritos por
estudiantes neerlandeses de español como LE, in
Estudios de Lingüística Aplicada, 23, 41: 67-100.
This article presents the results of the analysis of coherence and cohesion in a
corpus of native Spanish texts and a corpus of texts written by Dutch students of
Spanish as a foreign language (FL).
The texts analyzed are part of a group of texts collected for a larger research
project carried out at the University of Nijmegen between 2002 and 2005 about the
writing of argumentative essays in English as a FL and in Spanish as a foreign language
and as a native language. The first results were published in 2004 in Estudios de
Lingüística Aplicada by van Esch, de Haan and Nas and are
concerned primarily with quantitative aspects of the texts.
In this article we present the results of the analyses from a qualitative perspective
in relation to the variables coherence and cohesion. The first part discusses discursive
aspects of writing. In the second part we present the methodology adopted, and the
results of the analyses that have been carried out so far.
de Haan, P. & K. van Esch.
The development of writing in English and Spanish as foreign languages,
in Assessing Writing, 10: 100-116.
This article presents the first results of the study of argumentative essays in English as a foreign language and in Spanish as a foreign language and as a native language, carried out at Radboud University Nijmegen between 2002 and 2008. The aim of the project is to compare teachers' holistic assessments with the results of the quantitative analysis of syntactic and lexical features of the essays and measure the development in writing skill from one year to the next. The first part of the current article discusses general aspects of native language and foreign language writing, and establishes the place of our project in the discipline. In the second part we present the methodology adopted, and the results of the quantitative analyses that have been carried out so far.
de Haan, P. & K. van Esch.
Towards an instrument for the assessment of the development of writing skills,
in Connor, U. and Th. Upton (eds.) Applied Corpus Linguistics: A Multidimensional Perspective.
267-279. Amsterdam - New York, NY: Rodopi.
An important aspect of academic foreign language writing courses is assessing and grading the
quality of students' writing products. This can be done by using holistic or analytical scales
or by ranking. What is needed for the specific Dutch context is an instrument geared towards
the specific objectives and context of our foreign language courses, which can help the teacher
to assess students' written products with more validity and which can be used to assess students'
progress over time. A joint project, aiming at developing such an instrument for the specific
Dutch context, has recently started at the departments of English and Spanish in Nijmegen, in The Netherlands.
The present article describes the first step towards developing the above-mentioned instrument:
the set-up of two modest-sized "longitudinal" learner corpora, one for Spanish and one for English.
These corpora will contain learner essays written under controlled conditions and on pre-defined topics.
The first batch of student essays was collected in March 2002. Lexical and syntactic analyses of
these essays will provide a unique insight into the development of the students' writing skills.
A first general quantitative analysis of the essays has already yielded a number of interesting
observations. The article concludes with a tentative suggestion for a more elaborate instrument
to relate student performance to teacher assessment.
van Esch, K., P. de Haan & M. Nas.
El desarrollo de la escritura en inglés y español como lenguas extranjeras, in
Estudios de Lingüística Aplicada, 22, 39: 53-79.
En este artículo se presentan los primeros resultados de un proyecto de investigación realizado en la universidad de Nijmegen entre 2002 y 2005 sobre la escritura de ensayos argumentativos en inglés como lengua extranjera (LE) y en español como LE y lengua materna (L1). Los fines de este proyecto son comparar valoraciones holísticas con los resultados de análisis de características sintácticas y léxicas de los ensayos, medir el desarrollo de los ensayos de un año a otro e investigar el proceso de feedback y la revisión de textos. En la primera parte se tratan aspectos importantes de la escritura como la competencia de escribir, las diferencias y semejanzas entre la escritura en L1 y L2 y las maneras de analizar las características de textos en L2 y LE. En la segunda parte se presentan la metodología utilizada y los resultados de los proyectos llevados a cabo hasta el momento.
van Esch, K., P. de Haan & M. Nas.
El desarrollo de la escritura en inglés y español como lenguas extranjeras, in
Estudios de Lingüística Aplicada, 22, 39: 53-79.
This article presents the first results of a research project on the writing of argumentative essays in English as a foreign language and in Spanish as a foreign language and as a native language, carried out at the University of Nijmegen between 2002 and 2005. The object of the project is to compare teachers' holistic assessments with the results of the analysis of syntactic and lexical features of the essays, measure any development in writing skill from one year to the next, and to study the process of feedback and text revision. The first part discusses major aspects of writing, such as general writing competence, differences between native language writing and second and foreign language writing, and ways to analyse text features in second and foreign language texts. In the second part we present the methodology adopted, and the results of the projects that have been carried out so far.
de Haan, P.
The non-nominal character of spoken English, in Breivik, L.E. and A. Hasselgren
(eds.) From the COLT's mouth ... and others'. 59-69. Amsterdam - New York, NY: Rodopi.
This article seeks to find confirmation for the claim (cf. Biber et al. 1998; Biber et al.
1999; de Haan 2001) that the syntactic differences between spoken and written English
virtually all point to the same conclusion, viz. that the written variety has a strong
nominal character, whereas the spoken variety has a strong verbal, or clausal character.
In other words, the noun phrase, with its noun phrase functions, is the typical central
unit of the structure of written English, whereas the clause, with its clause functions,
is a far more typical unit of the structure of spoken English.
This article is based on the assumption that the non-nominal character of spoken
English is shown in the relative absence of nouns ending in typical nominalisation
suffixes like -ness, -ity, -ance, -ation, etc., as well as in the differences in
syntactic make-up between NPs centred around these nouns in spoken and written English.
The data have been collected from the BNC sampler CD-ROM, which comprises 1 million
words of spoken English and 1 million words of written English.
It is shown that there is a cline from informal spoken language to informative
writing, in that the non-nominal character of spoken English is most outspoken
in informal texts, while more formal spoken interactions have a more nominal
character than imaginative writing. Informative writing has the strongest nominal
character.
References
Biber, D., S. Conrad and R. Reppen. 1998. Corpus linguistics: Investigating language structure and use. Cambridge: Cambridge University Press.
Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan. 1999. Longman grammar of spoken and written English. Harlow: Longman.
de Haan, P.
Whom is not dead?, in Peters, P., P. Collins and A. Smith
(eds.) New Frontiers of Corpus Research. Amsterdam - New York, NY: Rodopi.
This article reviews some of the recent literature on the ongoing discussion about
the use of WHOM in spoken and written English. Some linguists claim that WHOM is
artificially kept alive by prescriptive grammarians, and that it has virtually
disappeared from the spoken language.
A corpus investigation on the occurrence of WHOM in a number of recent English
corpora yields the following:
de Haan, P.
Aspects of the syntax of spoken English, in Aijmer, K. (ed.) A wealth of
English. 47-56. Gothenburg: Gothenburg University Press.
This article presents the preliminary results of a study into a small number of
syntactic characteristics of spoken language, based on the occurrence of frequent
word class sequences. I have recently also used this technique in two experimental
studies of English learner data (de Haan, 1997;
de Haan, 1998).
This study confirms that very crude syntactic differences between spoken and
written data can be established very easily on the basis of the occurrence of
tags and tag sequences.
Secondly, this study shows quite clearly that spoken English has a tendency to
be more "clausal" than written language, which is more "nominal". This tendency
is clear not only from the inspection of the occurrence of the single word classes,
but especially from the word class sequences. As the sequences become longer they
present more detailed syntactic structures, which confirm the findings on the basis
of the shorter sequences.
de Haan, P.
Tagging non-native English with the TOSCA-ICLE tagger, in Mair, C. and M. Hundt
(eds.) Corpus linguistics and linguistic theory. 69-79. Amsterdam: Rodopi.
The TOSCA-ICLE Tagging Unit (TU) has been in use for some time now to tag (part
of) the material in several research centres participating in the ICLE project.
The TU comprises among other things an automatic tagger, and a tag selection
program, which can be used to correct tagger output (the tagger has a success
rate of about 95 per cent with non-native material).
The tagger has been derived from the TOSCA tagger, which was originally designed
for tagging error-free native English (as most taggers are), and was consequently
trained on native material. Applying the TOSCA-ICLE tagger to the ICLE material
presents us with a number of problems that can be said to be unique to learner
material. No guidelines have been proposed in the "ICLE community" so far, as to
how to deal with these problems. Guidelines would be needed, however, for those
ICLE members who wish to apply the Tag Selection Program to produce error-free
tagged versions of their ICLE material.
This article illustrates the kinds of errors learners are apt to make, on the
basis of experience with Czech, Dutch and Spanish ICLE material. These include
keyboard errors, spelling errors, lexical and grammatical errors. A tentative
classification of these errors is presented, as well as a proposed way of dealing
with them.
de Haan, P. English writing by
Dutch-speaking students, in Hasselgård, H. and S. Oksefjell (eds.)
Out of Corpora. 203-212. Amsterdam: Rodopi.
In two recent studies (de Haan, 1997;
de Haan, 1998) I have reported on experimental work
involving the comparison of advanced learners' writing with native students' writing,
in an attempt to gain more insight into the syntactic behaviour of advanced learners of
English. Both studies were based on the investigation of tagged versions of four subsets
of the ICLE corpus (cf. Granger 1993). Sequences of two, three or four tags were taken
to be indicative of syntactic structures. A quantitative study of these sequences revealed
certain differences, but also similarities between the native students and the learners on
the one hand, and among the three groups of learners on the other.
In the current article I make a comparison between the Dutch-speaking students and the
native students, with a view to assessing the quality of the Dutch-speaking students'
writing. Dutch and English are closely related languages and Dutch-speaking students
of English have traditionally been expected to attain a "near-native" level of oral
and written proficiency of English at universities both in The Netherlands and in Belgium.
The study of the ICLE data on both the microscopic and the macroscopic level suggests
that differences in syntactic behaviour between Dutch and native students are due to a
less extensive command of vocabulary by the Dutch learners; a less extensive command of
vocabulary will inevitably lead to a less informative text, giving it a less sophisticated
appearance. It is this aspect of writing that, more than the mechanics of the production
of grammatically correct sentences, needs to be given more attention.
Reference
Granger, Sylviane. 1993. International corpus of learner English, in Aarts, J.,
P. de Haan and N. Oostdijk (eds.): English language corpora: Design, analysis and
exploitation. Amsterdam: Rodopi. 57-69.
de Haan, P. How 'native-like'
are advanced learners of English?, in Renouf, A. (ed.),
Explorations in corpus linguistics. 55-65. Amsterdam: Rodopi.
In this article an attempt is made at assessing the "near-nativeness" of advanced
learners of English with respect to the syntactic structures that they use in their
written production. This is done on the basis of a study of part of three learner
corpora in the ICLE project, and part of the native student control corpus, which is
also part of the ICLE project.
Although there is, as yet, no syntactically analysed learner corpus material available,
some insight into syntactic structuring may be obtained from studying sequences of word
class tags. The material studied has been tagged with an experimental version of the new
TOSCA-ICLE tagger/lemmatizer. The tagged output was used to extract information about tag
sequences.
It is shown that on the whole advanced learners' writing is not easy to distinguish
from native students' writing with respect to the syntactic structures that have been
established via the tag sequences. It is suggested that the differences that are observed
between the advanced learners and the native students may be due to a lack of lexical
knowledge of the learners, which leads to a difference in the way they use certain
syntactic structures. The conclusion is that differences between learner and native
material are more subtle than can be established on the basis of a study of tag sequences
and that for a better comparison we would indeed need to use syntactically fully analysed material.
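The tag-sequence approach described in this abstract can be sketched as counting every contiguous run of two, three or four word class tags. The tag labels below are hypothetical placeholders, not the TOSCA-ICLE tagset, and the helper `tag_sequences` is an assumption for illustration rather than the software used in the study. In practice one would probably count within sentence boundaries rather than across them.

```python
from collections import Counter


def tag_sequences(tags, n):
    """Count every contiguous sequence of n tags in a list of tags."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


# Hypothetical tagged sentence; the labels are illustrative only.
tags = ["DET", "ADJ", "NOUN", "VERB", "DET", "ADJ", "NOUN"]

bigrams = tag_sequences(tags, 2)
trigrams = tag_sequences(tags, 3)
fourgrams = tag_sequences(tags, 4)
```

Comparing such counters for learner and native texts gives the kind of quantitative picture the abstract refers to, although, as the abstract concludes, it falls short of a full syntactic analysis.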
de Haan, P. On the use of rank-frequency distributions,
in Fries, U., V. Müller & P. Schneider (eds.), From Ælfric to the New York Times.
125-137. Amsterdam: Rodopi.
This article reports on two experiments on rank-frequency distributions.
First an attempt is made at using rank-frequency distributions as a way of
identifying different authors in a collection of texts, by means of
correspondence analysis (ANACOR) plots.
Next, the problems involved in determining the relationship
between a population and samples taken from this population are gone into.
It is shown that a rank-frequency distribution of a sample cannot be
taken as a reliable estimate of the rank-frequency distribution of the
population from which that sample has been taken.
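As a minimal sketch of the object under discussion, a rank-frequency distribution can be built by sorting word frequencies in descending order and numbering them from rank 1. The function name `rank_frequency` and the toy data are assumptions for illustration; the article's own procedure and the ANACOR analysis are not reproduced here.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent item first."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))


population = "a b a c a b".split()
sample = population[:3]  # a small sample taken from the population

population_dist = rank_frequency(population)
sample_dist = rank_frequency(sample)
```

Even in this toy case the sample's distribution has fewer ranks than the population's, which hints at why a sample distribution need not be a reliable estimate of the population distribution.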
de Haan, P. An experiment in English learner data
analysis, in Aarts, J., I. de Mönnink & H. Wekker (eds.), Studies in English language
and teaching. 215-229. Amsterdam: Rodopi.
This article presents the first results of an attempt at analysing advanced English learners'
performance data. Learner data of students of English with three different native languages
are compared with each other and with native students' data. The aim is to establish typical
learner usage patterns. This is done on the basis of word class combinations.
The data examined are part of ICLE, the International Corpus of Learner English.
Four sets of data are examined: Dutch, Finnish, French and native students' data.
It is shown that it is not possible to draw any firm conclusions on the basis of
word class combinations, and that fully syntactically analysed material would be
needed to study any typical learner usage patterns.
However, while it is shown that the native students make use of a greater vocabulary
range, they are shown to have a tendency to use fewer different types of tag combinations.
The differences between the natives and the learners become greater as the tag combinations
become bigger. This suggests that the learners, even though they are very advanced, have
not incorporated into their system a typical use of grammatical patterning to the same
extent as the native speakers. This may be accounted for by a greater exposure of the
natives to their language.
de Haan, P. Syntactic characteristics of dialogue and non-dialogue
sentences in fiction writing, in Ljung, M. (ed.)
Corpus-based studies in English. 101-117. Amsterdam: Rodopi.
In de Haan (1996) it is proposed that a detailed study be carried out into intra-textual
variation within fiction texts, with a view to gaining more insight into the specific
syntactic characteristics of dialogue and non-dialogue sentences.
Previous studies (e.g. Biber and Finegan, 1986) convincingly show that on several
dimensions fiction texts take a kind of middle position between more formal writing
on the one hand, and spoken (face-to-face) conversation on the other. However, no
account is taken, in these studies, of any possible intra-textual differences. More
recently, it has been shown (de Haan, 1992) that the length and the clause patterns
of dialogue sentences differ significantly from those of non-dialogue sentences in
fiction texts.
Likewise, Biber and Finegan (1994), using an MF/MD analysis based on the study of a
great number of syntactic features, have studied intra-textual variation within medical
research articles. They show that there are significant differences between different
sections of such articles. These differences can be interpreted in the light of the
different communicative functions of the various sections.
In the present article it is shown, on the basis of a study of a number of syntactic
characteristics in a number of fiction texts, that the differences between dialogue
and non-dialogue sentences can be described more generally in syntactic terms. Also,
a number of author-specific characteristics which play a role in this are gone into.
References:
Biber, D. and E. Finegan. 1986. 'An initial typology of English text types', in Aarts, J. and W. Meijs (eds), Corpus Linguistics II. 19-46. Amsterdam: Rodopi.
Biber, D. and E. Finegan. 1994. 'Intra-textual variation within medical research articles', in Oostdijk, N. & P. de Haan (eds), Corpus-based research into language. 201-221. Amsterdam: Rodopi.
de Haan, P. and H. Van Halteren.
The TOSCA-ICLE Tagset.
Software Manual. 24 pp. Nijmegen: TOSCA Research Group, Department of Language and Speech.
The TOSCA-ICLE tagger/lemmatizer is a software suite for tagging English text with wordclass tags from the TOSCA-ICLE tagset, designed for the International Corpus of Learner English (ICLE). The suite contains an automatic tagger/lemmatizer, a tagged file reformatter and a program for manual tag selection. This publication is the user manual for installation and use of the software suite.
de Haan, P. More on the language of
dialogue in fiction, in
ICAME Journal, 20: 23-40.
This paper takes Nelleke Oostdijk's (1990) article 'The language of dialogue in fiction'
as its starting point. Oostdijk presents her findings in support of the claim that dialogue
in fiction has its own characteristics, which makes it like, but at the same time also
different from, spoken language. It is like spoken language, as it clearly reflects the
author's attempt to represent spoken conversation. It is clearly unlike spoken language
as it is planned, revised, and edited. Oostdijk assumes a continuous scale with planned
discourse at one extreme and unplanned discourse at the other, suggesting that dialogue
in fiction is found somewhere between these extremes.
Oostdijk presents an overall picture of five fiction samples from the TOSCA corpus,
but suggests that a number of observations point to the existence of idiosyncratic
variation among authors. It is at this point that I start, in an attempt to find out
in which characteristics this idiosyncratic variation manifests itself most clearly.
I look into the following aspects:
Reference
Oostdijk, N. (1990), 'The language of dialogue in fiction', in: Literary and Linguistic Computing, Vol. 5: 235-241.
Schils, E. & P. de Haan. Mortons Cusum ontrafeld, in
Gramma/TTT, 3: 129-141.
In The Qsum Plot (Morton and Michaelson, 1990) it is claimed
that cumulative sum charts (cusum charts, for short, or cusum
plots) can be used to determine authorship of different text
fragments or detect corruptions of texts. This is done by
plotting the cusum charts of, for instance, the number of nouns
of the sentences together with that of the total number of
words in the sentences of a text fragment. This procedure is
based on the assumption that authors have a certain, more or
less constant "habit" in their use of, for instance, the number
of nouns relative to the total number of words they use in the
sentences they produce. It also assumes that the habits authors
have are so idiosyncratic that even the slightest corruption
can be detected. For instance, an insertion of five sentences
by Hardy into 40 or so sentences by Scott can be readily shown.
Morton's method has been criticised recently in a number
of publications. All of these express serious doubts as to the
validity of the procedure. Indeed, our own corpus data suggest
that there is considerable intra-author variation.
In this article, however, the main emphasis is on
showing the unreliability of the cusum technique.
Reference:
Morton, A., and S. Michaelson. 1990. The Qsum plot. Internal Report CSR-3-90. University of Edinburgh: Department of Computer Science.
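As a rough illustration of the cusum charts described in the abstract above, the following Python sketch computes the running total of deviations from the mean for two per-sentence counts. The counts are invented for the example, the function name `cusum` is hypothetical, and the definition used (cumulative sum of deviations from the sample mean) is the common textbook one; that it matches the exact procedure in the report is an assumption.

```python
from itertools import accumulate


def cusum(values):
    """Return the cumulative sum of deviations from the mean of `values`."""
    if not values:
        raise ValueError("values must not be empty")
    mean = sum(values) / len(values)
    return list(accumulate(v - mean for v in values))


# Hypothetical per-sentence counts for a six-sentence fragment.
words_per_sentence = [12, 20, 8, 15, 25, 10]
nouns_per_sentence = [3, 5, 2, 4, 7, 3]

word_chart = cusum(words_per_sentence)  # one chart point per sentence
noun_chart = cusum(nouns_per_sentence)  # the "habit" chart for the same sentences
```

Plotting the two lists against sentence number gives the kind of paired chart the abstract refers to. By construction each chart returns to zero at the last sentence.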
Oostdijk, N. & P. de Haan (eds.).
Corpus-based research
into language. Amsterdam:
Rodopi.
The topics in this book are centred around the syntactic analysis of corpora. It was decided to focus on contributions that highlight recent developments in related work in the field of corpus linguistics done elsewhere. This means that the editors have taken the research areas pursued by the TOSCA Group in Nijmegen as a lead to the thematic arrangement of the contributions in this book.
Three parts can be distinguished:
de Haan, P. & E. Schils. The Qsum plot exposed, in Fries,
U., G. Tottie & P. Schneider (eds.), Creating and using
English language corpora. 93-105. Amsterdam: Rodopi.
In The Qsum Plot (Morton and Michaelson, 1990) it is claimed that cumulative sum charts
(cusum charts, for short, or cusum plots) can be used to determine authorship of different
text fragments or detect corruptions of texts. This is done by plotting the cusum charts of,
for instance, the number of nouns of the sentences together with that of the total number of
words in the sentences of a text fragment. This procedure is based on the assumption that authors have
a certain, more or less constant 'habit' in their use of, for instance, the number of
nouns relative to the total number of words they use in the sentences they produce.
It also assumes that the habits authors have are so idiosyncratic that even the slightest
corruption can be detected. For instance, an insertion of five sentences by Hardy into
40 or so sentences by Scott can be readily shown.
Morton's method has been criticised recently in a number of publications. All of these
express serious doubts as to the validity of the procedure. Indeed, our own corpus data
suggest that there is considerable intra-author variation.
In this article, however, the main emphasis is on showing the unreliability of the
cusum technique.
Reference
Morton, A. and Michaelson, S. (1990), 'The Qsum plot'. Internal report CSR-3-90. University of Edinburgh: Department of Computer Science.
Oostdijk, N. & P. de Haan. Clause patterns in modern British
English. A corpus-based (quantitative) study, in ICAME
Journal, 18: 41-79.
In this article a research project is discussed which is carried out by the TOSCA Research Group. It aims to provide a survey of the frequency of occurrence and the distribution of a range of syntactic structures in Modern British English. The project makes use of the Nijmegen Corpus. This computerised corpus, comprising approximately 130,000 words, has undergone a detailed syntactic analysis and is available for explorative studies. This article reports on findings with regard to the clause patterns encountered in the material.
de Haan, P. Noun phrase structure as an indication of text
variety, in Anglistik &
Englischunterricht, 49: 85-106. (Special issue:
The noun phrase in English: Its structure and
variability).
This article discusses the relationships that exist among the type and structure
of the noun phrase and its function and the text type in which it occurs. The
discussion is based on an examination of the distribution of various noun phrase
types in different functions in a collection of texts, the Nijmegen Corpus.
It is demonstrated that the functions of the noun phrases, as well as their combinations,
play an important role in the distribution of noun phrase types and noun phrase structures.
In addition, the type of text contributes to this distribution.
Moreover, on the level of the noun phrase postmodifier, it is demonstrated that both
the actual number of postmodifiers and the different kinds of postmodifiers are distributed
differently in the texts examined.
The general conclusion is that in the non-fiction texts more complex noun phrases
occur than in the fiction texts. This difference is especially noticeable in subject
noun phrases. However, underlying this more general tendency there are others, some
of which reinforce this general tendency, whereas others partly cancel it. Although
the differences in distribution are not dramatic, they are statistically significant
and would warrant the conclusion that noun phrase structure is an indication of text type.
de Haan, P. Sentence length in running text, in Souter, C. &
E. Atwell (eds.), Corpus-based computational
linguistics. 147-161. Amsterdam: Rodopi.
This article is an attempt to determine what distinguishes running text from random collections of sentences. In an experiment in which a 20,000-word sample was reduced to smaller stretches of words, de Haan (1992) showed that:
The purpose of this article is to make a first step towards a better understanding of the nature of the relationships that hold between consecutive sentences in a running text. The main emphasis in this article is on the sentence length. It is considered by means of a number of different statistical techniques, viz. autocorrelations, moving averages and the mean squared jump.
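Of the techniques named above, the moving average is the simplest to show. The sketch below is a minimal Python example, not the article's own procedure: the function name `moving_average`, the window size, and the sentence lengths are all assumptions made for illustration.

```python
def moving_average(values, window):
    """Return the simple moving average of `values` over `window` items."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical sentence lengths (in words) of five consecutive sentences.
sentence_lengths = [10, 20, 30, 40, 50]
smoothed = moving_average(sentence_lengths, 3)
```

Smoothing sentence lengths in this way makes slow drifts in length across a text easier to see than in the raw sequence.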
Aarts, J., P. de Haan & N. Oostdijk (eds.). English
language corpora: Design, analysis and exploitation. Papers
from the 13th ICAME conference. Amsterdam: Rodopi.
This volume contains a selection of the papers read at the Thirteenth Conference on the
Use of Computer Corpora in English Language Research (ICAME 13), which was held at Nijmegen,
the Netherlands, in June 1992. The selection has been made so as to represent the three major
activities involved in the use of computer corpora in linguistic research: the design and
compilation of corpora, their grammatical analysis and their subsequent exploitation.
In his contribution to this volume, Gerry Knowles uses the word 'antiquated' with reference
to the Lancaster/IBM Spoken English Corpus - a corpus that was compiled in the mid-1980s.
This is just one of many indications that things are moving fast in the discipline of corpus
linguistics. A volume of papers that aims to give an impression of the state of the art
should therefore try to follow developments as closely as possible; not only in the chronological
sense of trying to publish an account of the most recent research efforts but also in the
sense of paying close attention to the day-to-day practice of corpus linguistic research.
For that reason, many of the papers in this volume have the nature of progress reports of
ongoing projects, while others discuss quite concretely the sort of problems the corpus
linguist is confronted with in the everyday practice of his research.
Two general trends that have recently been observed in the international corpus linguistics
community are also reflected in this collection of papers. One of these is in the field
of corpus design. Since extremely large corpora have appeared on the scene (British National
Corpus, Bank of English) the focus in corpus construction has shifted away from the smaller
balanced corpus of contemporary English like LOB and Brown, to corpora that either contain
some special variety of the contemporary language or texts from some earlier stage of the
English language. Especially the latter type of corpus, coming in the wake of the Helsinki
corpus, seems to be enjoying a new popularity. Another recent feature of corpus linguistic
research has been the growing concern for standardization; this concern is also audible in
some of the papers in this volume.
As the title suggests, the volume is divided into three sections, each dealing with
one of the major aspects of corpus linguistic research: corpus compilation and design,
corpus analysis and corpus exploitation.
In the section on corpus compilation the popularity of historically oriented corpora
is shown by the fact that three papers report on the construction of such corpora.
Kytö writes on the Corpus of Early American English, which is meant to supplement
the Helsinki Corpus of English Texts, Lancashire reports on the Toronto-based Early Modern
English Renaissance Dictionaries Corpus, while Wright's topic is the Cambridge Corpus of
Early Modern English (1600-1800). Special varieties of contemporary English are contained
in the corpora discussed by Fang, Granger, and Collot and Belmore. Fang's paper is a progress
report on the compilation, carried out in Hong Kong, of a corpus of computer science texts.
Collot and Belmore introduce a corpus of a whole new language medium: 'electronic language'
as they call it, that is, the English of e-mail messages. Granger discusses her corpus of
non-native, learner English, which is being compiled in the ICLE project (International
Corpus of Learner English), a companion project of the ICE project. The British National
Corpus is one of the recent undertakings aiming at the construction of an extremely large
corpus. Burnage and Dunlop's paper provides an insight into the many practical problems
that arise in a project of this size and also discusses some aspects of the question of
standardization. Blackwell discusses the problems a corpus linguist has to cope with when
faced with an influx of millions of words every month. Knowles, in the last paper of this
section, deals with a rather different problem; that of converting the Lancaster/IBM Spoken
English Corpus into a speech database known as MARSEC (Machine Readable Spoken English Corpus).
The corpus analysis section opens with a survey given by Eyes and Leech of the projects
currently going on at UCREL, the Unit for Computer Research on the English Language at
Lancaster University. They pay special attention to the role of the human analyst in
their tagging and parsing system and discuss the question how quality control should
be realized in such a system. Van Halteren and Oostdijk describe the TOSCA analysis
system developed at the University of Nijmegen, which was demonstrated at the Nijmegen
ICAME Conference. Briscoe and Waegner address the problem of undergeneration, i.e. the
fact that a grammar-based parser may not yield the contextually correct structure for
some sentences in a corpus. They emphasize that a solution to this problem should be
both robust and domain-independent. For any analysis system a grammar workbench is a
much-needed if not indispensable tool for grammar writing. In their article, Nederhof
and Koster discuss the workbench developed for grammars written in the AGFL (Affix
Grammar over Finite Lattices) formalism and the ideas underlying it. AGFLs are used
in the TOSCA analysis system. In the last paper of this section, Souter compares five
parsed English corpora with full grammatical annotation and concludes that there is a
serious need for prototype standards; these should accommodate current practice in
corpus analysis and not make the analyst's job more difficult than it already is.
In the last section, various aspects of corpus exploitation are discussed: tools, aims
and results. Quinn discusses the Corpus Utility Program (ICECUP) developed within the
context of the ICE project, outlining the linguistic requirements such a tool has to
meet and the software principles on which ICECUP is based. Concordance programs to
investigate collocational patterns have been around for some time, but really huge
corpora are a comparatively recent phenomenon; Collier shows that the increase in
size imposes a different design on such programs and discusses what can be done about
it. That for some aims of corpus exploitation the availability of huge corpora is a
prerequisite is apparent from Barkema's corpus-based study of degrees of idiomaticity.
He has found that even a 20-million-word corpus yielded insufficient data for such a
topic. One type of corpus that has now become a reality is the monitor corpus. Monitor
corpora are needed if one wants to investigate what new words are finding their way into
the English language. Renouf reports on such research, which is now taking place in the
Research and Development Unit for English Studies of the University of Birmingham. In
two further papers the results of corpus exploitation are presented. Altenberg investigates
the frequency and use of recurrent verb-complement constructions in the London-Lund
Corpus of Spoken English, while Peters presents the findings of a study of some points
of disputed usage on the basis of the Brown and LOB corpora. In the last paper of this
volume Meijs discusses the use of a computerized lexical knowledge system for the
analysis of nominal compounds.
Schils, E. & P. de Haan. Characteristics of sentence length
in running text, in Literary & Linguistic
Computing, 8: 20-26.
This article reports on the examination of the distribution of sentence lengths in a
number of running texts. More specifically, the question we try to find an answer to
is what the relation is between the lengths of adjacent sentences.
We start our examination on the basis of the hypothesis that authors may aim at
an alternation of long and short sentences. The autocorrelation scores
of sentence lengths, however, show that alternation of sentence lengths is not a
consistent feature in any of the five corpus texts that we examined. Rather, the
opposite is true: two adjacent sentences are more likely to have similar lengths.
This is especially true in the fiction texts.
We then discuss two different ways of assessing the degree to which adjacent
sentences in running texts are more likely to have equal lengths than any random
selection of two sentences, viz. the mean squared jumps (i.e. a measure derived
from the difference in length between two adjacent sentences) and the autocorrelation of jumps.
Finally, we test whether within a text there are fragments where alternation of
sentence lengths might occur even though the text as a whole was characterized,
by the other tests, as displaying no alternation. This is done by looking at consecutive
fragments of 15 adjacent sentences, and relating the number of fragments with a certain
degree of negative autocorrelation with the number of fragments that can be expected to
have a negative autocorrelation under randomness. The two fiction texts are shown to have
more fragments with a negative autocorrelation than can be expected under a random sequence
hypothesis. A number of patterns for these fragments are established. They are exemplified
in the last section.
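The two measures discussed in this abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' computation: the function names are hypothetical, the autocorrelation is the usual lag-1 estimate, the mean squared jump is taken to be the mean of the squared differences between adjacent sentence lengths, and both length sequences are invented.

```python
def lag1_autocorrelation(lengths):
    """Lag-1 autocorrelation: negative for alternation, positive for similarity."""
    n = len(lengths)
    mean = sum(lengths) / n
    deviations = [x - mean for x in lengths]
    denominator = sum(d * d for d in deviations)
    if denominator == 0:
        raise ValueError("all lengths are equal; autocorrelation is undefined")
    numerator = sum(deviations[i] * deviations[i + 1] for i in range(n - 1))
    return numerator / denominator


def mean_squared_jump(lengths):
    """Mean of the squared differences between adjacent sentence lengths."""
    if len(lengths) < 2:
        raise ValueError("need at least two sentences")
    jumps = [b - a for a, b in zip(lengths, lengths[1:])]
    return sum(j * j for j in jumps) / len(jumps)


# Hypothetical sentence lengths (in words).
alternating = [5, 25, 5, 25, 5, 25]  # long and short sentences alternate
similar = [5, 6, 7, 20, 21, 22]      # neighbours tend to have similar lengths

r_alternating = lag1_autocorrelation(alternating)
r_similar = lag1_autocorrelation(similar)
msj_alternating = mean_squared_jump(alternating)
msj_similar = mean_squared_jump(similar)
```

In this toy data the alternating sequence gives a negative autocorrelation and a large mean squared jump, while the sequence with similar neighbours gives a positive autocorrelation and a much smaller mean squared jump. The same functions could be applied to consecutive fragments of 15 sentences to count fragments with negative autocorrelation.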
de Haan, P. The optimum corpus sample size?, in Leitner, G.
(ed.), New dimensions in English language corpora.
Methodology, results, software development. 3-19.
Berlin - New York: Mouton de Gruyter.
Numerous corpus studies have been carried out in the past two to three decades,
many of them on the standard corpora of American and British English (Brown and LOB),
both consisting of 500 samples of 2000 words each. One element that has so far not
been given much attention is the possible effect of the size of the corpus samples
on the research results. As the corpora were assumed to represent a broad cross-section
of English, and the focus originally was mainly on frequency and distribution of lexical
items, the samples were considered to be sufficiently large to yield reliable results.
More recently corpus linguists' interest has gone beyond the lexical level to the
level of syntactic description. The compilation of corpora, in the meantime, has
continued very much in the style of the first two corpora, implying that they still
usually consist of samples of 2000 words. The question arises whether samples of 2000
words are sufficiently large to yield reliable information on the frequency and distribution
of syntactic structures.
Experience with samples of 20,000 words has shown that on the whole these are sufficiently
large to yield statistically reliable results on frequency and distribution (cf. e.g. de
Haan 1989: 50-51), but even they are sometimes too small, especially when complex interactions
are studied (cf. e.g. de Haan & van Hout 1986: 89), simply because the number of
observations yielded is too small. The conclusion seems to be that the suitability of
the sample depends on the specific study that is undertaken, and that there is no such
thing as the best, or optimum, sample size as such.
An experiment was conducted, aiming to assess the impact of reducing the sample
size on the outcome of certain statistical analyses. The conclusion is that there
are certainly studies that can only be successfully carried out on the basis of larger
samples. This is not necessarily only related to the number of observations, but also
to the nature of the phenomenon studied. This would appear to be especially true for
the study of complex interactions, which requires more sophisticated statistical techniques.
Reference
Haan, P. de and R. van Hout (1986) Statistics and corpus analysis: A loglinear analysis of syntactic constraints on postmodifying clauses, in Aarts, J. and W. Meijs (eds.), Corpus linguistics II. Amsterdam: Rodopi. 79-98.
de Haan, P. On the exploration of corpus data by means of
problem-oriented tagging: Postmodifying clauses in the English
noun phrase, in Johansson, S. & A.-B. Stenström (eds.),
English computer corpora. Selected papers and research
guide. 51-65. Berlin - New York: Mouton de Gruyter.
This article is a report on some of the results of a project that was completed in 1989.
The aim of the project was to give a detailed description of a number of syntactic properties
of postmodifying clauses (henceforth PMCs) in the English noun phrase, and to look at
the way in which some of these properties are related to each other.
As the study is based on an examination of corpus examples, only surface structures
have been considered. The noun phrase is described basically in terms of four constituents:
| Determiner | Premodifier | Head | Postmodifier |
de Haan, P. TOSCA and beyond, in Jones, L.M. (ed.),
Using Corpora. Proceedings of the seventh annual
conference of the university of Waterloo centre for the new
OED and text research. 92-109. Waterloo: UW Centre for
the New OED and Text Research.
This paper presents a brief overview of the activities that were undertaken by the
TOSCA Research Group in the 1980s. The emphasis in Nijmegen has always been on the
syntactic analysis of corpus data.
The first project in which the TOSCA group was involved entailed manual tagging of
the data and a subsequent automatic syntactic analysis on the basis of a context-free
grammar. The experience gained in this project led us to adopt a different approach,
viz. of making the analysis an automated process, on the basis of the formalism of
the Extended Affix Grammar. This formalism was used in various projects, involving
different languages. A relatively large corpus of British English texts was compiled,
especially designed for the study of language variation. For the exploration of the
analysed material a linguistic database was designed, containing the analysed sentences
in the form of tree structures.
The paper concludes with the presentation of two projects in which specific
applications of corpus research are discussed. One of them aimed at a quantitative
description of the most common syntactic structures in English, while the other was
concerned with characteristics of running text.
de Haan, P. Postmodifying clauses in the
English noun
phrase. A corpus-based study. Amsterdam: Rodopi.
This book reports on the results of a project whose aim was to give a detailed description of a number of syntactic properties of postmodifying clauses in the English NP, and to study the way in which some of these properties are related to each other. The study is based on an examination of corpus data. The structure of the NP is described basically in terms of four constituents:
| Determiner | Premodifier | Head | Postmodifier |
de Haan, P.
A corpus investigation into the behaviour of prepositional verbs, in Kytö, M.,
O. Ihalainen & M. Rissanen (eds.), Corpus linguistics, hard and soft. 121-135.
Amsterdam: Rodopi.
This article presents a report on an investigation into the
behaviour of so-called prepositional verbs in English. Three
aspects are looked into, viz. the occurrence of prepositional
verbs in passive verb phrases, the occurrence of adverbial
insertion between the verb and the preposition, and the
position of the preposition in relative clauses.
It is argued that the fact that combinations of verbs and
prepositions occur in passive verb phrases is the only
justification for their classification as prepositional verbs
in English. Nevertheless, the question remains whether within
the group of prepositional verbs a distinction can be made
between combinations that are more close and those that are
less close. This distinction may be visible in their behaviour
in the three syntactic environments mentioned above.
The investigation of a small amount of corpus data shows us
that there are certain tendencies in this respect. However, it
appears that a larger amount of corpus data would have to be
examined before we can draw any definite conclusions.
de Haan, P.
Relative clauses in indefinite noun phrases, in English Studies, 68: 171-189.
The discussion of relative clauses is often limited to that of
relative clauses in definite NPs. Some attention is usually
paid, especially in introductory works on practical English
grammar, to the distinction between restrictive and non-
restrictive clauses, but hardly any attention is paid to the
occurrence of relative clauses in indefinite NPs.
This article makes a contribution to the insight, within
descriptive linguistics, into the factors that influence the
occurrence of relative clauses in indefinite NPs. The purpose
of the article is twofold. It shows that, in general, the
function of modifiers in indefinite NPs is different from that
of modifiers in definite NPs. It also shows that in the case of
relative clauses there are a number of additional factors
influencing their occurrence in definite and indefinite NPs.
In particular, the NP head, its reference and function in the
superordinate structure play a role. Also the relative pronoun
and the clause pattern in the relative clause itself are seen
to be affected by the indefinite nature of the NP.
de Haan, P.
Exploring the linguistic database: Noun phrase complexity and language variation, in
Meijs, W. (ed.), Corpus linguistics and beyond. 151-165. Amsterdam: Rodopi.
Since Aarts (1971) no attempt has been made to demonstrate the
relationship between the distribution of NP type and text
variety, although Aarts hinted at the existence of such
relationships. A more complex analysis of corpus material to
investigate this relationship has recently become possible
thanks to the Linguistic DataBase and the loglinear analysis.
An analysis of the written part of the Nijmegen corpus (approx.
120,000 words) shows the existence of complex relationships
between NP type, NP function and text variety. Four NP types
are distinguished, ranging from basic to complex, while three
NP functions are distinguished: subject, direct object and
prepositional complement. The two text types distinguished are
fiction and non-fiction. The relationships between these
variables are shown to be highly significant.
Apart from the outcome of the statistical analysis of the data,
the article provides an excellent example of the interactive
use of the Linguistic DataBase.
Reference:
Aarts, F. 1971. On the distribution of noun-phrase types in English clause-structure, in Lingua, 26: 281-293.
This page is maintained by Pieter de Haan
<P dot deHaan at-sign let dot ru dot nl>