(Last modified: November 11, 2010)

Dr. Pieter de Haan

Publication abstracts:

2008 2007 2006 2005
2004 2002 2001 2000 1999 1998 1997 1996
1995 1994 1993 1992 1991 1989 1988 1987

2008

de Haan, P. & K. van Esch. Measuring and assessing the development of foreign language writing competence, in Porta Linguarum, 9: 7-21.

This paper discusses the development of writing skills of Dutch students of English and Spanish as foreign languages. Essays written in three consecutive years were analyzed for essay length, word length, and type/token ratio - reflecting linguistic competence. A selection of essays was analyzed for argument structure and the use of cohesive devices. These same essays were ranked holistically by experienced lecturers.
Students develop linguistic and discourse competences, but differ according to language, proficiency level, and year of study. Assessors' arguments for ranking are related mainly to the students' linguistic competence. The implications of our findings for research and teaching are discussed.
Key words: foreign language writing, discourse competence, writing development, assessment.
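The three general measures named in this abstract (essay length, word length, and type/token ratio) can be illustrated with a minimal sketch. The tokenizer below is a simple assumption for illustration only, not the tooling used in the study, and the function name basic_measures is hypothetical. Note that the type/token ratio tends to fall as texts get longer, so it is best compared across essays of similar length.

```python
import re

def basic_measures(essay: str) -> dict:
    """Return essay length (in tokens), mean word length, and type/token ratio."""
    # Assumed tokenizer: runs of letters, optionally joined by an
    # internal apostrophe or hyphen, compared case-insensitively.
    tokens = re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", essay.lower())
    if not tokens:
        return {"tokens": 0, "mean_word_length": 0.0, "type_token_ratio": 0.0}
    types = set(tokens)  # distinct word forms
    return {
        "tokens": len(tokens),
        "mean_word_length": sum(len(t) for t in tokens) / len(tokens),
        "type_token_ratio": len(types) / len(tokens),
    }

measures = basic_measures("The cat saw the dog")
```

Here "the" occurs twice, so the five tokens contain four types.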

2007

de Haan, P. English writing by Dutch-speaking students, in Teubert, W. & R. Krishnamurthy (eds.) Corpus Linguistics (Critical Concepts in Linguistics). Vol VI. London - New York: Routledge. 92-101.

In two studies (de Haan, 1997; de Haan, 1998) I have reported on experimental work involving the comparison of advanced learners' writing with native students' writing, in an attempt to gain more insight into the syntactic behaviour of advanced learners of English. Both studies were based on the investigation of tagged versions of four subsets of the ICLE corpus (cf. Granger 1993). Sequences of two, three or four tags were taken to be indicative of syntactic structures. A quantitative study of these sequences revealed certain differences, but also similarities, between the native students and the learners on the one hand, and among the three groups of learners on the other.
In the current article I make a comparison between the Dutch-speaking students and the native students, with a view to assessing the quality of the Dutch-speaking students' writing. Dutch and English are closely related languages and Dutch-speaking students of English have traditionally been expected to attain a "near-native" level of oral and written proficiency in English at universities both in The Netherlands and in Belgium.
The study of the ICLE data at both the microscopic and the macroscopic level suggests that differences in syntactic behaviour between Dutch and native students are due to a less extensive command of vocabulary by the Dutch learners; a less extensive command of vocabulary will inevitably lead to a less informative text, giving it a less sophisticated appearance. It is this aspect of writing that, more than the mechanics of the production of grammatically correct sentences, needs to be given more attention.

Reference

Granger, Sylviane. 1993. International corpus of learner English, in Aarts, J., P. de Haan and N. Oostdijk (eds.): English language corpora: Design, analysis and exploitation. Amsterdam: Rodopi. 57-69.

de Haan, P. & K. van Esch. Assessing the development of foreign language writing skills: Syntactic and lexical features, in Fitzpatrick, E. (ed.) Corpus Linguistics Beyond the Word: Corpus Research from Phrase to Discourse. 185-202. Amsterdam - New York, NY: Rodopi.

In de Haan & van Esch (2004; 2005) we outline a research project designed to study the development of writing skills in English and Spanish as foreign languages, based on theories developed, for instance, in Shaw & Liu (1998) and Connor & Mbaye (2002). This project entails collecting essays written by Dutch-speaking students of English (EFL writing) and Dutch-speaking students of Spanish (SFL writing) at one-year intervals, in order to study the development of their writing skills, both quantitatively and qualitatively. The essays are written on a single prompt, taken from Grant & Ginther (2000), asking the students to select their preferred source of news and give specific reasons to support their preference. Students' proficiency level is established on the basis of holistic teacher ratings.
A first general analysis of the essays has been carried out with WordSmith Tools. Moreover, the texts have been computer-tagged with Biber's tagger (Biber, 1988; 1995). An initial analysis of relevant text features (Polio, 2001) has provided overwhelming evidence of the relationship between a number of basic linguistic features and proficiency level (de Haan & van Esch, 2004; 2005).
In the current article we present the results of more detailed analyses of the EFL material collected from the first cohort of students in two consecutive years, 2002 and 2003, and discuss a number of salient linguistic features of students' writing skills development. We first discuss the development of general features such as essay length, word length and type/token ratio. Then we move on to discuss how the use of specific lexical features (cf. Biber, 1995; Grant & Ginther, 2000) has developed over one year in the three proficiency level groups that we have distinguished. While the development of the general features over one year is shown to correspond logically to what can be assumed to be increased proficiency, the figures for the specific lexical features studied do not all point unambiguously in the same direction.

References

Biber, D. 1988. Variation across speech and writing. Cambridge: Cambridge University Press.

Biber, D. 1995. Dimensions of register variation: A cross-linguistic comparison. Cambridge: Cambridge University Press.

Connor, U. & Mbaye, A. 2002. Discourse approaches to writing assessment, Annual Review of Applied Linguistics, 22: 263-278.

Grant, L. & Ginther, A. 2000. Using computer-tagged linguistic features to describe L2 writing differences, Journal of Second Language Writing, 9: 123-145.

Polio, C. 2001. Research methodology in L2 writing assessment, in T. Silva & P. K. Matsuda (eds.), On Second Language Writing. Mahwah, NJ: Lawrence Erlbaum Associates. 91-115.

Shaw, P. & Liu, E. 1998. What develops in the development of second-language writing?, Applied Linguistics, 19: 225-254.

2006

van Esch, K., P. de Haan, L. Frissen, I. González Santero & A. de la Torre Miranda. Evolución en la competencia escrita de estudiantes de español como lengua extranjera, in Estudios de Lingüística Aplicada, 24, 43: 55-76.

This study is a continuation of the studies by van Esch, de Haan & Nas and de Haan & van Esch on the development of English and Spanish foreign language writing, and by Teijeira Rodríguez, van Esch & de Haan on the structure, coherence and cohesion in Spanish essays written by Dutch university students and by native Spanish students. The central question in the current paper is whether there is a significant development in the writing produced by non-native students in the direction of what is considered typical of native Spanish texts. The paper concludes with suggestions for further research in this area and practical applications for the teaching of Spanish as a foreign language.

2005

Teijeira Rodríguez, M., K. van Esch & P. de Haan. La coherencia y la cohesión en textos escritos por estudiantes neerlandeses de español como LE, in Estudios de Lingüística Aplicada, 23, 41: 67-100.

This article presents the results of the analysis of coherence and cohesion in a corpus of native Spanish texts and a corpus of texts written by Dutch students of Spanish as a foreign language (FL).
The texts analyzed are part of a group of texts collected for a larger research project carried out at the University of Nijmegen between 2002 and 2005 about the writing of argumentative essays in English as a FL and in Spanish as a foreign language and as a native language. The first results were published in 2004 in Estudios de Lingüística Aplicada by van Esch, de Haan and Nas and are concerned primarily with quantitative aspects of the texts.
In this article we present the results of the analyses from a qualitative perspective in relation to the variables coherence and cohesion. The first part discusses discursive aspects of second and foreign language writing. In the second part we present the methodology adopted, and the results of the analyses that have been carried out so far.

de Haan, P. & K. van Esch. The development of writing in English and Spanish as foreign languages, in Assessing Writing, 10: 100-116.

This article presents the first results of the study of argumentative essays in English as a foreign language and in Spanish as a foreign language and as a native language, carried out at Radboud University Nijmegen between 2002 and 2008. The aim of the project is to compare teachers' holistic assessments with the results of the quantitative analysis of syntactic and lexical features of the essays and measure the development in writing skill from one year to the next. The first part of the current article discusses general aspects of native language and foreign language writing, and establishes the place of our project in the discipline. In the second part we present the methodology adopted, and the results of the quantitative analyses that have been carried out so far.

2004

de Haan, P. & K. van Esch. Towards an instrument for the assessment of the development of writing skills, in Connor, U. and Th. Upton (eds.) Applied Corpus Linguistics: A Multidimensional Perspective. 267-279. Amsterdam - New York, NY: Rodopi.

An important aspect of academic foreign language writing courses is assessing and grading the quality of students' writing products. This can be done by using holistic or analytical scales or by ranking. What is needed in the Dutch context is an instrument geared towards the specific objectives and context of our foreign language courses, which can help the teacher to assess students' written products with more validity and which can be used to assess students' progress over time. A joint project aiming at developing such an instrument has recently started at the departments of English and Spanish in Nijmegen, in The Netherlands.
The present article describes the first step towards developing the above-mentioned instrument: the set-up of two modest-sized "longitudinal" learner corpora, one for Spanish and one for English. These corpora will contain learner essays written under controlled conditions and on pre-defined topics. The first batch of student essays was collected in March 2002. Lexical and syntactic analyses of these essays will provide a unique insight into the development of the students' writing skills.
A first general quantitative analysis of the essays has already yielded a number of interesting observations. The article concludes with a tentative suggestion for a more elaborate instrument to relate student performance to teacher assessment.

van Esch, K., P. de Haan & M. Nas. El desarrollo de la escritura en inglés y español como lenguas extranjeras, in Estudios de Lingüística Aplicada, 22, 39: 53-79.

This article presents the first results of a research project on the writing of argumentative essays in English as a foreign language and in Spanish as a foreign language and as a native language, carried out at the University of Nijmegen between 2002 and 2005. The object of the project is to compare teachers' holistic assessments with the results of the analysis of syntactic and lexical features of the essays, to measure any development in writing skill from one year to the next, and to study the process of feedback and text revision. The first part discusses major aspects of writing, such as general writing competence, differences and similarities between native language writing and second language writing, and ways to analyse text features in second and foreign language texts. In the second part we present the methodology adopted, and the results of the projects that have been carried out so far.

2002

de Haan, P. The non-nominal character of spoken English, in Breivik, L.E. and A. Hasselgren (eds.) From the COLT's mouth ... and others'. 59-69. Amsterdam - New York, NY: Rodopi.

This article seeks to find confirmation for the claim (cf. Biber et al. 1998; Biber et al. 1999; de Haan 2001) that the syntactic differences between spoken and written English virtually all point to the same conclusion, viz. that the written variety has a strong nominal character, whereas the spoken variety has a strong verbal, or clausal character. In other words, the noun phrase, with its noun phrase functions, is the typical central unit of the structure of written English, whereas the clause, with its clause functions, is a far more typical unit of the structure of spoken English.
This article is based on the assumption that the non-nominal character of spoken English is shown in the relative absence of nouns ending in typical nominalisation suffixes like -ness, -ity, -ance, -ation, etc., as well as in the differences in syntactic make-up between NPs centred around these nouns in spoken and written English. The data have been collected from the BNC sampler CD-ROM, which comprises 1 million words of spoken English and 1 million words of written English.
It is shown that there is a cline from informal spoken language to informative writing, in that the non-nominal character of spoken English is most pronounced in informal texts, while more formal spoken interactions have a more nominal character than imaginative writing. Informative writing has the strongest nominal character.
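The kind of count this abstract relies on, the relative frequency of words ending in typical nominalisation suffixes, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure used in the article: the function name nominalisation_rate and the normalisation per 1,000 tokens are hypothetical choices, and matching on spelling alone over-counts (words such as "city" or "nation" also match), so a real study would restrict the count to tokens tagged as nouns and check the results by hand.

```python
# Suffixes named in the abstract; the list there ends in "etc.",
# so this tuple is not exhaustive.
SUFFIXES = ("ness", "ity", "ance", "ation")

def nominalisation_rate(tokens, per=1000):
    """Frequency of tokens ending in a listed suffix, per `per` tokens."""
    if not tokens:
        return 0.0
    # str.endswith accepts a tuple and is true if any suffix matches.
    hits = sum(1 for token in tokens if token.lower().endswith(SUFFIXES))
    return hits * per / len(tokens)

rate = nominalisation_rate("the happiness of the nation".split())
```

Normalising by text length makes it possible to compare the rate across spoken and written samples of different sizes.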

References

Biber, D., S. Conrad and R. Reppen. 1998. Corpus linguistics: Investigating language structure and use. Cambridge: Cambridge University Press.

Biber, D., S. Johansson, G. Leech, S. Conrad and E. Finegan. 1999. Longman grammar of spoken and written English. Harlow: Longman.

de Haan, P. Whom is not dead?, in Peters, P., P. Collins and A. Smith (eds.) New Frontiers of Corpus Research. Amsterdam - New York, NY: Rodopi.

This article reviews some of the recent literature on the ongoing discussion about the use of WHOM in spoken and written English. Some linguists claim that WHOM is artificially kept alive by prescriptive grammarians, and that it has virtually disappeared from the spoken language.
A corpus investigation on the occurrence of WHOM in a number of recent English corpora yields the following:

  1. WHOM is certainly not dead; it is not even dying.
  2. There is a certain amount of variation across text categories: on the whole it is especially the more formal text types where WHOM occurs relatively frequently, but it is by no means limited to writing.
  3. Its chief function is that of complement to a preposition. It is definitely the form to use when it is immediately preceded by the preposition it complements.
  4. A prominent role is played by of whom functioning as postmodifier in an NP headed by a numeral or a quantifier. These make up almost one fifth of the total number of occurrences of WHOM.
On the basis of these findings the author ventures the following predictions:
  1. WHOM will probably gradually disappear as an object relative pronoun, first in restrictive clauses, but ultimately also in non-restrictive clauses;
  2. WHOM is here to stay as a prepositional complement of a non-stranded preposition, particularly in non-restrictive clauses, where the relativizers THAT and ZERO are not available;
  3. the construction X of whom, where X stands for a numeral or a quantifier, will become the single most frequent syntactic structure involving WHOM.
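The distinctions drawn in the findings above (WHOM directly after a preposition, and the special case "of whom") can be illustrated with a rough sketch. Everything here is an assumption for illustration: the preposition set is deliberately small and not exhaustive, the function name classify_whom is hypothetical, and the "of_whom" count does not check that the preceding word is a numeral or a quantifier, which the article's construction X of whom requires.

```python
# Illustrative, non-exhaustive set of prepositions.
PREPOSITIONS = {"of", "to", "for", "with", "from", "by", "about"}

def classify_whom(tokens):
    """Count occurrences of 'whom' by the word that immediately precedes it."""
    counts = {"after_preposition": 0, "of_whom": 0, "other": 0}
    lowered = [token.lower() for token in tokens]
    for i, token in enumerate(lowered):
        if token != "whom":
            continue
        previous = lowered[i - 1] if i > 0 else ""
        if previous in PREPOSITIONS:
            counts["after_preposition"] += 1
            if previous == "of":
                counts["of_whom"] += 1  # subset of after_preposition
        else:
            counts["other"] += 1
    return counts

sample = "Two of whom left , the man to whom I spoke and whom I saw".split()
counts = classify_whom(sample)
```

In a tagged corpus the preposition test would use the word class tag of the preceding token instead of a word list.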

2001

de Haan, P. Aspects of the syntax of spoken English, in Aijmer, K. (ed.) A wealth of English. 47-56. Gothenburg: Gothenburg University Press.

This article presents the preliminary results of a study into a small number of syntactic characteristics of spoken language, based on the occurrence of frequent word class sequences. I have recently also used this technique in two experimental studies of English learner data (de Haan, 1997; de Haan, 1998).
This study confirms that very crude syntactic differences between spoken and written data can be established very easily on the basis of the occurrence of tags and tag sequences.
Secondly, this study shows quite clearly that spoken English has a tendency to be more "clausal" than written language, which is more "nominal". This tendency is clear not only from the occurrence of the single word classes, but especially from the word class sequences. As the sequences become longer they present more detailed syntactic structures, which confirm the findings based on the shorter sequences.
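Counting word class sequences of a given length can be sketched in a few lines. This is a minimal illustration, assuming the text has already been tagged; the tag names below are hypothetical placeholders and the function name tag_sequence_counts is not taken from the study.

```python
from collections import Counter

def tag_sequence_counts(tags, n):
    """Count every contiguous sequence of n word class tags."""
    # range() is empty when the text is shorter than n, giving an empty Counter.
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

# Hypothetical tag names; real input would come from a tagger.
tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
bigrams = tag_sequence_counts(tags, 2)
trigrams = tag_sequence_counts(tags, 3)
```

Comparing the most frequent sequences for n = 2, 3 and 4 across a spoken and a written sample gives the kind of crude contrast the abstract describes.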

2000

de Haan, P. Tagging non-native English with the TOSCA-ICLE tagger, in Mair, C. and M. Hundt (eds.) Corpus linguistics and linguistic theory. 69-79. Amsterdam: Rodopi.

The TOSCA-ICLE Tagging Unit (TU) has been in use for some time now to tag (part of) the material in several research centres participating in the ICLE project. The TU comprises among other things an automatic tagger, and a tag selection program, which can be used to correct tagger output (the tagger has a success rate of about 95 per cent with non-native material).
The tagger has been derived from the TOSCA tagger, which was originally designed for tagging error-free native English (as most taggers are), and was consequently trained on native material. Applying the TOSCA-ICLE tagger to the ICLE material presents us with a number of problems that can be said to be unique to learner material. No guidelines have been proposed in the "ICLE community" so far, as to how to deal with these problems. Guidelines would be needed, however, for those ICLE members who wish to apply the Tag Selection Program to produce error-free tagged versions of their ICLE material.
This article illustrates the kinds of errors learners are apt to make, on the basis of experience with Czech, Dutch and Spanish ICLE material. These include keyboard errors, spelling errors, lexical and grammatical errors. A tentative classification of these errors is presented, as well as a proposed way of dealing with them.

1999

de Haan, P. English writing by Dutch-speaking students, in Hasselgård, H. and S. Oksefjell (eds.) Out of Corpora. 203-212. Amsterdam: Rodopi.

In two recent studies (de Haan, 1997; de Haan, 1998) I have reported on experimental work involving the comparison of advanced learners' writing with native students' writing, in an attempt to gain more insight into the syntactic behaviour of advanced learners of English. Both studies were based on the investigation of tagged versions of four subsets of the ICLE corpus (cf. Granger 1993). Sequences of two, three or four tags were taken to be indicative of syntactic structures. A quantitative study of these sequences revealed certain differences, but also similarities, between the native students and the learners on the one hand, and among the three groups of learners on the other.
In the current article I make a comparison between the Dutch-speaking students and the native students, with a view to assessing the quality of the Dutch-speaking students' writing. Dutch and English are closely related languages and Dutch-speaking students of English have traditionally been expected to attain a "near-native" level of oral and written proficiency in English at universities both in The Netherlands and in Belgium.
The study of the ICLE data at both the microscopic and the macroscopic level suggests that differences in syntactic behaviour between Dutch and native students are due to a less extensive command of vocabulary by the Dutch learners; a less extensive command of vocabulary will inevitably lead to a less informative text, giving it a less sophisticated appearance. It is this aspect of writing that, more than the mechanics of the production of grammatically correct sentences, needs to be given more attention.

Reference

Granger, Sylviane. 1993. International corpus of learner English, in Aarts, J., P. de Haan and N. Oostdijk (eds.): English language corpora: Design, analysis and exploitation. Amsterdam: Rodopi. 57-69.

1998

de Haan, P. How 'native-like' are advanced learners of English?, in Renouf, A. (ed.), Explorations in corpus linguistics. 55-65. Amsterdam: Rodopi.

In this article an attempt is made at assessing the "near-nativeness" of advanced learners of English with respect to the syntactic structures that they use in their written production. This is done on the basis of a study of part of three learner corpora in the ICLE project, and part of the native student control corpus, which is also part of the ICLE project.
Although there is, as yet, no syntactically analysed learner corpus material available, some insight into syntactic structuring may be obtained from studying sequences of word class tags. The material studied has been tagged with an experimental version of the new TOSCA-ICLE tagger/lemmatizer. The tagged output was used to extract information about tag sequences.
It is shown that on the whole advanced learners' writing is not easy to distinguish from native students' writing with respect to the syntactic structures that have been established via the tag sequences. It is suggested that the differences that are observed between the advanced learners and the native students may be due to a lack of lexical knowledge of the learners, which leads to a difference in the way they use certain syntactic structures. The conclusion is that differences between learner and native material are more subtle than can be established on the basis of a study of tag sequences and that for a better comparison we would indeed need to use syntactically fully analysed material.

1997

de Haan, P. On the use of rank-frequency distributions, in Fries, U., V. Müller & P. Schneider (eds.), From Ælfric to the New York Times. 125-137. Amsterdam: Rodopi.

This article reports on two experiments on rank-frequency distributions. First an attempt is made at using rank-frequency distributions as a way of identifying different authors in a collection of texts, by means of correspondence analysis (ANACOR) plots.
Next, the problems involved in determining the relationship between a population and samples taken from this population are examined. It is shown that a rank-frequency distribution of a sample cannot be taken as a reliable estimate of the rank-frequency distribution of the population from which that sample has been taken.
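A rank-frequency distribution can be computed as in the minimal sketch below. The function name rank_frequency and the toy data are assumptions for illustration; the correspondence analysis step mentioned in the abstract is not reproduced here.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent type."""
    frequencies = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(frequencies, start=1))

population = "a b a c a b".split()
sample = population[:3]  # a small sample drawn from the same text

full_distribution = rank_frequency(population)
sample_distribution = rank_frequency(sample)
```

Even in this toy case the sample contains fewer types than the text it was drawn from, so its distribution has fewer ranks, which hints at why a sample distribution need not mirror that of the population.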

de Haan, P. An experiment in English learner data analysis, in Aarts, J., I. de Mönnink & H. Wekker (eds.), Studies in English language and teaching. 215-229. Amsterdam: Rodopi.

This article presents the first results of an attempt at analysing advanced English learners' performance data. Learner data of students of English with three different native languages are compared with each other and with native students' data. The aim is to establish typical learner usage patterns. This is done on the basis of word class combinations.
The data examined are part of ICLE, the International Corpus of Learner English. Four sets of data are examined: Dutch, Finnish, French and native students' data. It is shown that it is not possible to draw any firm conclusions on the basis of word class combinations, and that fully syntactically analysed material would be needed to study any typical learner usage patterns.
However, while it is shown that the native students make use of a greater vocabulary range, they are shown to have a tendency to use fewer different types of tag combinations. The differences between the natives and the learners become greater as the tag combinations become bigger. This suggests that the learners, even though they are very advanced, have not incorporated into their system a typical use of grammatical patterning to the same extent as the native speakers. This may be accounted for by a greater exposure of the natives to their language.

de Haan, P. Syntactic characteristics of dialogue and non-dialogue sentences in fiction writing, in Ljung, M. (ed.) Corpus-based studies in English. 101-117. Amsterdam: Rodopi.

In de Haan (1996) it is proposed that a detailed study be carried out into intra-textual variation within fiction texts, with a view to gaining more insight into the specific syntactic characteristics of dialogue and non-dialogue sentences. Previous studies (e.g. Biber and Finegan, 1986) convincingly show that on several dimensions fiction texts take a kind of middle position between more formal writing on the one hand, and spoken (face-to-face) conversation on the other. However, no account is taken, in these studies, of any possible intra-textual differences. More recently, it has been shown (de Haan, 1992) that the length and the clause patterns of dialogue sentences differ significantly from those of non-dialogue sentences in fiction texts.
Likewise, Biber and Finegan (1994), using an MF/MD analysis based on the study of a great number of syntactic features, have studied intra-textual variation within medical research articles. They show that there are significant differences between different sections of such articles. These differences can be interpreted in the light of the different communicative functions of the various sections.
In the present article it is shown, on the basis of a study of a number of syntactic characteristics in a number of fiction texts, that the differences between dialogue and non-dialogue sentences can be described more generally in syntactic terms. A number of author-specific characteristics that play a role in this are also examined.

References

Biber, D. and E. Finegan. 1986. 'An initial typology of English text types', in Aarts, J. and W. Meijs (eds), Corpus Linguistics II. 19-46. Amsterdam: Rodopi.

Biber, D. and E. Finegan. 1994. 'Intra-textual variation within medical research articles', in Oostdijk, N. & P. de Haan (eds), Corpus-based research into language. 201-221. Amsterdam: Rodopi.

de Haan, P. and H. Van Halteren. The TOSCA-ICLE Tagset. Software Manual. 24 pp. Nijmegen: TOSCA Research Group, Department of Language and Speech.

The TOSCA-ICLE tagger/lemmatizer is a software suite for tagging English text with wordclass tags from the TOSCA-ICLE tagset, designed for the International Corpus of Learner English (ICLE). The suite contains an automatic tagger/lemmatizer, a tagged file reformatter and a program for manual tag selection. This publication is the user manual for installation and use of the software suite.

1996

de Haan, P. More on the language of dialogue in fiction, in ICAME Journal, 20: 23-40.

This paper takes Nelleke Oostdijk's (1990) article 'The language of dialogue in fiction' as its starting point. Oostdijk presents her findings in support of the claim that dialogue in fiction has its own characteristics, which make it like, but at the same time also different from, spoken language. It is like spoken language, as it clearly reflects the author's attempt to represent spoken conversation. It is clearly unlike spoken language as it is planned, revised, and edited. Oostdijk assumes a continuous scale with planned discourse at one extreme and unplanned discourse at the other, suggesting that dialogue in fiction is found somewhere between these extremes.
Oostdijk presents an overall picture of five fiction samples from the TOSCA corpus, but suggests that a number of observations point to the existence of idiosyncratic variation among authors. It is at this point that I start, in an attempt to find out in which characteristics this idiosyncratic variation manifests itself most clearly. I look into the following aspects:

  1. reporting parts vs. reported parts
  2. clause patterns in dialogue passages
  3. sentence length in dialogue passages
  4. use of reporting verbs
  5. use of adverbials in reporting utterances
  6. clause patterns in reporting utterances
It is shown that there are considerable differences between dialogue and non-dialogue passages in seven corpus texts that were examined. At the same time, there is considerable variation among authors as to the use of reporting verbs, clause patterns and adverbials in reporting utterances. A number of suggestions for further research are given, notably the suggestion that research be carried out into syntactic patterns of spoken and written English.

Reference

Oostdijk, N. (1990), 'The language of dialogue in fiction', in: Literary and Linguistic Computing, Vol. 5: 235-241.

1995

Schils, E. & P. de Haan. Mortons Cusum ontrafeld, in Gramma/TTT, 3: 129-141.

In The Qsum Plot (Morton and Michaelson, 1990) it is claimed that cumulative sum charts (cusum charts, for short, or cusum plots) can be used to determine authorship of different text fragments or detect corruptions of texts. This is done by plotting the cusum charts of, for instance, the number of nouns of the sentences together with that of the total number of words in the sentences of a text fragment. This procedure is based on the assumption that authors have a certain, more or less constant "habit" in their use of, for instance, the number of nouns relative to the total number of words they use in the sentences they produce. It also assumes that the habits authors have are so idiosyncratic that even the slightest corruption can be detected. For instance, an insertion of five sentences by Hardy into 40 or so sentences by Scott can be readily shown.
Morton's method has been criticised recently in a number of publications. All of these express serious doubts as to the validity of the procedure. Indeed, our own corpus data suggest that there is considerable intra-author variation.
In this article, however, the main emphasis is on showing the unreliability of the cusum technique.
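The plotting procedure described above can be sketched in a few lines. This is only an illustration, assuming the common definition of a cusum as the running total of each sentence's deviation from the mean of the fragment; the function name and the per-sentence counts are hypothetical and are not taken from Morton and Michaelson or from the article.

```python
from itertools import accumulate


def cusum(values):
    """Return the cumulative sum of deviations from the mean of `values`."""
    mean = sum(values) / len(values)
    return list(accumulate(v - mean for v in values))


# Hypothetical counts for six consecutive sentences of one text fragment.
words_per_sentence = [12, 20, 7, 15, 30, 9]
nouns_per_sentence = [3, 5, 2, 4, 8, 2]

# The technique superimposes these two series (after rescaling one axis)
# and reads a divergence between them as a sign of mixed authorship.
word_cusum = cusum(words_per_sentence)
noun_cusum = cusum(nouns_per_sentence)
```

Because the deviations from the mean sum to zero, each series necessarily ends at zero; any judgement therefore rests on the shape of the two curves in between.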

Reference:

Morton, A. and S. Michaelson. 1990. The Qsum plot. Internal Report CSR-3-90. University of Edinburgh: Department of Computer Science.

1994

Oostdijk, N. & P. de Haan (eds.). Corpus-based research into language. Amsterdam: Rodopi.

The topics in this book are centred around the syntactic analysis of corpora. It was decided to focus on contributions that highlight recent developments in related work in the field of corpus linguistics done elsewhere. This means that the editors have taken the research areas pursued by the TOSCA Group in Nijmegen as a lead to the thematic arrangement of the contributions in this book.

Three parts can be distinguished:

  1. The encoding and tagging of corpora
  2. Parsing and databases
  3. Linguistic exploration of the data

de Haan, P. & E. Schils. The Qsum plot exposed, in Fries, U., G. Tottie & P. Schneider (eds.), Creating and using English language corpora. 93-105. Amsterdam: Rodopi.

In The Qsum Plot (Morton and Michaelson, 1990) it is claimed that cumulative sum charts (cusum charts, for short, or cusum plots) can be used to determine authorship of different text fragments or detect corruptions of texts. This is done by plotting the cusum charts of, for instance, the number of nouns of the sentences together with that of the total number of words in the sentences of a text fragment. This procedure is based on the assumption that authors have a certain, more or less constant 'habit' in their use of, for instance, the number of nouns relative to the total number of words they use in the sentences they produce. It also assumes that the habits authors have are so idiosyncratic that even the slightest corruption can be detected. For instance, an insertion of five sentences by Hardy into 40 or so sentences by Scott can be readily shown.
Morton's method has been criticised recently in a number of publications. All of these express serious doubts as to the validity of the procedure. Indeed, our own corpus data suggest that there is considerable intra-author variation.
In this article, however, the main emphasis is on showing the unreliability of the cusum technique.

Reference

Morton, A. and Michaelson, S. (1990), 'The Qsum plot', Internal Report CSR-3-90. University of Edinburgh: Department of Computer Science.

Oostdijk, N. & P. de Haan. Clause patterns in modern British English. A corpus-based (quantitative) study, in ICAME Journal, 18: 41-79.

In this article a research project is discussed which is carried out by the TOSCA Research Group. It aims to provide a survey of the frequency of occurrence and the distribution of a range of syntactic structures in Modern British English. The project makes use of the Nijmegen Corpus. This computerised corpus, comprising approximately 130,000 words, has undergone a detailed syntactic analysis and is available for explorative studies. This article reports on findings with regard to the clause patterns encountered in the material.

de Haan, P. Noun phrase structure as an indication of text variety, in Anglistik & Englischunterricht, 49: 85-106. (Special issue: The noun phrase in English: Its structure and variability).

This article discusses the relationships that exist among the type and structure of the noun phrase and its function and the text type in which it occurs. The discussion is based on an examination of the distribution of various noun phrase types in different functions in a collection of texts, the Nijmegen Corpus.
It is demonstrated that the functions of the noun phrases, as well as their combinations, play an important role in the distribution of noun phrase types and noun phrase structures. In addition, the type of text contributes to this distribution.
Moreover, on the level of the noun phrase postmodifier, it is demonstrated that both the actual number of postmodifiers and the different kinds of postmodifiers are distributed differently in the texts examined.
The general conclusion is that in the non-fiction texts more complex noun phrases occur than in the fiction texts. This difference is especially noticeable in subject noun phrases. However, underlying this more general tendency there are others, some of which reinforce this general tendency, whereas others partly cancel it. Although the differences in distribution are not dramatic, they are statistically significant and would warrant the conclusion that noun phrase structure is an indication of text type.

1993

de Haan, P. Sentence length in running text, in Souter, C. & E. Atwell (eds.), Corpus-based computational linguistics. 147-161. Amsterdam: Rodopi.

This article is an attempt to determine what distinguishes running text from random collections of sentences. In an experiment in which a 20,000-word sample was reduced to smaller stretches of words, de Haan (1992) showed that:

  1. Sentence length is not randomly distributed over the 20,000 words in the fiction texts in the Nijmegen corpus.
  2. This finding is related to the fact that the proportion of dialogue sentences in the consecutive stretches of 2000 words varies greatly: on the whole, the stretches with more dialogue sentences have a smaller average sentence length than those with fewer dialogue sentences.
  3. The relationship between the proportion of dialogue sentences and average sentence length is only an indirect one: a loglinear analysis of a three-way contingency table with the variables sentence length, sentence type (i.e. dialogue vs. non-dialogue) and sentence pattern suggests that dialogue sentences favour certain sentence patterns, which occur in shorter sentences because, on the whole, they have a simpler structure and do not, therefore, need as many words to develop as the more complex patterns that are found more generally in the non-dialogue sentences.
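
The comparison of consecutive stretches in the second point can be sketched as follows. This is a minimal illustration under stated assumptions: each sentence is represented by a hypothetical (length, is_dialogue) pair, a stretch is closed as soon as it reaches the word limit, and the function names are invented here; none of this is the procedure actually used in de Haan (1992).

```python
def summarise(stretch):
    """Return (proportion of dialogue sentences, mean sentence length)."""
    n = len(stretch)
    dialogue = sum(1 for _, is_dialogue in stretch if is_dialogue)
    mean_length = sum(length for length, _ in stretch) / n
    return dialogue / n, mean_length


def stretch_summary(sentences, stretch_words=2000):
    """Split sentences (in text order) into consecutive stretches of roughly
    `stretch_words` words and summarise each stretch."""
    summaries, current, words = [], [], 0
    for length, is_dialogue in sentences:
        current.append((length, is_dialogue))
        words += length
        if words >= stretch_words:  # close the stretch at a sentence boundary
            summaries.append(summarise(current))
            current, words = [], 0
    if current:  # final, possibly shorter, stretch
        summaries.append(summarise(current))
    return summaries
```

Listing the two figures per stretch side by side is enough to see whether stretches with more dialogue tend to have shorter sentences; it does not by itself show whether that relation is direct or indirect.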

The purpose of this article is to make a first step towards a better understanding of the nature of the relationships that hold between consecutive sentences in a running text. The main emphasis in this article is on the sentence length. It is considered by means of a number of different statistical techniques, viz. autocorrelations, moving averages and the mean squared jump.

Aarts, J., P. de Haan & N. Oostdijk (eds.). English language corpora: Design, analysis and exploitation. Papers from the 13th ICAME conference. Amsterdam: Rodopi.

This volume contains a selection of the papers read at the Thirteenth Conference on the Use of Computer Corpora in English Language Research (ICAME 13), which was held at Nijmegen, the Netherlands, in June 1992. The selection has been made so as to represent the three major activities involved in the use of computer corpora in linguistic research: the design and compilation of corpora, their grammatical analysis and their subsequent exploitation.
In his contribution to this volume, Gerry Knowles uses the word 'antiquated' with reference to the Lancaster/IBM Spoken English Corpus - a corpus that was compiled in the mid-1980s. This is just one of many indications that things are moving fast in the discipline of corpus linguistics. A volume of papers that aims to give an impression of the state of the art should therefore try to follow developments as closely as possible; not only in the chronological sense of trying to publish an account of the most recent research efforts but also in the sense of paying close attention to the day-to-day practice of corpus linguistic research. For that reason, many of the papers in this volume have the nature of progress reports of ongoing projects, while others discuss quite concretely the sort of problems the corpus linguist is confronted with in the everyday practice of his research.
Two general trends that can recently be observed in the international corpus linguistics community are also reflected in this collection of papers. One of these is in the field of corpus design. Since extremely large corpora have appeared on the scene (British National Corpus, Bank of English) the focus in corpus construction has shifted away from the smaller balanced corpus of contemporary English like LOB and Brown, to corpora that either contain some special variety of the contemporary language or texts from some earlier stage of the English language. Especially the latter type of corpus, coming in the wake of the Helsinki corpus, seems to be enjoying a new popularity. Another recent feature of corpus linguistic research has been the growing concern for standardization; this concern is also audible in some of the papers in this volume.
As the title suggests, the volume is divided into three sections, each dealing with one of the major aspects of corpus linguistic research: corpus compilation and design, corpus analysis and corpus exploitation.
In the section on corpus compilation the popularity of historically oriented corpora is shown by the fact that three papers report on the construction of such corpora. Kytö writes on the Corpus of Early American English, which is meant to supplement the Helsinki Corpus of English Texts, Lancashire reports on the Toronto-based Early Modern English Renaissance Dictionaries Corpus, while Wright's topic is the Cambridge Corpus of Early Modern English (1600-1800). Special varieties of contemporary English are contained in the corpora discussed by Fang, Granger, and Collot and Belmore. Fang's paper is a progress report on the compilation, carried out in Hong Kong, of a corpus of computer science texts. Collot and Belmore introduce a corpus of a whole new language medium: 'electronic language' as they call it, that is, the English of e-mail messages. Granger discusses her corpus of non-native, learner English, which is being compiled in the ICLE project (International Corpus of Learner English), a companion project of the ICE project. The British National Corpus is one of the recent undertakings aiming at the construction of an extremely large corpus. Burnage and Dunlop's paper provides an insight into the many practical problems that arise in a project of this size and also discusses some aspects of the question of standardization. Blackwell discusses the problems a corpus linguist has to cope with when faced with an influx of millions of words every month. Knowles, in the last paper of this section, deals with a rather different problem: that of converting the Lancaster/IBM Spoken English Corpus into a speech database known as MARSEC (Machine Readable Spoken English Corpus).
The corpus analysis section opens with a survey given by Eyes and Leech of the projects currently going on at UCREL, the Unit for Computer Research on the English Language at Lancaster University. They pay special attention to the role of the human analyst in their tagging and parsing system and discuss the question of how quality control should be realized in such a system. Van Halteren and Oostdijk describe the TOSCA analysis system developed at the University of Nijmegen, which was demonstrated at the Nijmegen ICAME Conference. Briscoe and Waegner address the problem of undergeneration, i.e. the fact that a grammar-based parser may not yield the contextually correct structure for some sentences in a corpus. They emphasize that a solution to this problem should be both robust and domain-independent. For any analysis system a grammar workbench is a much-needed if not indispensable tool for grammar writing. In their article, Nederhof and Koster discuss the workbench developed for grammars written in the AGFL (Affix Grammar over Finite Lattices) formalism and the ideas underlying it. AGFLs are used in the TOSCA analysis system. In the last paper of this section, Souter compares five parsed English corpora with full grammatical annotation and concludes that there is a serious need for prototype standards; these should accommodate current practice in corpus analysis and not make the analyst's job more difficult than it already is.
In the last section, various aspects of corpus exploitation are discussed: tools, aims and results. Quinn discusses the Corpus Utility Program (ICECUP) developed within the context of the ICE project, outlining the linguistic requirements such a tool has to meet and the software principles on which ICECUP is based. Concordance programs to investigate collocational patterns have been around for some time, but really huge corpora are a comparatively recent phenomenon; Collier shows that the increase in size imposes a different design on such programs and discusses what can be done about it. That for some aims of corpus exploitation the availability of huge corpora is a prerequisite is apparent from Barkema's corpus-based study of degrees of idiomaticity. He has found that even a 20-million-word corpus yielded insufficient data for such a topic. One type of corpus that has now become a reality is the monitor corpus. Monitor corpora are needed if one wants to investigate what new words are finding their way into the English language. Renouf reports on such research, which is now taking place in the Research and Development Unit for English Studies of the University of Birmingham. In two further papers the results of corpus exploitation are presented. Altenberg investigates the frequency and use of recurrent verb-complement constructions in the London-Lund Corpus of Spoken English, while Peters presents the findings of a study of some points of disputed usage on the basis of the Brown and LOB corpora. In the last paper of this volume Meijs discusses the use of a computerized lexical knowledge system for the analysis of nominal compounds.

Schils, E. & P. de Haan. Characteristics of sentence length in running text, in Literary & Linguistic Computing, 8: 20-26.

This article reports on the examination of the distribution of sentence lengths in a number of running texts. More specifically, the question we try to find an answer to is what the relation is between the lengths of adjacent sentences.
We start our examination on the basis of the hypothesis that authors may aim at an alternation of long and short sentences. The results of autocorrelation scores of sentence lengths, however, show that alternation of sentence lengths is not a consistent feature in any of the five corpus texts that we examined. Rather the opposite is true: two adjacent sentences are more likely to have similar lengths. This is especially true in the fiction texts.
We then discuss two different ways of assessing the degree to which adjacent sentences in running texts are more likely to have equal lengths than any random selection of two sentences, viz. the mean squared jumps (i.e. a measure derived from the difference in length between two adjacent sentences) and the autocorrelation of jumps.
Finally, we test whether within a text there are fragments where alternation of sentence lengths might occur even though the text as a whole was characterized, by the other tests, as displaying no alternation. This is done by looking at consecutive fragments of 15 adjacent sentences, and relating the number of fragments with a certain degree of negative autocorrelation to the number of fragments that can be expected to have a negative autocorrelation under randomness. The two fiction texts are shown to have more fragments with a negative autocorrelation than can be expected under a random sequence hypothesis. A number of patterns for these fragments are established. They are exemplified in the last section.
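
The two kinds of measure mentioned above can be illustrated with a short sketch. It assumes textbook definitions (lag-1 autocorrelation, and the mean of the squared differences between adjacent sentence lengths); the article's exact formulas are not given in this abstract, and the function names and sentence lengths below are hypothetical.

```python
def lag1_autocorrelation(lengths):
    """Lag-1 autocorrelation: negative values point to alternation of long
    and short sentences, positive values to similar adjacent lengths."""
    n = len(lengths)
    mean = sum(lengths) / n
    deviations = [x - mean for x in lengths]
    denominator = sum(d * d for d in deviations)
    numerator = sum(deviations[i] * deviations[i + 1] for i in range(n - 1))
    return numerator / denominator


def mean_squared_jump(lengths):
    """Mean of the squared differences in length between adjacent sentences."""
    jumps = [b - a for a, b in zip(lengths, lengths[1:])]
    return sum(j * j for j in jumps) / len(jumps)


# Hypothetical sentence lengths (in words) for two short fragments.
alternating = [5, 25, 5, 25, 5, 25]
clustered = [5, 5, 5, 25, 25, 25]
```

For the alternating fragment the autocorrelation is negative and the mean squared jump is large; for the clustered fragment, which contains the same six lengths in a different order, the autocorrelation is positive and the mean squared jump is much smaller.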

1992

de Haan, P. The optimum corpus sample size?, in Leitner, G. (ed.), New dimensions in English language corpora. Methodology, results, software development. 3-19. Berlin - New York: Mouton de Gruyter.

Numerous corpus studies have been carried out in the past two to three decades, many of them on the standard corpora of American and British English (Brown and LOB), both consisting of 500 samples of 2000 words each. One element that has so far not been given much attention is the possible effect of the size of the corpus samples on the research results. As the corpora were assumed to represent a broad cross-section of English, and the focus originally was mainly on frequency and distribution of lexical items, the samples were considered to be sufficiently large to yield reliable results.
More recently corpus linguists' interest has gone beyond the lexical level to the level of syntactic description. The compilation of corpora, in the meantime, has continued very much in the style of the first two corpora, implying that they still usually consist of samples of 2000 words. The question arises whether samples of 2000 words are sufficiently large to yield reliable information on the frequency and distribution of syntactic structures.
Experience with samples of 20,000 words has shown that on the whole these are sufficiently large to yield statistically reliable results on frequency and distribution (cf. e.g. de Haan 1989: 50-51), but even they are sometimes too small, especially when complex interactions are studied (cf. e.g. de Haan & van Hout 1986: 89), simply because the number of observations yielded is too small. The conclusion seems to be that the suitability of the sample depends on the specific study that is undertaken, and that there is no such thing as the best, or optimum, sample size as such.
An experiment was conducted, aiming to assess the impact of reducing the sample size on the outcome of certain statistical analyses. The conclusion is that there are certainly studies that can only be successfully carried out on the basis of larger samples. This is not necessarily only related to the number of observations, but also to the nature of the phenomenon studied. This would appear to be especially true for the study of complex interactions, which requires more sophisticated statistical techniques.

Reference

Haan, P. de and R. van Hout (1986) Statistics and corpus analysis: A loglinear analysis of syntactic constraints on postmodifying clauses, in Aarts, J. and W. Meijs (eds.), Corpus linguistics II. 79-98. Amsterdam: Rodopi.

1991

de Haan, P. On the exploration of corpus data by means of problem-oriented tagging: Postmodifying clauses in the English noun phrase, in Johansson, S. & A.-B. Stenström (eds.), English computer corpora. Selected papers and research guide. 51-65. Berlin - New York: Mouton de Gruyter.

This article is a report on some of the results of a project that was completed in 1989. The aim of the project was to give a detailed description of a number of syntactic properties of postmodifying clauses (henceforth PMCs) in the English noun phrase, and to look at the way in which some of these properties are related to each other.
As the study is based on an examination of corpus examples, only surface structures have been considered. The noun phrase is described basically in terms of four constituents:
Determiner Premodifier Head Postmodifier
The objects of study were those noun phrases in which the postmodifier consists of at least one clause.
The corpus that was used for this study is the Nijmegen corpus, which is available in database format. This corpus, which consists of approximately 130,000 words of running text, yielded 2430 postmodifying clauses in noun phrases. For each of these a numerical coding system was designed in which 26 different variable features were covered. A complete and detailed account of the way in which this was done can be found in De Haan (1984). The numerical codes were stored as a set of computer data which was subsequently processed by means of the statistical package SPSS.
In this article I discuss some of the implications of the methodology adopted, as well as a number of more general results. I discuss the clause patterns found, the position of the PMC in relation to its complexity and to the function of the noun phrase in which it occurs, the type of PMC in relation to the function of the noun phrase in which it occurs, the position of the various clause types, and, finally, I go into a few aspects of relative pronouns in relative clauses.

de Haan, P. TOSCA and beyond, in Jones, L.M. (ed.), Using Corpora. Proceedings of the seventh annual conference of the University of Waterloo Centre for the New OED and Text Research. 92-109. Waterloo: UW Centre for the New OED and Text Research.

This paper presents a brief overview of the activities that were undertaken by the TOSCA Research Group in the 1980s. The emphasis in Nijmegen has always been on the syntactic analysis of corpus data.
The first project in which the TOSCA group was involved entailed manual tagging of the data and a subsequent automatic syntactic analysis on the basis of a context-free grammar. The experience gained in this project led us to adopt a different approach, viz. of making the analysis an automated process, on the basis of the formalism of the Extended Affix Grammar. This formalism was used in various projects, involving different languages. A relatively large corpus of British English texts was compiled, especially designed for the study of language variation. For the exploration of the analysed material a linguistic database was designed, containing the analysed sentences in the form of tree structures.
The paper concludes with the presentation of two projects in which specific applications of corpus research are discussed. One of them aimed at a quantitative description of the most common syntactic structures in English, while the other was concerned with characteristics of running text.

1989

de Haan, P. Postmodifying clauses in the English noun phrase. A corpus-based study. Amsterdam: Rodopi.

This book reports on the results of a project whose aim was to give a detailed description of a number of syntactic properties of postmodifying clauses in the English NP, and to study the way in which some of these properties are related to each other. The study is based on an examination of corpus data. The structure of the NP is described basically in terms of four constituents:
Determiner Premodifier Head Postmodifier
The object of study was constituted by NPs in which the postmodifier contains at least one clause.
The corpus used for this study is the Nijmegen Corpus. This corpus, which consists of approximately 130,000 words of running text, yielded 2,430 examples of postmodifying clauses in NPs. A numerical coding scheme was designed in which 26 different variable features were encoded, to characterise the postmodifying clauses. These were processed and analysed by means of SPSS.
The book discusses the methodology adopted and the results of the statistical analyses. The results relate to:

1988

de Haan, P. A corpus investigation into the behaviour of prepositional verbs, in Kytö, M., O. Ihalainen & M. Rissanen (eds.), Corpus linguistics, hard and soft. 121-135. Amsterdam: Rodopi.

This article presents a report on an investigation into the behaviour of so-called prepositional verbs in English. Three aspects are looked into, viz. the occurrence of prepositional verbs in passive verb phrases, the occurrence of adverbial insertion between the verb and the preposition, and the position of the preposition in relative clauses.
It is argued that the fact that combinations of verbs and prepositions occur in passive verb phrases is the only justification for their classification as prepositional verbs in English. Nevertheless, the question remains whether within the group of prepositional verbs a distinction can be made between combinations that are more close and those that are less close. This distinction may be visible in their behaviour in the three syntactic environments mentioned above.
The investigation of a small amount of corpus data shows us that there are certain tendencies in this respect. However, it appears that a larger amount of corpus data would have to be examined before we can draw any definite conclusions.

1987

de Haan, P. Relative clauses in indefinite noun phrases, in English Studies, 68: 171-189.

The discussion of relative clauses is often limited to that of relative clauses in definite NPs. Some attention is usually paid, especially in introductory works on practical English grammar, to the distinction between restrictive and non-restrictive clauses, but hardly any attention is paid to the occurrence of relative clauses in indefinite NPs.
This article makes a contribution to the insight, within descriptive linguistics, into the factors that influence the occurrence of relative clauses in indefinite NPs. The purpose of the article is twofold. It shows that, in general, the function of modifiers in indefinite NPs is different from that of modifiers in definite NPs. It also shows that in the case of relative clauses there are a number of additional factors influencing their occurrence in definite and indefinite NPs. In particular, the NP head, its reference and function in the superordinate structure play a role. Also the relative pronoun and the clause pattern in the relative clause itself are seen to be affected by the indefinite nature of the NP.

de Haan, P. Exploring the linguistic database: Noun phrase complexity and language variation, in Meijs, W. (ed.), Corpus linguistics and beyond. 151-165. Amsterdam: Rodopi.

Since Aarts (1971) no attempt has been made to demonstrate the relationship between the distribution of NP type and text variety, although Aarts hinted at the existence of such relationships. A more complex analysis of corpus material to investigate this relationship has recently become possible thanks to the Linguistic DataBase and the loglinear analysis.
An analysis of the written part of the Nijmegen corpus (approximately 120,000 words) shows the existence of complex relationships between NP type, NP function and text variety. Four NP types are distinguished, ranging from basic to complex, while three NP functions are distinguished: subject, direct object and prepositional complement. The two text types distinguished are fiction and non-fiction. The relationships between these variables are shown to be highly significant.
Apart from the outcome of the statistical analysis of the data, the article provides an excellent example of the interactive use of the Linguistic DataBase.

Reference:

Aarts, F. 1971. On the distribution of noun-phrase types in English clause-structure, in Lingua, 26: 281-293.

This page is maintained by Pieter de Haan <P dot deHaan at-sign let dot ru dot nl>