TAACO: Tool for the Automatic Analysis of Text Cohesion
Cohesion is a crucial element for understanding texts, particularly for challenging texts that present knowledge demands to the reader (Loxterman, Beck, & McKeown, 1994; McNamara, Kintsch, Songer, & Kintsch, 1996; McNamara & Kintsch, 1996). Hence, measuring cohesion is an important element of discourse processing research (McNamara, Louwerse, McCarthy, & Graesser, 2010). However, freely available natural language processing (NLP) tools that measure linguistic features related to text cohesion are limited.
The Tool for the Automatic Analysis of Text Cohesion (TAACO) is a freely available text analysis tool that is easy to use, works on most operating systems (Windows, Mac, Linux), runs locally from the user’s hard drive (rather than through an internet interface), allows batch processing of text files, and incorporates over 150 classic and recently developed indices related to text cohesion. The cohesion indices reported by TAACO focus on three areas: local cohesion, global cohesion, and overall text cohesion. Local cohesion refers to cohesion at the sentence level (i.e., cohesion between smaller chunks of text), while global cohesion refers to cohesion between larger chunks of text (usually paragraphs). Overall text cohesion refers to the incidence of cohesion features across an entire text, such as lexical diversity, rather than to comparisons between parts of the text. Many TAACO indices incorporate a part-of-speech tagger from the Natural Language Toolkit (Bird, Klein, & Loper, 2009) and synonym sets (synsets) from the WordNet lexical database (Miller, 1995).
TAACO is available here: Tool for the Automatic Analysis of Cohesion
Indices reported by TAACO
- Sentence overlap: These are indices of local cohesion that calculate lemma overlap between adjacent sentences: all lemma overlap, content word lemma overlap, lemma part of speech overlap.
- Paragraph overlap: These are indices of global cohesion that calculate lemma overlap between adjacent paragraphs and between three adjacent paragraphs: all lemma overlap, content word lemma overlap, lemma part of speech overlap.
- Semantic overlap: These are indices of both local and global cohesion that use the WordNet database to measure the overlap of words and their synsets between sentences and between paragraphs.
- Givenness: These are indices of text cohesion that calculate the number of pronouns, pronoun types (i.e., first, second, third, subject, quantity), definite articles, and demonstratives.
- Type-token ratio: These are indices of text cohesion that measure the repetition of words in the text by dividing the number of unique words (types) by the total number of words (tokens).
- Connectives: These are indices of local cohesion that count (1) positive and negative connectives and (2) temporal, additive, and causative connectives.
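To make a few of the indices above concrete, here is a minimal Python sketch of a type-token ratio, an adjacent-sentence overlap score, and connective counts. It is an illustration under stated assumptions, not TAACO's implementation: it compares lowercased word forms rather than lemmas, uses a naive regex tokenizer and sentence splitter, and the connective lists and all function names are hypothetical stand-ins.

```python
import re

# Hypothetical, tiny connective lists for illustration only;
# they are NOT the lists TAACO uses.
CONNECTIVES = {
    "additive": {"and", "also", "moreover"},
    "causative": {"because", "so", "therefore"},
    "temporal": {"then", "after", "before"},
}


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def type_token_ratio(text):
    """Number of unique words (types) divided by total words (tokens)."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def adjacent_sentence_overlap(text):
    """Mean number of word types shared by each pair of adjacent sentences.

    Assumption: raw word forms stand in for the lemmas a real tool would use.
    """
    sentences = [set(tokenize(s)) for s in split_sentences(text)]
    pairs = list(zip(sentences, sentences[1:]))
    if not pairs:
        return 0.0
    return sum(len(a & b) for a, b in pairs) / len(pairs)


def connective_counts(text):
    """Count tokens that appear in each (hypothetical) connective category."""
    tokens = tokenize(text)
    return {
        label: sum(token in words for token in tokens)
        for label, words in CONNECTIVES.items()
    }


if __name__ == "__main__":
    sample = (
        "The cat sat on the mat. "
        "The cat slept because the mat was warm. "
        "Then the dog barked."
    )
    print("TTR:", round(type_token_ratio(sample), 3))
    print("Adjacent sentence overlap:", adjacent_sentence_overlap(sample))
    print("Connectives:", connective_counts(sample))
```

A paragraph-level (global) version of the overlap score could reuse the same set intersection, splitting on blank lines instead of sentence boundaries.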
References/further reading
Bird, S., Klein, E., & Loper, E. (2009). Natural Language Processing with Python. O’Reilly Media Inc.
Loxterman, J. A., Beck, I. L., & McKeown, M. G. (1994). The effects of thinking aloud during reading on students’ comprehension of more or less coherent text. Reading Research Quarterly, 353-367.
Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39-41.
McNamara, D. S., Kintsch, E., Songer, N. B., & Kintsch, W. (1996). Are good texts always better? Interactions of text coherence, background knowledge, and levels of understanding in learning from text. Cognition and Instruction, 14, 1-43.
McNamara, D. S., & Kintsch, W. (1996). Learning from text: Effects of prior knowledge and text coherence. Discourse Processes, 22, 247-288.
McNamara, D. S., Louwerse, M. M., McCarthy, P. M., & Graesser, A. C. (2010). Coh-Metrix: Capturing linguistic features of cohesion. Discourse Processes, 47, 292-330.