How Far Can Lexicometry Take Us?

Nov 13

Notes from a Historian’s Perspective

Albin Wagener, a linguist working in the fields of discourse theory and digital humanities, develops in Discours et système (Peter Lang, 2019) a model that treats discourse as an interconnected system of interactions and semiotic practices. Historians can recognize the value of this undertaking, since the book tries to explain how meaning circulates through networks of actors and texts rather than through isolated statements. At the same time, the framework leans toward abstraction in ways that can make historical texture harder to recover; contingency often appears only at the margins. The lexicometric tools he employs reveal patterns that might otherwise remain obscure, though they risk reducing complex sources to quantifiable fragments and obscuring the conditions under which those sources were produced. His approach is most helpful when it prompts readers to notice relationships among representations; it is less persuasive when it treats discourse as a self-regulating environment rather than as a historically situated field shaped by conflict. A charitable reading sees the book as a generative theoretical resource, provided one adapts its insights to the evidentiary and contextual demands that guide historical scholarship.

IRaMuTeQ is an open-source environment that connects R to a set of lexicometric routines. (As a reminder, R is an open-source programming language and software environment designed for statistical computing. It emerged in the early 1990s as a free implementation of the S language, and it has since become central in fields that rely heavily on quantitative inquiry.) It segments texts into units, counts surface forms and lemmas, measures their co-occurrence, then arranges those counts through clustering and factor analyses. In Wagener’s recent article on French republicanism as a “state religion,” IRaMuTeQ underpins the empirical work: it produces word clouds that foreground République, républicain, and laïcité, and it generates thematic classes in which Islam, youth, violence, schooling, and republican principles appear as structured clusters. The article then reads these structures as evidence for a broader discursive configuration in which the Republic functions as a doctrinal system. That use case offers a convenient vantage point for thinking about the suitability of IRaMuTeQ-style technologies for discourse analysis.
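For readers who want to see what "segment, count, and measure co-occurrence" amounts to in practice, here is a minimal Python sketch. It is not IRaMuTeQ's own code, and it skips lemmatization, clustering, and factor analysis entirely. The sample text, the fixed segment length of five tokens, and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations

def tokenize(text):
    # Lowercase the text and keep runs of letters (accented letters included).
    return re.findall(r"[^\W\d_]+", text.lower())

def segment(tokens, size):
    # Cut the token stream into consecutive fixed-length segments.
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def cooccurrence(segments):
    # Count, for each pair of distinct forms, how many segments contain both.
    pairs = Counter()
    for seg in segments:
        for a, b in combinations(sorted(set(seg)), 2):
            pairs[(a, b)] += 1
    return pairs

# Hypothetical two-sentence corpus, invented for this example.
text = ("La République garantit la laïcité. "
        "La laïcité protège la République et l'école.")

tokens = tokenize(text)
segments = segment(tokens, 5)
frequencies = Counter(tokens)
pairs = cooccurrence(segments)

print(frequencies.most_common(3))
print(pairs[("laïcité", "république")])
```

Everything the later stages of such a pipeline work with is already fixed at this point: a table of forms and a table of pairs. Speaker, addressee, and genre have no column in either table.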

These tools, it must be said, achieve a certain descriptive clarity. They make it possible to see at a glance which terms dominate a corpus and which terms tend to travel together. They draw visible boundaries between sub-corpora, revealing that some vocabularies are concentrated in state texts, while others belong more to media or movement sources. In a study such as Wagener’s, this allows the author to say with some confidence that references to laïcité outstrip references to, say, liberté, or that Islam regularly appears in proximity to notions of violence and youth in French media. Framed in the most charitable light, this software disciplines intuition and memory because it forces the researcher to confront the actual distribution of words rather than an impressionistic sense of emphasis.

Once the analysis moves from lexical patterning to discourse in a thicker sense, the fit between tool and object grows more fragile. Many discourse-analytic traditions treat discourse as a set of situated social practices. They are interested in who speaks, to whom, and under what institutional constraints. They attend to how speakers position themselves and their audiences, how they claim authority or disclaim it, how they frame adversaries or allies. IRaMuTeQ has no access to those relational and pragmatic features. It parses a presidential speech, a prefectural charter, and a critical op-ed as sequences of segments and tokens. A sentence that orders citizens to “respect the values of the Republic” and a sentence that reports that “the government claims citizens must respect the values of the Republic” generate almost identical lexical information. Modality, reported speech, evaluation, and stance vanish in the process. The algorithm can show that “respect,” “valeurs,” and “République” cluster; it cannot determine whether a given passage enacts obedience, irony, distance, or outright dissent.
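The point about the command and the report can be checked on a toy scale. The sketch below is an illustration in plain Python, not a reproduction of IRaMuTeQ's processing; the two English sentences are my own paraphrases of the examples above, and the Jaccard overlap is simply one convenient way of comparing two vocabularies.

```python
import re
from collections import Counter

def bag_of_words(sentence):
    # Reduce a sentence to counts of lowercase word forms; order and syntax are lost.
    return Counter(re.findall(r"[^\W\d_]+", sentence.lower()))

def jaccard(a, b):
    # Share of distinct word forms that the two bags have in common.
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

# Hypothetical sentences: one issues the command, the other reports it at a distance.
order = "Citizens must respect the values of the Republic."
report = "The government claims citizens must respect the values of the Republic."

bag_order = bag_of_words(order)
bag_report = bag_of_words(report)

only_in_order = set(bag_order) - set(bag_report)
only_in_report = set(bag_report) - set(bag_order)
overlap = jaccard(bag_order, bag_report)

print(only_in_order)
print(only_in_report)
print(round(overlap, 2))
```

Every form in the command also appears in the report. The distancing frame survives only as two extra forms, "government" and "claims," and nothing in the counts marks them as the words that change the speech act.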

I think this reduction creates a temptation. Once a clustering algorithm has produced classes, and once those classes appear on a factor map, it becomes very easy to relabel them as “narratives,” “themes,” or “ideological cores.” The graphical display encourages that move. A dense cluster of segments in which Islam, youth, and violence co-occur sits near the center of a plot. The eye draws a conclusion: this must be the central narrative around which everything else turns. In Wagener’s article, that visual centrality helps justify the claim that controversies around Islam and youth form the axis of contemporary republican discourse. Yet the underlying mathematics does not know anything about narrative centrality or ideological salience. It knows only variance and proximity in a high-dimensional lexical space. The semantic interpretation arrives later and remains an interpretive act, even when it is accompanied by quantitative output.

Another difficulty arises from the way IRaMuTeQ flattens genre and institutional logic. In Wagener’s corpus, presidential allocutions, ministry guidance documents, prefectural “Charters of Republican Values,” media reports, and polemical columns all enter the same analytical pipeline. Each of these genres has its own constraints and routines. A charter is expected to sound prescriptive and schematic. A journalist often aims for condensed summary or vivid quotation. A president performs solemnity and authority. When IRaMuTeQ pools all of these, it treats their vocabularies as equally diagnostic of “the discourse” on the Republic. Any shared lexical field then risks being read as an ideological formation, even when part of that field might arise from genre conventions or bureaucratic style. Administrative boilerplate has a certain cadence in any political system. Lexicometry can easily transform those recurrent formulae into proof of a catechism, without pausing to ask how much of the pattern belongs to the specific political culture and how much to institutional routine.

The selection of a themed corpus accentuates this problem. Tools like IRaMuTeQ work on the material they are given. When a researcher compiles only texts that explicitly discuss “values of the Republic,” the software will faithfully reflect that choice. It will show a field saturated with references to principles, respect, laïcité, rights, and duties, because the corpus was assembled to foreground that very vocabulary. The tool cannot speak to what lies outside the frame: speeches in which the Republic is mentioned in passing, local debates where other idioms dominate, everyday talk in which republican language never appears at all. The danger here is circularity. One builds a corpus around highly codified, self-conscious invocations of the Republic. One then uses IRaMuTeQ to demonstrate that republican discourse is highly codified and self-conscious. The software has done its job; the design of the corpus has done much of the substantive work, quietly, in advance.
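The circularity is easy to reproduce in miniature. In the sketch below, which uses four invented sentences and a hypothetical selection rule, the "finding" that the themed corpus is saturated with republican vocabulary follows directly from the rule used to build it. Nothing here is drawn from Wagener's actual corpus.

```python
# Hypothetical mini-archive: two texts invoke republican values, two do not.
corpus = [
    "Les valeurs de la République exigent le respect de la laïcité.",
    "Le conseil municipal a voté le budget des transports.",
    "La charte rappelle les valeurs de la République et les devoirs.",
    "Le marché du samedi change de place cet été.",
]

def share_mentioning(docs, term):
    # Fraction of documents that contain the term (case-insensitive substring match).
    term = term.lower()
    return sum(term in d.lower() for d in docs) / len(docs)

# Assumed selection rule: keep only texts that explicitly discuss the theme.
themed = [d for d in corpus if "valeurs de la république" in d.lower()]

share_full = share_mentioning(corpus, "république")
share_themed = share_mentioning(themed, "république")

print(share_full, share_themed)
```

The jump from half of the archive to the whole of the themed corpus says nothing about republican discourse. It measures the selection rule.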

A further limitation concerns scale and granularity. IRaMuTeQ excels at identifying recurrent lexical associations in large data sets. It has more difficulty with subtle shifts of meaning within a single term. In the French political lexicon, laïcité can carry quite different senses: legal neutrality of the state, militant secularism, or a more recent exclusionary stance toward “visible” religion (let the reader understand). Those distinctions often emerge in collocations, in argument structure, or, most obviously, in context. They also depend on the speaker’s political position and the interlocutor addressed. IRaMuTeQ can register the frequency of laïcité and the words that appear near it, but it struggles to distinguish a usage that signals openness from one that signals hostility unless those attitudes are made explicit in nearby vocabulary. The result is a flattened conceptual field that risks masking internal tensions within the very terms that matter most.

There is also the question of what vanishes completely from view. Discourse analysts often track phenomena that do not lend themselves easily to token counts: the layering of voices in a text, the way an author ventriloquizes adversaries, the use of ellipsis or strategic silence, the interplay between written and visual elements, the way slogans echo across sites and are then re-appropriated. IRaMuTeQ works only on the textual material it can parse. It does not register the absence of certain words where their presence might be expected. It does not register a gesture or an ironic tone of voice. In a field like the politics of laïcité, those non-lexical cues often carry much of the affective and ideological charge. A purely lexicometric approach risks missing that charge or misreading it.

None of this renders IRaMuTeQ useless for discourse analysis. The tool can play a constructive role as a first pass over a corpus. It can falsify claims that a term is “everywhere” when it appears rarely. It can reveal that certain associations are systematically stronger in one institutional register than in another. It can help identify segments that deserve closer reading because they sit at the intersection of multiple clusters. Problems arise from the impulse to treat lexicometric output as sufficient. Once a researcher begins to equate “discourse” with frequency tables and dendrograms, the scope of the inquiry narrows. Questions that cannot be answered with counts and clusters may no longer be asked with the same energy.

A more cautious posture would treat IRaMuTeQ as one instrument in a wider methodological ensemble. Lexicometric mapping can inform case selection and can challenge unexamined assumptions. Close reading can then unpack argumentative moves in those cases. Ethnographic or interview-based work can trace how texts are received or contested (or ignored). In a study of republican rhetoric, this would mean moving from the bird’s-eye view that IRaMuTeQ affords to the situated perspectives of teachers, students, civil servants, and members of the communities most often spoken about. The technology can help (very provisionally) outline the terrain. It cannot substitute for the work of walking through it, listening to the voices that inhabit it, and attending to the silences that lexical statistics leave untouched.

Keanu Heydari

Keanu Heydari is a historian of modern Europe and the Iranian diaspora.

https://keanuheydari.com
