What is novelty, and how can we measure it? Neither question is small or straightforward. Though philosophers, scientists, and artists have long reflected on the subject, we have yet to define novelty precisely enough to measure it, let alone to measure it at scale. Yet novelty is an explanatory factor in many accounts of literary historical periodization, particularly in the 20th century. Literary modernism was known for its novelty and innovation, both in response to the rather uniform Victorian literature that preceded it and in relation to quick and widespread cultural change. While scholars of digital humanities have computationally analyzed questions of literary change and difference, there has been less attention to moments of innovation or novelty: that is, the moments in which change is introduced and difference is created.

As a result of the ongoing collaboration between myself, Devin Higgins (MSU Libraries) and Arend Hintze (Integrative Biology, Computer Science & Engineering), we have developed a measurement for literary novelty, applying methods from metagenomics to the study of literary texts. This method allows us to 1) uncover the textual features that distinguish a text as novel, 2) compare patterns in novelty across texts, and 3) propose a typology of literary novelty in the 20th century. Pivoting between close and distant reading enables us to understand the intricate dynamics of novelty both as form and as symbolic capital.

Phase I, complete (August 2017): we developed the measure and tested it on both individual texts and on small corpora of texts. This first phase has offered suggestive preliminary results, but needs further corroboration with a larger number of literary texts.

Phase II (August 2017-June 2018): we will replicate our original study with a much larger corpus of in-copyright, post-1923 novels in order to further understand both the relationship of our method to the experience of reading highly novel texts, and the patterns of difference that emerge across period, canonicity, authorship, and prestige. This work is being supported by a HathiTrust Advanced Collaborative Support (ACS) Grant. We were thrilled to be selected for an ACS, and especially honored to be included with such fantastic projects.


The Measurement

Recently, digital humanists have turned to biological methods to consider how literary mutation occurs. In order to measure literary novelty, we have likewise applied biological methods to the study of literature, drawing from metagenomics, a field of biology. Metagenomics is also concerned with identifying moments of novelty and subsequent change: it sequences large amounts of genetic material to determine the diversity and nature of species living at a particular place on earth. Material must be assessed quickly to determine whether a sample contains genetic material that has not been seen before or, if known, to what species it belongs. This process is conducted using a Bloom Filter, which quickly assesses whether a sequence of nucleotides has been seen before or not. This tool can be used on sequences of letters (mers) from arbitrary sources, including literary texts. Using the Bloom Filter, we are able to get (much closer to) an objective measurement of literary novelty, by which we can compare texts against one another, both individually and within groups.

The Bloom Filter allows us to identify moments of similarity and moments of difference, and also to calculate the relative degree of novelty exhibited in those differences. The Bloom Filter scans the entirety of a text, tracking the repetition of language at incredibly small, precise intervals called k-mers; in our case, a fixed 12-character window called a 12-mer. Punctuation and spaces are included in each k-mer, and each character is encoded according to its ASCII value using a modulo-32 operation, which reduces the number of bits required to encode each character from 8 to 5. The Bloom Filter moves k-mer by k-mer through the text, advancing one character at a time.

The Filter develops a growing database of 12-mers; each new k-mer is assessed for its presence (novelty score of 0, i.e., not novel) or absence (novelty score of 1, i.e., novel) in the database, and k-mers that are not yet in the database are added. The process is repeated over the course of an entire text, assigning a score of 0 or 1 to each k-mer. Thus, a text will remain 100% novel until the point at which a k-mer is repeated; that repeated k-mer is assigned a 0. Consider a scan of Gertrude Stein’s “Rose is a rose is a rose.” While at the beginning the term “rose” is new, it will become less novel the more it is used, and would only be perceived as novel if the context changed. The Bloom Filter thus assesses each k-mer’s novelty; we aggregate these answers over 10,000 mers to get a fraction of novelty for each section of a text.
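The scan described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not our production code: the filter size, the number of hash functions, the use of BLAKE2 hashing, and all names (`BloomFilter`, `encode`, `kmer_scores`, `novelty_fractions`) are hypothetical choices made for the example. Only the 12-character window, the modulo-32 encoding, the one-character step, and the 10,000-mer aggregation come from the description above.

```python
import hashlib

K = 12            # k-mer length, as described in the text
WINDOW = 10_000   # number of k-mers aggregated into one novelty fraction


class BloomFilter:
    """Small Bloom filter. Bit count and hash count are illustrative assumptions."""

    def __init__(self, num_bits=1 << 24, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item: bytes):
        # Derive several bit positions by salting one hash function differently.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(item, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add_if_new(self, item: bytes) -> bool:
        """Insert item and return True if it was (probably) not seen before."""
        is_new = False
        for pos in self._positions(item):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                is_new = True
                self.bits[byte] |= 1 << bit
        return is_new


def encode(text: str) -> bytes:
    """Reduce each character to 5 bits via its code point modulo 32."""
    return bytes(ord(c) % 32 for c in text)


def kmer_scores(text: str, k: int = K) -> list:
    """Score every k-mer, sliding one character at a time: 1 = novel, 0 = seen."""
    bloom = BloomFilter()
    codes = encode(text)
    return [1 if bloom.add_if_new(codes[i:i + k]) else 0
            for i in range(len(codes) - k + 1)]


def novelty_fractions(scores: list, window: int = WINDOW) -> list:
    """Average the 0/1 scores over consecutive windows of k-mers."""
    chunks = (scores[i:i + window] for i in range(0, len(scores), window))
    return [sum(chunk) / len(chunk) for chunk in chunks]
```

Calling `novelty_fractions(kmer_scores(text))` on the full text of a novel yields one novelty fraction per 10,000 k-mers. Two properties of this sketch are worth noting. First, a Bloom filter can occasionally report an unseen k-mer as already seen (a false positive), so scores are very slightly biased toward 0 unless the filter is sized generously. Second, taking the ASCII value modulo 32 maps upper- and lowercase letters to the same code (for example, "R" and "r" both become 18), so the sketch is effectively case-insensitive.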

Preliminary Results

  1. Patterns within individual texts (Intratextual Novelty)

We begin by scoring the novelty of individual texts using the Bloom Filter. These scores can then be visualized, providing us with a general shape of novelty from cover to cover and revealing the course of novelty throughout the text. We quantify how much novelty decays (the slope) and how much novelty deviates from the best fit slope (R^2). We refer to these visualizations as a text’s “intratextual novelty.” As identified by Michael Levenson, intratextual novelty includes “all those relations threaded within the boundaries of the artifact itself” (669). The Bloom Filter allows us to quantify the ways in which novelty is deployed, or language is reinvented, within the text as a self-contained totality.
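As a rough illustration of these two statistics, the sketch below fits a straight line to a sequence of per-section novelty fractions and reports its slope and R^2. We assume here an ordinary least-squares fit against section index; the function name `fit_line` and the sample values in the usage note are hypothetical and are not results from our corpus.

```python
def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...

    Returns (slope, intercept, r_squared). Assumes at least two points.
    """
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two novelty fractions")
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx                      # rate of novelty decay per section
    intercept = mean_y - slope * mean_x

    # R^2: share of the variance in novelty explained by the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot else 1.0
    return slope, intercept, r_squared
```

For a made-up, perfectly steady decay such as `[1.0, 0.9, 0.8, 0.7]`, the fit gives a slope of about -0.1 and an R^2 close to 1. A made-up jagged series such as `[1.0, 0.5, 0.9, 0.4]` also slopes downward but has a much lower R^2, because the points scatter widely around the fitted line.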

A text or passage that is completely non-repetitive will remain novel throughout. Conventional texts will reveal a steady rate of decay. Texts that alternate between repetition and variation will register changes in novelty. Middlemarch, for example, has a steady decay in novelty without any moments of substantial change. By contrast, Ulysses shows changes in novelty throughout, corresponding to known instances of formal change or innovation within the book, resulting in the jagged decay line below. These results illustrate an alignment between our Bloom Filter novelty measure and the reader’s understanding of the text.

  2. Comparing between texts (Intertextual Novelty)

After treating each novel as a self-contained system, measuring the ways that it uses and reuses language internally, we then focused on “intertextual novelty,” in order to compare how novelty differs between texts and groups of texts. We rely on two primary metrics: R^2 and slope. Neither can be taken in isolation, but together these two measures are surprisingly descriptive, detailing both the degree of variation in novelty (R^2) and the continuity of novel moments (slope) in a text. These novelty scores allow us to compare texts to one another and to consider macro-level questions. By graphing the R^2 value against the slope, we established a baseline for “typical novelty” given our limited corpus (that is, the way that most texts deploy novelty internally) while accounting for variation in textual patterns. We compared the intratextual novelty of Text A against the intratextual novelty of Text B. Our next step was to run the Bloom Filter against a corpus of texts, graphing each according to its R^2 value (x-axis) and slope (y-axis). This informs us about the distribution of novelty within each category, and begins to answer the question of how literary periods differ with respect to novelty, as well as how novelty might relate to canonicity and prestige.

We find that texts from the Victorian period are very similar to one another, sharing a common slope and R^2. Within modernism, on the other hand, we find greater differences; some texts resemble Victorian texts, while some, like Ulysses and The Sound and the Fury, exhibit a high level of intratextual novelty. In addition, there is a marked difference between the ways that these periods deploy novelty structurally. In other words, modernist texts are more continuously novel than Victorian texts. This initial result supports common disciplinary claims about novelty as a defining characteristic of modernist literature.

However, we find modernist and postwar texts to be significantly different along both measures. Postwar texts, in fact, exhibit greater internal variance and more continuous novelty than do modernist texts. This challenges the notion that postwar literature is simply a continuation of the modernist project, showing a quantifiable change and encouraging a deeper investigation, which would be impossible without support from HTRC.

And finally, a graph for fun: 

All of our texts, graphed on top of one another, in technicolor.