How can we tell whether Dickens is the author of the stories we have transcribed? Dr Andrea Nini is a forensic linguist who has been working with us on your transcriptions from Arthur Stone’s notebooks. In this guest blog, he explains the methods he uses to identify authorship and the results he has obtained so far.
We all know how it is possible to tell a lot about a person from the way they use language, for example from their accent or dialect. Similarly, our own unique linguistic identity is expressed in written language through the choice of words or grammatical constructions that we tend to use regularly, often unconsciously. Authorship analysis examines this written information to identify the author of a text, often using computational and statistical methods.
Although authorship analysis can be applied to any problem of disputed authorship, the most typical questioned documents are either texts of literary/historical importance, like the ones in Arthur Stone’s notebook, or texts that have become evidence in a criminal case or civil dispute. The latter scenario falls under the umbrella of Forensic Linguistics, which deals with the application of Linguistics to the law. Examples of forensic authorship analysis include the analysis of anonymous threatening letters (e.g. the Devil Strip case), sets of incriminating text messages (e.g. the Amanda Birks case), or coerced police statements (e.g. the case of the Ayia Napa rape). In the literary/historical context, on the other hand, the disputed documents could be novels written by authors using pseudonyms (e.g. J.K. Rowling writing as Robert Galbraith), historically important documents (e.g. the Jack the Ripper letters or the Bixby letter), or even poems or lyrics (e.g. The Rose of Tralee, the Dutch national anthem).
Within the context of the Dickens Code project, the most interesting authorship question is ‘who wrote the texts that Arthur Stone (and Dickens) transcribed into shorthand?’. By ‘wrote’, we mean ‘composed’: who produced the wording of the text, who was the wordsmith that ‘put the words together’? If Arthur was writing down a text that was dictated to him, then who wrote the text that was dictated? If, on the other hand, Arthur was transcribing improvised speech, who was the speaker? The most obvious candidate is clearly Dickens himself. Could these texts therefore be new, undiscovered Dickens texts? My role in the project is to explore this hypothesis using modern forensic authorship analysis techniques that are applied in real criminal cases.
The specific authorship problem of having one candidate author and one disputed text is called authorship verification. This is challenging because, in order to be able to say that a candidate is the author of a text, one must exclude all other potential candidates. In most cases, this is a very long list! The field of computational authorship analysis has spent a couple of decades looking for solutions to this problem, and one technique that has been found to be reliable and robust over the years is known as the Impostors Method. The concept behind the method is rather intuitive: when the police want to verify that an eye-witness is genuinely confident about their visual recognition of a suspect, they ask the witness to pick out the person they have seen from a line-up that includes the suspect and a number of impostors who resemble the suspect. You can do a similar linguistic line-up for written texts. This line-up would include a set of texts by the ‘suspect author’ (the author you think the text was written by) and sets of texts by ‘impostor authors’. By statistically comparing these different sets of texts against the disputed text, you may find which set is the best match. This process, however, needs to be set up carefully.
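The line-up logic can be sketched in a few lines of Python. This is a simplified, hypothetical illustration rather than the actual procedure used in the project: the texts are reduced to toy word-frequency profiles, the similarity measure is a simple frequency overlap, and on each trial a random subset of features is kept. The final score is the fraction of trials in which the candidate’s texts match the disputed text better than every impostor’s.

```python
import random

def similarity(p, q, features):
    """Overlap of two word-frequency profiles, restricted to a feature subset."""
    return sum(min(p.get(f, 0.0), q.get(f, 0.0)) for f in features)

def impostors_score(disputed, candidate, impostors, trials=100, keep=0.5):
    """Fraction of random-feature trials in which the candidate beats
    every impostor as the closest match to the disputed text."""
    features = list(set(disputed) | set(candidate))
    wins = 0
    for _ in range(trials):
        # Resample the features each trial so a win is not driven
        # by one or two accidentally shared words.
        subset = random.sample(features, max(1, int(len(features) * keep)))
        cand_sim = similarity(disputed, candidate, subset)
        if all(cand_sim > similarity(disputed, imp, subset) for imp in impostors):
            wins += 1
    return wins / trials

# Toy profiles: relative frequencies of a handful of function words.
disputed  = {"the": 0.10, "of": 0.08, "whilst": 0.02}
candidate = {"the": 0.09, "of": 0.07, "whilst": 0.02}
impostor  = {"the": 0.01, "of": 0.01, "whilst": 0.00}
print(impostors_score(disputed, candidate, [impostor]))  # 1.0: candidate wins every trial
```

A score near 1 would count as support for the candidate author; a score near 0 is a strong ‘no support’ result of the kind described later in this post.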
In order for the impostors to be good impostors, you need to make a representative collection, called a ‘corpus’, of their texts. The most important characteristic that the texts in these corpora need to share is genre: for example, it only makes sense to compare novels with novels rather than novels with essays. In the case of the Dickens Code project, we are not really sure what the questioned texts are and whether their author was improvising or dictating from a written source. If the author was dictating from a written text, then comparing the questioned texts to other narrative texts is not a problem. However, if they were improvising, then we simply don’t know whether the method can work, because there isn’t enough research on this particular mode of production.
How does the testing of corpora actually work? Essentially, without going into technical details, we are statistically comparing how similar texts are to each other in the way that they use grammar and combinations of words. After a large number of these comparisons have been made, if the candidate author corpus (in our case Dickens) turns out to be the most similar one to the questioned text (the transcribed story) compared to the corpora of impostor authors (other Victorian writers), you can be confident that the candidate author is indeed the real author.
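As a rough illustration of what ‘similar in their combinations of words’ could mean computationally, here is one common kind of style marker: word-bigram frequency profiles compared by cosine similarity. This is a generic sketch, not the specific feature set or measure used in the project.

```python
from collections import Counter
from math import sqrt

def bigram_profile(text):
    """Relative frequencies of adjacent word pairs in a text."""
    words = text.lower().split()
    counts = Counter(zip(words, words[1:]))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def cosine_similarity(p, q):
    """Cosine of the angle between two sparse frequency vectors."""
    shared = set(p) & set(q)
    dot = sum(p[k] * q[k] for k in shared)
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

a = bigram_profile("it was the best of times it was the worst of times")
b = bigram_profile("it was the best of days it was the worst of days")
c = bigram_profile("call me ishmael some years ago never mind how long")
# Texts sharing many word combinations score higher than unrelated ones.
assert cosine_similarity(a, b) > cosine_similarity(a, c)
```

In practice such comparisons are made over millions of words and many feature types, but the principle is the same: the corpus whose profile sits closest to the disputed text is the best match.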
When the first transcribed text was announced, the Two Brothers story, I immediately set out to test whether we had perhaps discovered a new version of the story written by Dickens, maybe an earlier draft of the version published in To Be Read at Dusk. Because the preparation of the data is the most important and time-consuming part of applying the method, and because I wanted to present my findings at a Symposium in Rome in July 2022, I decided to run a simple pilot study using an already available corpus of novels which included Dickens and other comparable nineteenth-century authors.
I remember how, with some excitement, I sat down during a very hot Italian summer to write the computer code to run the analysis and how I impatiently waited for the results. After about half an hour of computations, the answer appeared on my screen: the text was very unlikely to have been written by Dickens. I repeated the analysis a few times with random variations to verify that the results were robust, and they consistently pointed away from a Dickens authorship. However, while doing so, I noticed something quite atypical: one of the impostors kept coming back with a very strong score, of the kind that would be more typical of the correct candidate author. This situation reminded me of the Dutch national anthem case mentioned above, where the researchers accidentally discovered the most likely author only because he was treated as one of the impostors. Who was our mysterious impostor? I was surprised to discover that the method was returning very strong scores of similarity for Elizabeth Gaskell!
When such a thing happens, a careful researcher should treat it as a statistical fluke. However, if this was a fluke, then it had to be a really unlikely one because it is well-known that Gaskell had accused Dickens of stealing one of her ghost stories in the past. This connection seemed to be too strange to be a coincidence! I therefore presented these results at the symposium, with the added caveat that this was a pilot study that needed more carefully collected data for replication.
After the summer of 2022, I began to re-test the Dickens and Gaskell authorship hypotheses in a forensically thorough way. With the help of Dr Lucy Hanks, who is a Gaskell expert, we collected a more comprehensive data set. The final set contains around five million words for Dickens, two million words for Gaskell, and almost ten million words for other authors who could be considered good impostors for either of them, such as Collins, Thackeray, Trollope, Eliot, or Edwards.
Armed with this excellent data set, I was then ready to run my analysis again, this time with more confidence in its robustness. I processed the data with my computer code and, after a few repetitions, it became quite clear that the Gaskell signal was not there. The results did not exclude Gaskell as strongly as they excluded Dickens, but the magnitude of the signal was small and therefore leaned more towards an exclusion of Gaskell as well or, at best, an inconclusive result.
In a very surprising turn of events, therefore, this new analysis suggests that the Rome discovery had indeed been just a statistical fluke, maybe caused by the composition of the data set I used for the pilot. At least, this is the most likely hypothesis arising from the linguistic evidence. It could also be that the pilot study had detected something real, but that the new analysis, with far more data and clearer results, better explained it as noise.
In sum, although the results of the analysis, when interpreted with the rigour of a real forensic analysis, do not support a Gaskell authorship, there might still be a chance that Gaskell had something to do with this. Authorship can sometimes be a complex construct and, in this particular case, we do not know the exact methods of production of the disputed text. Could Dickens have been reading a Gaskell story while also improvising, and therefore adding some of his linguistic identity to the text? Whatever happened in that room and wherever our imagination can take us, the forensic linguistic conclusion, disappointingly, can only be that there is no support for the hypothesis that Gaskell was the real author.
After analysing the Two Brothers story, I then focused on two other decoded texts, Anecdote and Nelson. Although there was no reason to include Gaskell as a potential candidate for these stories, I added her because of the strange Rome discovery. In both cases, the analysis came up with a strong ‘no support’ conclusion for both a Dickens and a Gaskell authorship. The fact that neither of them came up as even a potential suspect is, at the very least, intriguing. Who composed these texts that Arthur wrote in shorthand?
Because of this unsolved mystery, I feel that research on authorship will still play a part in the future of the Dickens Code project. As more texts are decoded, the Dickens authorship hypothesis will be tested as well as, if available, alternative plausible authorship hypotheses. If these texts were dictated, then the forensic authorship analysis methods should be able to detect the real author’s signal, whether this is Dickens or someone else. Perhaps a new literary mystery is waiting for us just around the corner.