Digitized Texts and the Problems of Textual Analysis for Historical Japanese Books

Paper presented at Histories of the Japanese Book: Past, Present, Future, 2013

The volume of digitized texts in online databases in Japan and elsewhere has grown in recent years, and it continues to increase as more libraries, universities, and research centers become involved in such projects. Notable databases of digitized page images of rare and historical books include HathiTrust, the National Diet Library’s Kindai Digital Library, and Waseda University’s Kotenseki Sōgō Database of Japanese and Chinese Classics. This type of database, and the general unavailability of databases of Japanese books that are not based on page images, raises several issues for the study of Japanese literature and history via Edo-, Meiji-, and Taishō-period texts, even as remote access to these images makes possible research that otherwise could not take place. Such databases encourage the study of the book as a visual object, and they facilitate quantitative research on publishing history by providing metadata for a large number of books; yet, because they do not include copies of texts in machine-readable format, they preclude quantitative and qualitative research across texts using now-standard techniques such as those grouped under the umbrella of natural language processing (NLP). This stands out as an issue in the study of the Japanese book and of Japanese literature, because English-language corpora of machine-readable texts, such as nineteenth-century novels, have been growing and now support large-scale research using NLP.

Yet the analysis of plain text, and the creation of machine-readable text in the first place, also raises theoretical issues. Problems specific to Japanese-language sources include the separation of text from image, especially in the Edo period, and the use of ruby, or glosses, which add nuance to Chinese characters by providing alternate readings and which can be lost in the process of creating a machine-readable text. Here, I will investigate the problems that machine-based textual and topic analysis raises for historical Japanese texts, and the merits of studying the book as a visual and material object rather than as a decontextualized text. My presentation raises larger questions about the applicability to books from Japan of methods developed for the analysis of Western books, such as Text Encoding Initiative (TEI) markup and the use of logocentric machine-readable text. Broadly, I consider the advantages and drawbacks of both visual and textual approaches to the study of the book, and their implications for the study of books produced worldwide, not just in Japan.

I would first like to introduce several examples of the kind of database of book page images that I have mentioned, and the ways in which they shape our approaches to the study of texts via digital surrogates. In the first place, there are vast inconsistencies across, and even within, databases with regard to the quality of those surrogates, and they raise the question of what kind of surrogate is sufficient for the study of the Japanese book. Our first criterion must, I think, be readability: the simple ability to process the text linguistically in order to approach it as scholars of literature and history. HathiTrust, and the Google Books project in general, are extremely problematic in this regard: some pages cannot be read by the human eye without considerable difficulty, and their quality also makes Optical Character Recognition, or OCR, impossible. This sample page from HathiTrust’s scan of a book reproduction of Shigarami zōshi, a Meiji periodical, is barely readable for humans and not readable at all for a computer. This is an extreme case, and I’ll now show two examples of books by Wakamatsu Shizuko, a Meiji author, digitized and made accessible by the NDL’s Kindai Digital Library. As you can see, the first is a poor digitization, most likely produced from a scan of microfilm; the Kindai Digital Library’s initial holdings were all based on the Maruzen Meiji Microfilm collection, and they reflect the poor quality of both the scanning and the original sources from which the scans were taken. (Incidentally, this also demonstrates that this kind of resource was available even in the pre-digital age, enabling remote research using sometimes rare texts.)

Even if the Kindai Digital Library were to attempt OCR on its texts, the quality of this scan might make it difficult or even impossible; for a human, the text is readable, but much information about the book itself is lost to us, such as any color that might have been used and the character and quality of the cover and pages. The second example is characteristic of the newer holdings in the Kindai Digital Library: it is in color, and we can glean much more information about the book itself from this digital surrogate. Because its metadata includes information about size and page count, it is possible to learn a great deal about this book without consulting it in person. Thus it gives us much more than the “text” itself: it provides information essential to understanding publishing and reading in the Meiji period, which we can barely glean from the previous, lower-quality, black-and-white example. Even though OCR may still not be possible in this case, for reasons I will discuss in a moment, a human can not only read the text but also “read” the qualities of the book itself, which is essential for the study of the book as an object, as in much of the field of book history. These two examples illustrate the problem of judging digital surrogates: for the study of the language of a text, the former might be enough, as it gives access to the logocentric text itself; but for the study of the book, publishing history, or related fields, it is insufficient. Even in literary studies, research often focuses on more than the language of the text alone, so it is important to understand the text’s material form as a book, for example in order to glean information about its reception. Only the higher-quality color scan can provide the information needed for such study, and so in this case only the latter truly counts as a digital surrogate of any value.

Edo texts available online are less problematic for the study of the book as an object, and in particular as a visual object. Waseda University’s Kotenseki Sōgō Database of Japanese and Chinese Classics, as well as some holdings of the National Institute of Japanese Literature, are available online as full-color digitizations. The NIJL’s holdings are somewhat problematic in that they do not offer digitizations of full books or manuscripts; rather, they provide sample images. Moreover, these images are relatively small and often blurry, as though they were taken with a point-and-shoot camera by someone without expertise in the digitization of rare books. Thus we see again the problem of digitizations that we cannot properly characterize as digital surrogates, because they are too partial and too low in quality to stand in for the originals. The Waseda database, on the other hand, is of archival quality, offering full digitizations of works at high resolution, in color, and in crisp focus. Not only are the page images of high quality, they also include color samples at the top of each page for reference. In these respects, Waseda’s digitization efforts follow established standards in the field and provide researchers with true surrogates, as far as that is possible, for a wide variety of Edo books and manuscripts. Moreover, the works are downloadable as PDFs for easy offline access, a relative rarity among Japanese databases of digitized texts.

As you can see in this sample page from a Waseda digitization, combined with metadata that identifies the size and page count of the book, it is fully possible to study the book itself as an object here. The book can also be studied as a text by reading through the page images, something particularly facilitated by the PDF version, which can be paged through as easily as any modern eBook, as opposed to the online HTML version, which requires clicking on and viewing each page image separately. Thus the high quality of Waseda’s database and the multiple options it offers make possible many types of research using digital surrogates, and indeed using the metadata about the books and manuscripts provided within the database’s search results themselves. It is an exemplar of a digitization project that facilitates both qualitative and quantitative research, and it takes advantage of the multiple opportunities that the digital format of both catalog and surrogate provides. While Waseda’s database might not replace going to Japan to look at a book in person, it comes close to a substitute, and it may enable a great deal of remote research that would otherwise not be possible, in terms of both money and time spent abroad. As exemplary as Waseda’s database is, however, we are still left with the question of how to perform fast analysis across a broad corpus of texts. In other words, just like the other databases introduced here, it provides no OCR for early modern and modern books, and of course no hand-entered, marked-up version of the text in machine- or human-readable format. We are limited to individual, human-paced close readings of books’ page images, unable even to leaf through a book efficiently, but rather encouraged to focus on pages one by one.

Let’s look at an example of a digital archive that does provide texts that are machine-readable as well as human-readable (unlike, for example, Google Books, which stores machine-readable OCRed versions of more modern books but, because many are still in copyright, cannot make human-readable versions of that plain text available). It’s probably familiar to everyone in this room: the Aozora Bunko project. This project provides painstakingly hand-entered versions of public domain texts, which are, as far as I can tell, all from the modern period. At first glance, we may think we have found the perfect archive for machine analysis across a broad corpus, with plain text versions available for a computer to process. However, you can see from this image that there is one additional element we have to take into account: glosses, or what are called ruby. Thankfully for human readers, Aozora texts by and large all contain ruby; they also sometimes offer a choice between traditional and modern characters and kana usage. This is a boon for those of us who want to use Aozora texts but also want to be as close to the original as possible in our reading, or who perhaps want to assign the Aozora version of a text to a class. But as I’ll explain here, it also raises some issues for processing texts with computers.

Beyond plain text like Aozora’s, digitized texts are often marked up using the Extensible Markup Language, or XML. XML serves multiple purposes: it can be used to mark up a text semantically, tagging individual elements, or, in the case of a specialized vocabulary of XML called the Text Encoding Initiative, or TEI, to mark up a text’s visual features. This means, for example, that a manuscript being inputted can be marked up to show where a word was crossed out or where annotations were written by hand in the margins, or, in the case of a printed book, hand annotations, italics, or images. The result is a processable text that contains a huge amount of information about the original from which it was created, a great help for book and literary historians. Using stylesheet languages like XSLT and CSS, you can display these texts very flexibly, using the information in the markup tags to, for example, render original italics in italics on the computer, show deleted words with strikeout, and raise hand annotations so they display above the main text, or in boldface, or in a different font. This gives great options for displaying texts in digital archives online. Aozora does not in fact use TEI, which is a very rich and standardized markup language, but it does use XML and HTML to place ruby where they belong, above the words they gloss, using the <ruby> and <rb> tags. Here’s an example of what that looks like in the markup itself. And here’s an example of a web browser displaying ruby properly. Not all browsers do, but it is becoming more standard. So you can see that using a markup language to mark off elements of the text as apart from the main text, or special in some way, is an important feature of creating a digital text, one that makes it potentially more readable and richer for humans as well.
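Schematically, ruby markup of this kind looks like the following (a simplified illustration of HTML ruby tagging, not Aozora’s exact output; the <rp> elements provide fallback parentheses for browsers that cannot render ruby):

```html
<p>
  <ruby><rb>東京</rb><rp>（</rp><rt>とうきょう</rt><rp>）</rp></ruby>の
  <ruby><rb>白糸</rb><rp>（</rp><rt>しらいと</rt><rp>）</rp></ruby>
</p>
```

A browser with ruby support displays とうきょう in small type above 東京; one without it falls back to rendering the gloss inline as 東京（とうきょう）.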

Using XSLT, which is often called a stylesheet language but is in fact a functional programming language as well, in conjunction with XML, and specifically with TEI texts, you can do even more than display elements in different ways. You can use it to build a concordance and other statistics, and to let users interact with texts by showing keywords in context or by highlighting or replacing words in a given text. Potentially, in the case of an Aozora text, you could also manipulate the ruby, or compile a list of all ruby-glossed words, using the <rb> tag that I showed you. Without markup tags, XSLT has no structure to operate on; in this way the <rb> and other tags make the text machine-readable in a rich way, facilitating text analysis by giving parts of the text meaning.
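To suggest what such extraction looks like in practice, here is a minimal sketch, written in Python rather than XSLT, that compiles a list of ruby-glossed words from a ruby-tagged fragment. The fragment is my own simplified example, not taken from an actual Aozora file:

```python
import xml.etree.ElementTree as ET

# A small, well-formed fragment in the style of HTML ruby markup
# (simplified: real Aozora files also include <rp> fallback tags).
fragment = """<p>
<ruby><rb>東京</rb><rt>とうきょう</rt></ruby>の
<ruby><rb>白糸</rb><rt>しらいと</rt></ruby>
</p>"""

root = ET.fromstring(fragment)

# Collect (base word, gloss) pairs -- the XSLT equivalent would be an
# xsl:for-each over //ruby selecting the rb and rt child elements.
pairs = [(ruby.findtext("rb"), ruby.findtext("rt"))
         for ruby in root.iter("ruby")]
print(pairs)  # [('東京', 'とうきょう'), ('白糸', 'しらいと')]
```

The same selection logic underlies a concordance or keyword-in-context display: the tags tell the program which spans are base text and which are gloss.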

However, there are issues here as well as potential. One major issue of compatibility and standards is that while TEI is a broadly accepted standard, it was essentially created for Western books. It has no standard element for marking up ruby, which is hugely important in Japanese books from the past to the present: ruby appear not only in early modern and Meiji texts, where they are quite common, but even in modern manga like the series “Death Note.” Moreover, ruby do not just provide readings for difficult characters; in the case of “Death Note,” for example, they provide alternate readings that add a layer of meaning that is lost when the ruby are not included. So marking up a Japanese text in TEI, trying to make it adhere to a standard accepted in the Western world, is basically impossible without losing that layer of meaning, and thus TEI, while hugely beneficial for those of us creating digital texts, necessarily sets apart non-Western books that use ruby. But I think this is an issue for more than just ruby-laden Japanese books: TEI also does not accommodate glosses very well in general, and so it is easy to lose a layer of meaning added to Western texts by this type of gloss as well. Ruby highlight a difficulty of translating texts into machine-readable digital format, but rather than being limited to Japanese books alone, they raise the issue for digital texts more broadly.

Ruby and non-standard or traditional characters create still more problems. One currently popular method of analyzing texts with computers is topic modeling, a branch of natural language processing, specifically semantic analysis, that uses statistics to identify abstract “topics” that appear in documents. Topic modeling can be used for a number of purposes, including automatic summarization, and it can also tell you what proportion of a document is dedicated to particular topics. When looking at a large corpus and trying to determine what it tends to address or focus on most, this is a particularly useful way to analyze texts without having to read through each one. When practicing methodologies like Franco Moretti’s distant reading, topic modeling can be an effective way to begin formulating hypotheses about texts, even if it does not always succeed, and I am not arguing that it replaces other humanistic methodologies that lead us to conclusions in our research. It is a popular and useful tool, however, and it is increasingly used in digital literary projects as well as other research.

However, let’s think back to what kind of texts we need in order to perform topic modeling. First, we need machine-readable text. So, can we use an Aozora corpus for this? Potentially, but here we run up against the wall of ruby, among many other natural language processing problems, including identifying named entities or even parts of speech. (Incidentally, there are many more tools for doing this in English; the only successful tool for Japanese that I know of is a program called KHCoder, which works on modern Japanese only.) How do we account for the multiple layers of meaning that ruby introduce? Do we count the base word and its gloss as two words? Do we throw out the ruby? Substitute the gloss for the base word? And how do we handle non-standard and traditional characters? Does it matter that we may link or replace them with current standard characters so the computer can process them? Japanese texts pose a huge problem for topic modeling, and I am using it as just one example of the difficulty of applying natural language processing techniques to texts with multiple layers of linguistic meaning, like traditional Japanese books. We have only to think of annotated manuscripts or annotated Chinese books in Japan to see more problems for natural language processing: how do you treat the annotations, as part of the main text, as something to be ignored in analysis, or as something different entirely?

Finally, I would like to remark on the difficulty of using natural language processing techniques like these with books that are heavily visual. Edo books like the one shown, especially fiction, not only contain illustrations but are made up of illustrations, in ways that make it impossible to separate out the logocentric text and still retain the full textual meaning created by the combination of illustration and verbal text. Is there a point to transcribing the text from a book like this and analyzing it as something separate from the illustrations? Will that actually tell us something meaningful about the book? And what happens when we lose the visual effect of the calligraphy? This extends far beyond Edo books, although they highlight the issue very effectively. I wonder what is lost in digitally transcribing manuscript texts, for example, no matter how much TEI markup is used to try to retain the format of the original, because we lose the handwriting in the process. And what about Western illustrated books?

Transcribing a text into machine-readable format necessarily prioritizes the verbal text over the visual, at least with our currently available techniques of analysis, which operate only on linguistic elements. Moreover, a massive amount of human time and effort would be required to transcribe such texts, even if we decided it were a worthwhile enterprise, since OCR cannot even process most pre-WWII Japanese books, let alone kuzushiji and handwriting. You may point out that we already have transcribed versions of Edo books, in the form of reprints and texts included in anthologies, some of them typeset in modern fonts rather than reproduced in facsimile. Thus the question of transcription into modern typeset format, whether digital or print, is a broad one. Are these kinds of transcriptions helpful or problematic, or both? Do they preclude or facilitate interpretation and study of these texts? And, importantly, what types of interpretation do they make possible or impossible? I argue here that transcribing texts into digital format, with or without accompanying images, necessarily limits our understanding of those texts by removing their essential visual elements, without which we cannot fully interpret or give meaning to them. At the same time, my argument for turning instead to page images of these books precludes machine analysis like natural language processing.

I don’t have answers as to which is more important, but I think it is key to consider how we represent texts digitally when creating surrogates, and how that both facilitates and limits our analysis of them. Nor do I want to imply that this problem is limited to the specific subset of Edo-period books illustrated in this way. What do we do with Western books that have illustrations? Meiji books? Modern Japanese books? What happens to the study of literature in its material context when we are looking primarily at an Aozora text? And finally, why not take into account the material aspects of the digital text itself in our consideration and study of it? It is not as though digital texts have no material embodiment, although they are often described that way. I hope this presentation has gone at least some way toward ascribing material qualities to digital texts, concrete qualities that shape the possibilities of our interpretation, the very questions we are able to ask about texts. And I want to continue to emphasize the differing material and intellectual qualities of the two types of surrogates I have discussed: page images and machine-readable text. It is crucial to keep these differences in mind as we consider what matters more, a text readable to humans in an optimal format (which I would argue is the high-quality page image), or a text readable to a machine in a format that can be searched and processed at larger scale, which necessarily means plain or marked-up text.

One quality of machine-readable text that stands out is its emphasis on the text itself: it takes the text out of its original material context and erases that context to a much greater degree than do high-quality page images. (As I have already discussed, many digitized page images are of such dubious quality that they can no more properly be called digital surrogates than could a plain text riddled with typos, so I am not considering them alongside high-quality plain text here.) Yet this de- and re-contextualization is perhaps necessary in order to perform machine-based analysis, especially across large corpora, given that machines cannot yet OCR and understand photographed pages – and even the process of OCR strips the text of its context. Are we always to prioritize the linguistic text so that we can use models like the topic model, and what happens when we analyze large corpora where we have no other options? What do we lose, and what do we gain? Again, I do not have a conclusive argument that one method or the other – distant reading at the cost of losing layers of meaning, or close reading at the cost of being able to analyze a larger number of texts – is better or more legitimate. Rather, I want to bring these issues to our attention and to discussion, and to emphasize how digital form both limits and broadens the kinds of questions we can ask of books.

I would like to conclude with a suggestion, however. My own methodology leans toward starting a research project with more distant reading, in order to get a sense of the field and of what is available to me, and to begin forming hypotheses about a set of books. I can imagine that topic modeling, or gathering information from catalog metadata about a large number of books, could be a very useful starting point. But I do not think it replaces close reading, and it is necessary to investigate those hypotheses through closer reading in order to support or refute them with concrete evidence. Moreover, studying the book itself adds context back to, for example, the linguistic text that a computer may have analyzed. So I see techniques like topic modeling and other semantic analysis methods as a starting point, but not the end point, of research. I would be interested to hear your thoughts on this, on other combinations of distant and close reading, and on the possibilities and constraints of different types of digital surrogates.