In which I cover how to set up a crawler and let it loose using Scrapy (Python). I also go into parsing HTML to find metadata within inconsistent and messy markup. A how-to post with some links to code.
- Python itself
- Scrapy web crawling framework
- Beautiful Soup
- regular expressions
How I came to the project
Aozora Bunko is a digital archive of plain-text and XHTML versions of modern Japanese literature, including both fiction and non-fiction. It consists of public domain texts, largely from the Meiji (1868-1912) and Taishō (1912-1926) periods, by a wide variety of authors, both famous and not (at least, now). Coverage of individual authors is uneven and does not track their fame or popularity at the time they were writing; for example, the best-selling author of the Meiji period, Ozaki Kōyō, has relatively few works on Aozora. Still, it's an amazing resource and has been especially valuable for instructors teaching Japanese language overseas.
I decided to harvest as many texts as I could (in the end, about 50,000 works) from Aozora to collect a broad corpus of Japanese literature, especially since archives like Project Gutenberg have relatively few Japanese-language works and Google Books has no OCR for Japanese, thus limiting our access to these works in machine-readable format. In addition, I have some colleagues who are interested in performing various types of text analysis on Japanese literature, so I wanted to provide them with a corpus we could all experiment on for digital humanities research.
What did I need, then? I needed to figure out how to write a web crawler, extract the main body of the text (without metadata) and also extract relevant metadata contained in the body of the document with the text, and then clean out the HTML (and optionally, pronunciation and other glosses on Chinese characters, called ruby).
How I found and installed Scrapy
Because I program largely in Python these days, I sought out a library that I could use for web crawling in this language. I soon found Scrapy, a "fast and powerful framework" for scraping websites and creating your own web crawler.
Scrapy is easy to install, and functions as a command line tool in addition to having a Python API. It's extremely customizable, and I actually just adapted the tutorial with my own code and ran it as-is. I'm still figuring out how to write my own separate crawler and how to use the API, which requires running Twisted, an asynchronous networking framework for Python. Visit Scrapy's site for installation instructions and a tutorial that documents the command line tool.
How to change the Scrapy tutorial to process your own stuff
As I mentioned, I didn't dive deep enough into Scrapy to write my own crawler; instead, I modified the tutorial and ran it to obtain my files from Aozora. To do this, you just need to modify the dmoz_spider.py file, which holds the main spider code. I didn't modify anything else. Instead of using Scrapy's pipeline, I simply had dmoz_spider.py open and write the files itself: it extracts the metadata (which goes into each filename) and the main text body (which goes into each file, with the HTML retained for later processing). Beyond this, about all I did was rename my spider "aozora" and point it at Aozora Bunko rather than the site the tutorial crawls, with some rules about which pages to follow so that it fetches just the texts rather than needlessly crawling the entire site.
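    # Import paths for current Scrapy releases; older versions used scrapy.contrib.
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class AozoraSpider(CrawlSpider):
        name = "aozora"
        allowed_domains = ["aozora.gr.jp"]
        start_urls = ["http://www.aozora.gr.jp/index_pages/index_top.html"]
        rules = [
            Rule(
                LinkExtractor(
                    # Follow only index, card, and numbered text pages.
                    allow=(r"sakuhin\w*\.html", r"card\w*\.html", r"[0-9]+\.html"),
                    # Skip the full index, the home page, and author pages.
                    deny=(
                        r"http://www\.aozora\.gr\.jp/index_pages/index_all\.html",
                        r"http://www\.aozora\.gr\.jp/index\.html",
                        r"person\w*\.html",
                    ),
                    # Scrapy prepends the dot itself, so none is given here.
                    deny_extensions=["ebk", "zip"],
                ),
                follow=True,
                # parse_item (not shown) extracts the text and writes the file.
                callback="parse_item",
            )
        ]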
What I did to process - input, change, output files
I had the spider read in each file and pull out the body of the HTML file (to get the main text). In the first run, I also stripped out glosses, or ruby, by removing their HTML tags and everything in between. (For example, a Chinese character might have a pronunciation attached to it with ruby tags. I removed the tags and the pronunciation, leaving only the character.) Then I removed the rest of the tags using a simple regular expression. This left absolutely "plain" text with no HTML tags at all.
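The first-run cleanup can be sketched roughly as follows. This is a minimal illustration, not the original script: the sample markup, the pattern names, and the function `strip_ruby_and_tags` are all assumptions made for the example, and regular expressions like these only cope with simple, well-behaved markup.

```python
import re

# Hypothetical sample: one glossed word wrapped in ruby markup, inside a
# made-up container div.
SAMPLE = (
    '<div class="main_text"><ruby><rb>可哀相</rb><rp>（</rp>'
    '<rt>かわいそう</rt><rp>）</rp></ruby>な姉<br /></div>'
)

# Match an <rt> or <rp> element together with its contents (the gloss and
# the fallback parentheses around it).
RUBY_GLOSS = re.compile(r"<(rt|rp)\b[^>]*>.*?</\1>", re.DOTALL | re.IGNORECASE)
# Match any remaining tag, leaving the text between tags untouched.
ANY_TAG = re.compile(r"<[^>]+>")


def strip_ruby_and_tags(html):
    """Return plain text: drop ruby glosses first, then every other tag."""
    without_gloss = RUBY_GLOSS.sub("", html)
    return ANY_TAG.sub("", without_gloss)


plain = strip_ruby_and_tags(SAMPLE)
print(plain)
```

The order matters: if the generic tag pattern ran first, the pronunciation text would be left behind in the output with nothing marking it as a gloss.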
On the second pass, however, I didn't remove ruby or the HTML tags, because of an issue with using lxml with Beautiful Soup. I'll deal with it later.
Next, I moved on to retrieving metadata and writing it into the filename of the saved text. This was easily the most difficult part, because the metadata is barely structured and contained in a footer of each text file, surrounded by HTML tags. Thankfully, however, it's consistent from file to file (more or less). So I located the div that contained the metadata block, then went in order looking for lines that started with things like 底本： (source document) or 初出： (original place and date of publication). The author and title were not marked but were always in the same order at the beginning of the document. I simply split the strings after the ： to retrieve the metadata rather than the field name. The most difficult thing to retrieve was the year, but I looked for the first four-digit sequence after the source or original document name and typically found it that way. I can't say it was always reliable, but it was the best and only way I found to tackle this dirty metadata "scheme."
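Since the original script isn't available, here is a small sketch of the logic just described: split labelled lines on the full-width colon, then take the first four-digit sequence that follows a source or first-publication name. Everything in it is assumed for illustration, including the sample footer, the 入力： (typist) label, the field names, and the "xxxx" placeholder for missing values.

```python
import re

# Hypothetical footer text with the HTML tags already removed; a real
# Aozora footer may differ in wording and layout.
FOOTER = """底本：「アンドロギュノスの裔」薔薇十字社
　　　1970年
初出：「新青年」
　　　1927年
入力：森下祐行
"""

# Footer labels mapped to field names (the names are made up here).
FIELD_LABELS = {"底本": "source", "初出": "first_published", "入力": "typist"}
YEAR = re.compile(r"[0-9]{4}")
MISSING = "xxxx"  # assumed placeholder for metadata that was not found


def first_year_after(lines, start):
    """Return the first four-digit run before the next labelled line."""
    for line in lines[start:]:
        if line.partition("：")[0] in FIELD_LABELS:
            break
        match = YEAR.search(line)
        if match:
            return match.group()
    return MISSING


def parse_footer(text):
    """Collect labelled values, plus a year for the two document fields."""
    fields = {}
    lines = text.splitlines()
    for i, line in enumerate(lines):
        label, sep, value = line.partition("：")
        key = FIELD_LABELS.get(label)
        if not sep or key is None:
            continue
        fields[key] = value.strip()
        if key in ("source", "first_published"):
            fields[key + "_year"] = first_year_after(lines, i + 1)
    return fields


def build_filename(aozora_id, author, title, fields):
    """Join the metadata with underscores to form the output filename."""
    parts = [
        aozora_id, author, title,
        fields.get("first_published", MISSING),
        fields.get("first_published_year", MISSING),
        fields.get("typist", MISSING),
        fields.get("source", MISSING),
        fields.get("source_year", MISSING),
    ]
    return "_".join(parts) + ".txt"


fields = parse_footer(FOOTER)
filename = build_filename("195", "渡辺温", "可哀相な姉", fields)
print(filename)
```

Stopping the year search at the next labelled line keeps a missing source date from silently picking up the first-publication year, though it will still grab any unrelated four-digit number that happens to come first.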
[Editor's note, August 2021: Back in 2015 when I first posted this tutorial, I said "I'll be uploading this script to Github and will link to it when it's there, because this part of the code is long and complicated." However, that never happened. If you'd like a copy of this script please contact me. It hasn't been uploaded because I didn't get around to it, not because I'm averse to sharing - and I've been done with this project for so long I probably will never come back to it. The code uses libraries that have changed or been replaced in the intervening years, and Aozora itself is mirrored on Github now so there isn't a need to scrape it to download all the content. BUT, what I did in this step would still be needed to clean all the HTML files even if you pulled them from the Github repo, and the logic probably is still applicable.]
I used a one-second delay between each step of the crawl, and it took roughly 24 hours to obtain over 50,000 files. I did this so I would not overwhelm Aozora's site, which is run basically by one guy, and I wasn't sure what kind of hosting they have.
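The post doesn't record exactly how the delay was configured. In a Scrapy project, one common way to get this behavior is the `DOWNLOAD_DELAY` setting in the project's settings.py, so the fragment below is an assumption about the setup rather than a copy of it.

```python
# settings.py (fragment) -- assumed configuration, not the original file.

# Wait roughly one second between consecutive requests to the same site.
DOWNLOAD_DELAY = 1

# Scrapy's default behavior: each wait is randomized to between 0.5x and
# 1.5x DOWNLOAD_DELAY, so requests are not sent at a perfectly fixed rhythm.
RANDOMIZE_DOWNLOAD_DELAY = True
```

At one second per request, 50,000 pages account for about 14 hours of waiting on their own, before any download or processing time is counted.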
The command I used to run Scrapy was:
    scrapy crawl aozora -s JOBDIR=crawls/aozora1

from within the tutorial directory. To explain, I'm setting a directory to keep track of the pages crawled so I can pause the job and come back to it later without re-crawling all the pages. It also means I can run the crawl every couple of months to obtain newly-added texts without re-crawling all of Aozora (and taking another 24+ hours to do so). The directory contains the files that keep track of which pages have been crawled in that job.
Post-processing to segment and remove ruby
Of course, plain Japanese text is not very interesting to us. The reason is that there are no spaces between words, so raw Aozora files can't be used with typical visualization and other text analysis "tools" meant for languages with whitespace dividing words, or tokens. The first step in my original script was to remove ruby, using regular expressions, but in my most recent crawl I decided to preserve them so I can later create a database of these glosses and their accompanying glossed words (more below). The original script also used TinySegmenter to tokenize into words. However, going forward I'm using MeCab, another package that tokenizes as well as does POS tagging. It's much more powerful and does a much better job on tokenizing early-20th-century texts.
Here's the code I wrote to de-ruby using BeautifulSoup, a Python library for processing HTML files.
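    from bs4 import BeautifulSoup

    # `input` holds the HTML of one downloaded file.
    # The "xml" parser requires lxml to be installed.
    soup = BeautifulSoup(input, "xml")

    # Remove each listed tag along with everything inside it: rt is the
    # gloss, rp the fallback parentheses; '!r' is kept as in the original.
    for tagname in ("rt", "rp", "!r"):
        for tag in soup.find_all(tagname):
            tag.extract()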
Create a CSV with metadata
What about the metadata that I put in my filenames? Here's what I grabbed:
- Aozora Bunko ID number
- Author
- Title
- Original place of publication (if available)
- Original publication date (if available)
- Typist for Aozora Bunko (inputter) (if available)
- The name of the source text for the Aozora version (if available)
- Publication date of the source text for Aozora version (if available)
Here's an example of some filenames:
2_素木しづ子_三十三の死_xxxx_xxxx_小林徹_「現代日本文學全集 85 大正小説集」筑摩書房_1957.txt
195_渡辺温_可哀相な姉_「新青年」_1927_森下祐行_「アンドロギュノスの裔」薔薇十字社_1970.txt
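Turning filenames like these into a CSV can be sketched as below. The column names are my own labels, and the field order is inferred from the two examples above; this is not the original script, and it assumes no metadata value contains an underscore.

```python
import csv
import io

# Assumed column order, inferred from the example filenames.
COLUMNS = [
    "aozora_id", "author", "title",
    "first_published_in", "first_published_year",
    "typist", "source_text", "source_year",
]

FILENAMES = [
    "2_素木しづ子_三十三の死_xxxx_xxxx_小林徹_「現代日本文學全集 85 大正小説集」筑摩書房_1957.txt",
    "195_渡辺温_可哀相な姉_「新青年」_1927_森下祐行_「アンドロギュノスの裔」薔薇十字社_1970.txt",
]


def filename_to_row(filename):
    """Split one filename into a dict keyed by COLUMNS."""
    stem = filename[:-len(".txt")] if filename.endswith(".txt") else filename
    parts = stem.split("_")
    if len(parts) != len(COLUMNS):
        # An underscore inside a title or name would shift every field.
        raise ValueError("unexpected number of fields: " + filename)
    return dict(zip(COLUMNS, parts))


rows = [filename_to_row(name) for name in FILENAMES]

# Write to an in-memory buffer here; pass an open file object instead
# to save the CSV to disk.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=COLUMNS)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Raising an error on an unexpected field count is deliberate: a silently shifted row would put a typist's name in the date column, which is harder to spot later than a failed run.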
Future steps and why I'm interested in this metadata
So why this specific metadata? I'm not just interested in the "original" texts. For a long time, my research has been interested in how texts are passed down to us and what versions we end up with. I hope in the future to look at who has entered what, perhaps talk to those individuals or present them with interview questions to find out more about their interests and backgrounds, and also to look at what the typical year range and type of source text is. For example, do they come from early editions or later edited complete works editions? Other anthologies? Cheap paperback printings? How did Aozora itself come to be constructed, and what is it that we're actually presented with in the archive?
Other colleagues of mine are interested in analyzing the content of the works themselves. They are not as interested in ruby glosses, whereas I hope to make a database of those specifically in the future.
Because of these multiple potential uses, I simply tried to get the most information both in and about the texts, for later filtering depending on one's research interests. I hope this script and corpus can be useful to others going forward, who may be doing research I could never have foreseen!