As I begin working on my project involving Taiyō magazine, I thought I'd document what I'm doing so others can see the process of cleaning the data I've gotten, and then experimenting with it. This is the first part in that series: first steps with data, cleaning it, and getting it ready for analysis. If I have the Taiyō data in "plain text," what's there to clean? Oh, you have no idea.
Let's start with the encoding. The files are XML, encoded in Shift-JIS. This is a text encoding for Japanese that is still in use despite the existence of Unicode; if you've ever visited a Japanese website that rendered as gibberish, chances are your browser wasn't set to Shift-JIS (you can change that).
Well, I need my data in Unicode (UTF-8), not Shift-JIS. So my first action was to write a tiny Python script to open the files, read in the Shift-JIS text, convert it to UTF-8, then write it back to new files. Easier said than done. I was getting a Shift-JIS decode error, which I didn't investigate very hard. It means that the script encountered what it thought was an invalid character and choked. What did I do? The simplest thing for a large corpus where one single word/character isn't going to kill me: I set errors="ignore" in my script, which silently drops any bytes that can't be decoded, and moved on.
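For anyone who wants to try the same thing, here is a minimal sketch of that kind of conversion script. It is not my original script; the function name and the way files are passed in are just for illustration.

```python
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path) -> None:
    """Read a Shift-JIS file and write the same text back out as UTF-8."""
    raw = src.read_bytes()
    # errors="ignore" silently skips any byte sequence that is not valid
    # Shift-JIS, so a few characters may be lost along the way.
    text = raw.decode("shift_jis", errors="ignore")
    dst.write_text(text, encoding="utf-8")
```

You would call it once per issue, for example convert_to_utf8(Path("issue01.xml"), Path("utf8/issue01.xml")), where both paths are made up. One thing I did not try, so treat it as a guess: if the files came from a Windows machine, Python's "cp932" codec (a superset of Shift-JIS with Microsoft's extensions) might decode some of the bytes that "shift_jis" rejects.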
Converting XML tags
The thing is, the XML tags were all in Japanese too, and this can trip up etree, the Python XML library I use. I use TextWrangler as my text editor right now, and it has a nice feature: "multi-file search and replace." I selected all 60 issues of my corpus and did a replace for several tags: 記事 (kiji), 引用 (inyo), 踊字 (odoriji), and 外字 (gaiji). (The meaning of those words, if you don't read Japanese, is not exactly important.) I also had to change the attributes of several of those tags to romanized versions, including all the attributes of kiji, which delimits articles. Its attributes are items such as the title of the article, author, what section of the magazine it appeared in, subject matter, and so on. Oh, and I also removed this interesting character that is two horizontal bars, which I do not know how to type in and have since erased, so I can't paste it for you here. It stands in for something you can't type into the computer (such as an old or variant character that is not included in any character set as of yet). Ha.
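I did this step by hand in TextWrangler, but the same tag renaming could be scripted. The sketch below is an alternative, not what I actually ran: it covers only the four tag names listed above, and it leaves attribute names alone, since those would need their own mapping.

```python
import re

# Japanese tag name -> romanized tag name (the four tags named in this post).
TAG_MAP = {"記事": "kiji", "引用": "inyo", "踊字": "odoriji", "外字": "gaiji"}


def romanize_tags(xml_text: str) -> str:
    """Rename opening, closing, and self-closing tags; leave everything else."""
    for ja, roman in TAG_MAP.items():
        # Match "<" or "</" plus the tag name, but only when the name is
        # followed by whitespace, ">" or "/", so longer names are not touched.
        pattern = rf"(</?){re.escape(ja)}(?=[\s>/])"
        xml_text = re.sub(pattern, rf"\g<1>{roman}", xml_text)
    return xml_text
```

Running it over the UTF-8 text of each issue and writing the result to a new file would mirror the multi-file search and replace.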
Removing XML tags, Splitting into articles
Now what? We're still a couple of steps away from "plain text." I have 60 files of whole issues of Taiyō. I want to split those into individual files containing single articles, record the metadata (some of it) in the filename as I do so, and also generate a big CSV with all the metadata in it for the articles. So each row of the CSV (which is like a spreadsheet but not Excel-specific) will have the date, issue, author, title, subject, section of the magazine, and writing style of a given article. Here we go.
Step one: Separate into articles. Find all instances of the XML tag kiji and go through them one at a time; take the text out, and also take the attributes out of the kiji tag and store them.
Step two: Then, before writing the text out to a file, there are some more XML tags I need to remove:
- s, l, br, gaiji, inyo, and odoriji need to be removed but NOT the text in between the tags
These tags do things like mark sentences, mark location on the page (I think), create a line break, identify certain types of characters, and denote quoted speech.
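One way to drop those tags while keeping the text between them is a regular expression over the article's XML. This is a sketch under the assumption that the tags are already romanized and that none of their attribute values contain a ">" character; it is not necessarily how my own script does it.

```python
import re

# Tags to strip; the text they enclose stays in place.
STRIP_TAGS = ("s", "l", "br", "gaiji", "inyo", "odoriji")

# Matches <tag>, <tag attr="...">, </tag> and <tag/> for the names above.
# \b keeps a short name like "s" from matching a longer tag such as <style>.
TAG_RE = re.compile(r"</?(?:" + "|".join(STRIP_TAGS) + r")\b[^>]*>")


def strip_tags(xml_fragment: str) -> str:
    """Delete the listed tags from a fragment, leaving their contents."""
    return TAG_RE.sub("", xml_fragment)
```

Any tag not in the list, such as kiji itself, is left untouched.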
Step three: Write out the "plain text" with XML tags removed to a file with the naming convention "year_issuenumber_author_title.txt". Store the metadata for writing to CSV later as year, issuenumber, author, title, column, style, and genre.
Repeat for each article.
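Put together, the loop over one issue might look roughly like the sketch below. The attribute names in FIELDS are hypothetical (I haven't listed the romanized names I actually used), and the function names are made up for illustration. It relies on itertext() from Python's standard xml.etree.ElementTree, which returns the text of an element with every nested tag dropped, so it would remove any other tags inside an article too.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical romanized attribute names on the kiji tag.
FIELDS = ["year", "issuenumber", "author", "title", "column", "style", "genre"]


def split_issue(xml_path, out_dir):
    """Write each kiji element to its own text file; return metadata rows."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    rows = []
    root = ET.parse(str(xml_path)).getroot()
    for kiji in root.iter("kiji"):
        # Missing attributes become empty strings.
        meta = {field: kiji.get(field, "") for field in FIELDS}
        # itertext() yields all text inside the article, without any tags.
        text = "".join(kiji.itertext())
        # year_issuenumber_author_title.txt; "/" is not allowed in filenames.
        name = "_".join(meta[f] for f in FIELDS[:4]).replace("/", "-") + ".txt"
        (out_dir / name).write_text(text, encoding="utf-8")
        rows.append(meta)
    return rows


def write_metadata(rows, csv_path):
    """Write one CSV row per article, with a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Calling split_issue on each of the 60 issue files, collecting the rows into one list, and passing that list to write_metadata at the end would give the single big CSV. Two articles with the same year, issue, author, and title would overwrite each other in this sketch, so a real script would need some tie-breaker in the filename.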
Digital tools/techniques used
- Python libraries:
- Write out to text files
- Write out to a CSV file
It's not as simple as it might sound to start working with a "machine readable" corpus like NINJAL's Taiyō Corpus. You have to make some decisions: I had to remove a lot of linguistic information that NINJAL had painstakingly entered, and the corpus will no longer work with the analysis software included on the DVD, called Himawari, without the markup. Wouldn't it have been interesting to keep track of quoted speech and printing "errors"? But for my purposes, I'm focused on the "content" (as it were) of a large corpus without caring too much about small linguistic details.
I'm saving multiple versions of my files, including the original Shift-JIS XML files, UTF-8 XML files of whole issues, UTF-8 text files of individual articles NOT tokenized, and finally UTF-8 text files of individual articles tokenized by word.
When I started my initial journey with this corpus, I made some stupid mistakes in parsing it into articles, and didn't save the files I'd created along the way in the intermediate steps. So I had to start all over again from scratch with converting to Unicode, then parsing into articles, and tokenizing. If I had "saved my work" -- i.e., versioned files -- I would have saved myself a lot of work in the re-do of my process. Oh well. At least I've got the files now, and also (miraculously) saved the scripts I created for all of these steps, so I can replicate it and share them in the future.