Read thousands of primary sources with programs you write yourself

A colleague at Algoma University asked me recently if I found my basic training in digital history to be helpful. In my typically long-winded fashion, I sent a huge email back to him enthusiastically describing all the exciting projects I embarked on and the tools I used to complete them. But if I had to choose just one thing to be most proud of, it might be something that I did quietly in the background: writing my own distant reading program. And you might be surprised to learn that, with a few basic concepts under your belt, you could do the same thing for your next research project.

How does distant reading work? It uses programs to analyze texts (this is why distant reading is also called textual analysis or text mining) and give you information about them. This information can help you determine what the text is about, when it was written, and whether or not it would be useful for you as a source. If you’re looking at language statistics, you can also find things like frequencies of words or phrases, and the relationships between them. But where things get really interesting is when you analyze a lot of texts together. Using a computer can reveal patterns, relationships, and omissions in large bodies of work that you wouldn’t have the ability or time to spot otherwise. Many historians are now looking to so-called “big data” to make new and exciting discoveries. But don’t just take my word for it – if you’re not convinced of the potential of distant reading yet, let Adam Crymble’s two-minute thesis change your mind.

My approach

What I put together is a lot simpler than Crymble’s work, but the same concept applies: tech can illuminate text. I used distant reading to build research momentum when I was involved in curating a new exhibit.

The exhibit, now installed at Fanshawe Pioneer Village, uses a historic home to replicate the office, pantry, and parlour of Dr. William Anson Jones, who practiced in Clandeboye, Ontario, in the early 1900s. The research process required my colleagues and me to learn everything we could about rural doctors and their practices at the time. I was especially interested in the process of making medicine, the tools of the trade, the professionalization of doctors, the regulation of medicine (or lack thereof!), and the social attitudes around medicine.

My approach was to download periodicals that Dr. Jones would have been likely to subscribe to, then run a program to identify frequently occurring key words and display them in a concordance. This allowed me to notice trends in the body of text without reading through thousands upon thousands of pages.

Want to know how I did it? Read on…and try it for yourself! You’d be amazed at what you can do.

Step 1: Getting your database together

The key to distant reading is the database. Online databases are where you’ll most likely find your starting material. When you gather all the sources you want to analyze, you’re creating a personalized database for your programs to work with. The more comprehensive the database is, the more information your program will sift through.

For the Dr. Jones project, I downloaded a few key years’ worth of issues from four different periodicals on Early Canadiana Online. I found them through a routine search. (But did you know that you can also write programs to find and download sources for you? Dr. William J. Turkel has some great how-tos you can follow to do just that.)

Step 2: Write out what you need to do

Ok, now you have a bunch of files sitting on your computer, waiting to be analyzed. What do you want to do with them? Writing a list is an easy way to get your steps organized before you attempt to write code. For my project, the list might look something like this.

For every pdf file in the folder “Jones/Canadian-Druggist-1900”:

  • Turn the pdf into a txt file. [This is the file type your program will want to work with.]
  • Find every unique word and count its frequency. [Medical words that occur frequently are ones I can flag to pursue further.]
  • For each word in a list of flagged key words, create a concordance that shows every instance of the word with 100 characters before and after it. [This helps me understand how those important words occur in context.]
  • Display progress updates throughout the process so I know what the program is doing.

Step 3: Turn your to do list into a (bash) script

Here’s where the rubber meets the road.

There are a lot of different methods you could use to turn your list into a working program. I was taught basic bash scripting, so that’s what I’m using here. A bash script is a small program that can be read by Linux and other UNIX-based operating systems. What you need to know is how it works.

All operating systems already know how to run a large set of commands to do various tasks. For example, if you type “ls” in the Linux command line and press Enter, the computer will list the contents of the current folder (or directory).

Simple, right? Many commands can be given arguments – “list the contents of the directory Jones/Canadian-Druggist-1900” (ls ./Jones/Canadian-Druggist-1900) – and options – “list the contents of the directory Jones/Canadian-Druggist-1900 sorted by file size” (ls -S ./Jones/Canadian-Druggist-1900). The first step to being able to write a program is knowing some basic commands.
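If you’d like to try these out before moving on, here is a small sketch you can run safely anywhere. (The folder name “practice” and its files are invented for this demo; the real directories from my project are just examples in the paragraph above.)

```shell
# Make a scratch folder with a couple of files to practice on
mkdir -p practice
touch practice/a.pdf practice/b.pdf

# A bare command: list the contents of the current directory
ls

# A command with an argument: list the contents of a specific directory
ls practice

# A command with an option and an argument: the same listing, sorted by file size
ls -S practice

# Options can be combined: long format with human-readable file sizes
ls -lh practice
```

Playing with small variations like these is the fastest way to get comfortable reading the scripts below.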

When you want to put together a process that involves a long series of commands, the easiest thing to do is write a script which explains the process to the computer without you having to type each command individually every time. That’s the beauty of programming.

Back to our to-do list. The first requirement is to turn our pdfs into text files. Linux has a command for that, conveniently named “pdftotext.” I want the new text file to have the same name as the old pdf, just with .txt at the end. I want the computer to say “Processing [name of my file]” so I always know what it’s up to. And to keep my research organized, I want to move all the text files into their own subfolder called “TextFiles.” Here’s what it looks like written out as a bash script. I’ve highlighted all the commands. I also used a for loop to tell the computer to run the program until the commands have been applied to all the files in the folder.

#!/bin/bash
#The line above (called a "shebang") lets the computer know you're using bash
#Every line that starts with # is ignored (commented out)

#Turn pdfs into txt
mkdir ./TextFiles/ ;

for file in ./*.pdf ;
   do pdftotext "$file" "${file/%.pdf/.txt}" ;
   mv "${file/%.pdf/.txt}" ./TextFiles/ ;
   echo Processing "$file" ;
done

Step 4: Run with it!

Great! We have a text file that lists a bunch of tasks in a language the computer understands. Now we want to test it.

Save your work with the extension “.sh”. On the Linux command line, you can turn your text into an executable file by entering “chmod 744” followed by your script’s file name. (This gives you permission to read, write, and execute, while anyone other than the owner can only read the file. You can change the number to set different permissions.)
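Put together, the sequence looks like this. (The file name “jones.sh” is just an invented example, and the one-line script here is a stand-in for the real script you wrote above.)

```shell
# A tiny stand-in script for this demo (in real use, this is the script you saved)
printf '#!/bin/bash\necho "Processing files..."\n' > jones.sh

# Make it executable: owner can read, write, and execute; everyone else can only read
chmod 744 jones.sh

# Run it from the current directory, just like any other command
./jones.sh
```

The `./` prefix tells the computer to look for the program in the current folder rather than in its usual list of command locations.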

Now we can input the name of our program and run it just like any other command. It’s a good idea to test things every so often – or even split large steps into separate programs – so that if something goes wrong, you can identify the problem easily.
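One handy trick for that kind of testing (an extra tip, not part of my original workflow) is bash’s -x flag, which prints every command as it runs so you can watch the script work step by step. Again, “demo.sh” is an invented stand-in for your own script:

```shell
# A tiny script to trace (stand-in for your own)
printf '#!/bin/bash\necho "step one"\n' > demo.sh

# The -x flag makes bash print each command before executing it,
# which makes it much easier to see where a script goes wrong
bash -x demo.sh
```

The traced commands are printed with a leading “+” so you can tell them apart from the script’s normal output.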

Step 5: Adding more steps

That’s all good and fine, but we still haven’t done any distant reading yet. Here’s my entire program to fulfill the list. Remember, we’re trying to:

  • Turn pdfs into text files;
  • Find every unique word and count its frequency;
  • For each word in a list of flagged key words, create a concordance that shows every instance of the word with 100 characters on either side;
  • Display updates throughout the process.

#Turn pdfs into text
mkdir ./TextFiles/ ;

for file in ./*.pdf ;
   do pdftotext "$file" "${file/%.pdf/.txt}" ;
   mv "${file/%.pdf/.txt}" ./TextFiles/ ;
   echo Processing "$file" ;
done

#Count word frequencies in the text files 
mkdir ./TextFiles/WordFreqs/ ; 

for file in ./TextFiles/*.txt ;
   do tr ' ' '\n' < "$file" | sort | uniq -c | sort -nr > "${file/%.txt/-wordfreqs.txt}" ;
   mv "${file/%.txt/-wordfreqs.txt}" ./TextFiles/WordFreqs/ ;
   echo Sorting "$file" ;
done

#Create combined text and word frequency files 
echo Combining Files ; 
cat ./TextFiles/*.txt > all.txt ; 
tr ' ' '\n' < all.txt | sort | uniq -c | sort -nr > allfreqs.txt ; 

mv all.txt ./TextFiles/ ; 
mv allfreqs.txt ./TextFiles/WordFreqs/ ; 

#Create a concordance with 100 characters on either side 
echo Assembling Concordance ; 
ptx -f -w 100 ./TextFiles/all.txt > allptx.txt ; 
mv allptx.txt ./TextFiles/ 

#Look up key words in the concordance from a document called "wordlist.txt" 
n=$(wc -w < ./wordlist.txt)

for i in $(seq 1 $n) ;
   do if [ -s ./wordlist.txt ]
      then word=$(head -n1 ./wordlist.txt) ;
      echo "Finding $word" ;
      sed -i '1d' ./wordlist.txt ;
      echo $word >> ./worddone.txt ;
      egrep -i "[[:alpha:]] ${word}" ./TextFiles/allptx.txt > ./TextFiles/${word}find.txt ;
   fi ;
done

#Move results into a folder called "Find" 
mkdir ./TextFiles/Find ; 
mv ./TextFiles/*find.txt ./TextFiles/Find
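The heart of the frequency-counting step is the tr | sort | uniq pipeline, and you can try it on a throwaway sample to see exactly what it produces. (The file “sample.txt” and its contents are invented for illustration.)

```shell
# Create a tiny sample text
printf 'the doctor saw the patient\nthe patient recovered\n' > sample.txt

# Put one word on each line, sort so duplicates are adjacent,
# count each unique word, then sort numerically with the most frequent first --
# the same pipeline the script runs on every periodical
tr ' ' '\n' < sample.txt | sort | uniq -c | sort -nr
```

The most frequent word appears at the top of the output with its count, which is what lets you skim a year’s worth of issues for themes in a few seconds.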

With this program I was able to have my computer pull out some useful information from four early-20th-century periodicals. Had I been better at programming, I could have done a lot more. But I did discover that abdominal surgeries and cancers were of particular interest to medical professionals. I found out that plastic capsules were beginning to be used for pill making, which was a question my teammates and I had been unsure about. I learned about the types of narcotics used by doctors, and I also noticed a lack of writing about doctors in rural areas. The findings from running this program helped me determine directions for future efforts, which involved more traditional research in books and archives.

In this overview I won’t describe every line of the program in detail, since there are plenty of other websites devoted to teaching you how the commands work step by step (see the next section). But I will say that all this code essentially just breaks down our to-do list into small tasks that the computer can accomplish. You can use the same concepts to write your own programs in many different programming languages. The more commands you know, the more work you can tell your computer to do. Plus, it’s fun! I love the challenge of translating my ideas to code and learning something new.

What if I don’t want to write my own program?

That’s ok. There are plenty of talented people who are experts at text mining and have provided their tools for us to use. Voyant Tools is a great place to start.

But if you change your mind…these places have lots of beginner tutorials including history-specific ones.

Text mining: the research of the future?

In this post I’ve provided one small example of how historians can make computers work for them. A little bit of programming can bring new bodies of work within reach. That being said, distant reading is by no means a substitute for more orthodox methods of research. The best approach, in my opinion, is to use a wide variety of methods to get the most results, which is what I did for the Dr. Jones exhibit.

Textual analysis shines in scenarios where you have unwieldy amounts of information to deal with, and you need some way to focus your efforts. There is a danger of using the numbers to make claims they can’t support. But for the adventurous historian, writing a distant reading program is a useful addition to the research toolbox.

