The Travels of Sir John Mandeville, Knight is an anonymous medieval travel document. It follows the progress of one Sir John Mandeville as he explores the world, encountering exotic places, people, and monsters along the way. I have decided to use this text to practice the techniques I’m learning on the command line. I will keep posting as I try new things.
You may think you know all about Mandeville’s journeys, but I guarantee you’ve never heard about his recent encounters with the mysterious realm of Linux!
Part 1: Humble Beginnings
The first thing I did was download the plain text file from the Internet Archive using wget. I used mkdir to create a directory called “experiments,” and within that, a directory called “Mandeville.” I renamed my text file Mandeville_Travels.txt.
I trimmed off the header and footer. I then used cat to start a pipeline that removed punctuation and carriage returns, translated spaces to newlines, converted capitals to lowercase letters, sorted the words into a list, and removed duplicates. I used the tee command to save the output to Mandeville_Travels-uncountedlist.txt; then I ran uniq with the -c option and saved the output to Mandeville_Travels-countedlist.txt. I now had two lists: one of unique words, the other of unique words along with how many times each word occurs.
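A pipeline of this kind might look like the sketch below. It is not a record of the exact commands used: the two-line sample and the sample-* file names are made up so the example runs anywhere, and the real input would be the trimmed Mandeville_Travels.txt. Note that the counting step reads the sorted list before duplicates are removed, since uniq -c on an already de-duplicated list would report 1 for every word.

```shell
# Tiny stand-in for the trimmed text (hypothetical sample, not the real file).
printf 'The Queen of Amazonia,\r\nand the queen of the isle.\r\n' > sample_clean.txt

# Strip carriage returns and punctuation, put one word on each line,
# lowercase everything, sort, and keep a copy of the sorted list with tee
# before uniq removes the duplicates.
cat sample_clean.txt |
  tr -d '\r' |
  tr -d '[:punct:]' |
  tr -s ' ' '\n' |
  tr '[:upper:]' '[:lower:]' |
  sort |
  tee sample-sorted.txt |
  uniq > sample-uncountedlist.txt

# Count occurrences from the sorted list, duplicates still included.
uniq -c sample-sorted.txt > sample-countedlist.txt
```

Here sample-uncountedlist.txt holds each distinct word once, and sample-countedlist.txt holds the same words with a count in front of each.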
After this, I made a list of words that don’t occur in the installed dictionary, using /usr/share/dict/american-english, the uncounted list, and fgrep. By doing this, I hoped to end up with a list that contains more of the words I’m really interested in – place names, people, and monsters. Finally, I created a permuted term index using the “clean” file I made in the very first step.
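One way to do both steps is sketched here, under some assumptions: the sample-* files are small stand-ins invented for illustration, grep -F is used as the equivalent of fgrep, and GNU ptx is assumed to be the tool behind the permuted index (the step is skipped if ptx is not installed).

```shell
# Stand-ins for the unique-word list, the dictionary, and the clean text.
# On a real system the dictionary would be /usr/share/dict/american-english.
printf 'amazonia\nand\nisle\nprester\nqueen\nthe\n' > sample-uncountedlist.txt
printf 'And\nisle\nqueen\nthe\n' > sample-dict.txt
printf 'The Queen of Amazonia\nand the queen of the isle\n' > sample_clean.txt

# -F: treat dictionary entries as fixed strings, not patterns
# -x: match whole lines only, so "the" does not knock out "prester"
# -i: ignore case, since dictionaries capitalise some entries
# -v: keep the lines that did NOT match anything in the dictionary
# -f: read the entries from a file
grep -F -x -i -v -f sample-dict.txt sample-uncountedlist.txt > sample-nondictlist.txt

# Build a permuted (keyword-in-context) index from the clean text.
if command -v ptx >/dev/null 2>&1; then
  ptx sample_clean.txt > sample-ptx.txt
fi
```

With the sample data, sample-nondictlist.txt is left with the two words missing from the stand-in dictionary.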
I picked a word that I wanted to see in context – “Amazonia.” I used egrep -i to search the permuted term index. Voila! It is mentioned seven times, including four times within an explanation of its location and twice in conjunction with its Queen. I then searched for “angel” – the search picked up angel, angels, and the Latin angelus. The list was too long to fit on the screen, and I wasn’t sure how to scroll up to see the uppermost entries, since the arrow keys only brought up commands from the history. That’s a problem I will need to learn how to fix.
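A search like this can be sketched as follows. The four-line sample-index.txt is an invented stand-in for the permuted index, and grep -E is used as the equivalent of egrep. Piping the results into a pager such as less is one common way around the scrolling problem; it is shown only as a comment because it is interactive.

```shell
# Invented stand-in for a few lines of the permuted term index.
printf 'an angel came\nthe angels sang\nAngelus Domini\nthe Queen of Amazonia\n' > sample-index.txt

# Case-insensitive search; matches angel, angels, and Angelus.
grep -E -i 'angel' sample-index.txt > angel-hits.txt

# Report how many lines matched instead of printing them all.
grep -E -i -c 'angel' sample-index.txt

# To page through long results interactively (q quits):
#   grep -E -i 'angel' sample-index.txt | less
```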
So now I have a lot of lists. They only get me so far. I don’t need to create a word list to figure out what kind of text this is; I already know that. However, it would be interesting to know what places, people, and creatures are mentioned in this particular travel document, as well as the contexts in which these words are found. I can search my numbered list of unique words and my non-dictionary-compliant list to find things that I want to grep in my concordance. But combing through these lists to find things to look up in the concordance is itself a slow and tedious task. I wonder if I can make a list of words that appear more than X times and aren’t in the dictionary. And I’d like to know how to automate this process. Can I write a script that will download files, process them and create these lists automatically? And what do I do with the results?
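As a starting point for the first of those questions, here is a rough sketch of filtering the counted list by frequency and then by the dictionary. Everything in it is an assumption made for illustration: the words and counts in the sample files are invented, the threshold variable min stands in for X, and awk is just one tool that could do the numeric comparison.

```shell
# Invented stand-ins for the counted list (uniq -c style) and the dictionary.
printf '      4 amazonia\n      1 prester\n      9 the\n      3 ynde\n' > sample-countedlist.txt
printf 'the\n' > sample-dict.txt

# Keep words whose count is greater than this threshold (the "X" above).
min=2

# awk prints the word (field 2) when the count (field 1) exceeds the threshold;
# grep then drops any of those words that appear in the dictionary.
awk -v min="$min" '$1 > min { print $2 }' sample-countedlist.txt |
  grep -F -x -i -v -f sample-dict.txt > sample-frequent-nondict.txt
```

Saving lines like these in a file and running it with sh would be a first step toward the automated script.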