This is getting awk-ward

Every week in our Digital Research Methods class, I’m more impressed by the skills we are learning. The little commands we started with at the beginning of the course are now starting to come together in ways that are more immediately useful within a typical research workflow. This week’s lesson covered the basics of manipulating .csv files, including some use of the programming language awk.

Writing executable files is one of the most exciting things that I have learned how to do in this class. It’s great to be able to save frequently used strings of commands as programs that can be applied from one project to the next. This makes me want to return to The Programming Historian to learn about other programming languages. However, while I have noticed that different languages offer different capabilities, it is also becoming clear that it won’t do to be a “jack-of-all-trades, master-of-none.” I am having little difficulty following the lessons each week, but I find that the small differences between the lessons and other scenarios can easily trip me up. This became apparent as I tried to solve the challenge at the end of this week’s lesson: to use awk to write a program that would only print records from our .csv file that were created before 1920. I started by using our list of field numbers to identify the donor date as being located in field 23. I then created the following file in vi. (Actually, this is only the latest of several versions.)

#! /usr/bin/awk -f
BEGIN {
    FS="@"
}
$34 ~ /FOOD/ && ($23 < 1920) { foodsum++ }
$34 ~ /MATERIALS/ && ($23 < 1920) { matsum++ }
$34 ~ /MEDICINES/ && ($23 < 1920) { medsum++ }
$34 ~ /POISONS/ && ($23 < 1920) { poisonsum++ }
END {
    print "Food ", foodsum
    print "Materials ", matsum
    print "Medicines ", medsum
    print "Poisons ", poisonsum
}

My first use of this didn’t work, since my results were the same as if I had never added the column 23 conditions.

I believe I may have found the source of the problem as I searched through the search-results-fixed-noheader.csv file. I discovered that one of the fields seemed to include dates preceded by other numbers in sequences like this: 14.1951. This appears to correspond with the donor date. With dates appearing in this fashion, it’s no wonder that they all appeared to awk as being less than 1920: read as a number, 14.1951 is just a little over 14. This creates at least two new puzzles: a) how do I extract only the last four digits from the date field, and b) how do I get these results into awk? I know that with a command like grep or egrep, I can search for a regular expression that matches the strange dates (something like “^[[:digit:]]*\.[[:digit:]]{4}$” maybe?). Now I need to find a way to either display only the final four digits, or to move them so they occur at the beginning of the field rather than at the end, and return the results to the original .csv. I feel like I can probably figure this out, given some experimenting with commands like sort and join, although I haven’t had time to try this today. What I did instead was manually remove the extra date digits in the first several records with vi. This gave some strange results when I ran my “pre1920-sample” program. My count of “Medicines” records dropped by 1, but most of the pre-1920 records I came across appeared as Materials, and that count didn’t change. I know I didn’t mess up the Medicines records with my manual edits, because the original program (without the field 23 additions) brings the Medicines count back up. Clearly my program is doing something, but what is it doing?
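For puzzle (a), one possible route is to stay inside awk rather than pre-processing the file. The block below is only a minimal sketch, not a tested fix for the class data set: it assumes the date field always ends in a four-digit year, optionally preceded by other digits and a period (as in 14.1951). The three sample records, and the variable names datecol, catcol, parts and year, are made up for illustration. The sample puts the date in field 2 and the category in field 3, whereas the real file would need datecol=23 and catcol=34.

```shell
#!/bin/sh
# Hypothetical three-field sample records, separated by "@" like the class file.
sample='a@14.1951@FOOD
b@7.1912@MEDICINES
c@1899@MATERIALS'

result=$(printf '%s\n' "$sample" | awk -v datecol=2 -v catcol=3 '
BEGIN { FS = "@" }
{
    # Split the date field on a literal "." and keep the last piece,
    # so "14.1951" yields "1951" and a plain "1899" stays "1899".
    n = split($datecol, parts, "[.]")
    year = parts[n] + 0    # adding 0 forces a numeric comparison
    # Skip empty or non-numeric dates (year == 0), keep anything before 1920.
    if (year > 0 && year < 1920) print year, $catcol
}')

printf '%s\n' "$result"
```

With this sample, only the 1912 and 1899 records are printed. The same split() and year test could replace the ($23 < 1920) condition in a counting program, but whether every date in the real file follows this pattern is something that would need checking first.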

If I end up overcoming my technical ineptitude and solving the puzzle, I will write a follow-up post about it. I have the feeling that I’m making things much more complicated than they need to be. This is only going to improve if I keep practicing, so I intend to do as much of that as I can even after this semester is over.

Although my attempts to use these lessons in “real life” haven’t always worked perfectly, I am happy to say that I know a lot more now than I did at the beginning of September. I’m especially looking forward to learning how to create spiders, and I hope that this is something I can transfer from lessons to application without too many hiccups in between. Despite my awk-wardness, I remain optimistic!
