Aside from my little experiments with medieval travel writers, I’ve recently been given an assignment with which I can put my new command line skills to good use. I’m currently running into a few obstacles, but hopefully with practice (and a little advice?) I can find a way to overcome these.
The goal is to download pictures from an online archive to use in a geospatial history project. I have found four different fonds/series that I want to retrieve pictures from. I initially hoped to download these files using the same methods I learned here and here.
The organization of the particular website I am using makes batch downloading tricky. Rather than being organized into directories and sub-directories, this Drupal-based website’s pages are identified as nodes. For example, the photo series of the Benna Fuller fonds is found at the URL http://windjammer.algomau.ca/main/node/20947. But the first photograph in the series, rather than being listed as main/node/20947/identifier_of_picture, is main/node/19689 – another arbitrary node. The second problem is that the URLs for the photos in this series are not sequential: the photos aren’t listed as nodes 19689 to 20056; instead, each picture has an arbitrary number. The final problem is that, while the initial page for the Benna Fuller photo series shows a long list of links to photos, not all of these links actually lead to a usable image. The links bring you to a record page for each photo, and only some of those records have a jpeg attached to them.
Here is what I am trying to do:
- Direct wget to the Benna Fuller photo series (and the three other series I want to download from).
- Tell wget to look through each link on the list of photos found at the start page of each photo series.
- If any of these links go to a page with an attached jpeg, download the jpeg (and associated metadata so I know where it came from and what it is).
- Save all of the results to a folder shared between my virtual machine and my laptop so I can use the images for my upcoming project.
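As a rough sketch of how the steps above might fit together, here is one possible shell script. This is only a guess at a workflow, not a working solution: the grep patterns for node links and jpeg URLs are assumptions about the site’s markup, and the `fetch_series` call is left commented out so nothing hits the archive until the patterns have been checked against the real HTML.

```shell
#!/bin/sh
# Sketch of the four-step plan. The grep patterns below are guesses
# about the site's markup; inspect the saved HTML and adjust them.

BASE="http://windjammer.algomau.ca"

# Extract unique node paths from a series page's HTML (stdin -> stdout).
extract_nodes() {
    grep -o '/main/node/[0-9][0-9]*' | sort -u
}

# Extract the first jpeg URL from a record page's HTML, if any.
extract_jpeg() {
    grep -o 'http[^"]*\.jpg' | head -n 1
}

# Download the series page, visit each node it links to, and save
# only those records that actually have a jpeg attached.
fetch_series() {
    series_url="$1"
    mkdir -p photos
    wget -q -O - "$series_url" | extract_nodes | while read -r node; do
        page=$(wget -q -O - "$BASE$node")
        jpg=$(printf '%s\n' "$page" | extract_jpeg)
        if [ -n "$jpg" ]; then
            wget -q -P photos "$jpg"                               # the image
            printf '%s\n' "$page" > "photos/$(basename "$node").html"  # metadata
        fi
    done
}

# Uncomment to run against the live site once the patterns are verified:
# fetch_series "$BASE/main/node/20947"
```

Saving each record page alongside its image keeps the metadata the plan asks for; the node number in the filename ties the two together.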
I decided to start by downloading only the page main/node/20947. I saved the output to a file called Benna_Fuller. I used the less command to read the results. After browsing through a few pages of HTML I found the list of photo records I was looking for. I hoped that by trimming away the extraneous text and HTML tags from this section, I would be left with a list of nodes to send to wget. After that, I planned to see if I could identify the pages that had attached images, and make a final list of images for wget to download. However, the Benna_Fuller file’s text doesn’t wrap around to fit my screen, so I was unable to see the tail ends of my links. No links means no lists for wget and no images for me – unless I want to click on each of the 300+ links in the Benna Fuller photo series and see if each individual one leads to a useful photo. Clearly automation is a good option for this project, but I’m not sure how to get it to work the way I want.
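One workaround for the unreadable tail ends, if I am right in assuming the downloaded HTML arrives as a few very long lines, is to break the markup onto one tag per line before paging through it. The snippet below builds a small stand-in file to demonstrate; with the real download the command would be something like `awk '{gsub(/></, ">\n<"); print}' Benna_Fuller | less`.

```shell
#!/bin/sh
# Stand-in for the Benna_Fuller file: one long line of HTML.
printf '<html><body><a href="/main/node/19689">Photo</a></body></html>\n' > sample.html

# Insert a newline wherever one tag butts up against the next,
# so each tag (and each link) lands on its own line.
awk '{gsub(/></, ">\n<"); print}' sample.html > sample_readable.html

# Now every link is fully visible when paging:
# less sample_readable.html
```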
At the risk of getting an “RTFM” response: is there a way to make sure the output of my wget command displays in a way that is fully readable on my command line screen? And is there a better way than my tiered wget process to get around the problem of this website’s nodal organization?