19.1.12

Acquiring Another Dataset

This week I focused on growing my dataset of images.
I looked online for existing databases of music images and found a few collections, mostly of historic music.  The main problem with these collections is that they are built for browsing, so downloading in bulk is difficult. I emailed the people in charge of the Digital Scores and Libraries over at Harvard to ask about the best way to download their collection, but I have not received a reply yet.

I ended up downloading a collection of public domain music from the Cantorion collection.  I was lucky enough to find an open directory listing with a large number of PDFs from this collection.  I then wrote a simple web scraper in Python that went through the files in the web directory and downloaded all the PDF files to my hard drive.  This yielded a total of 681 PDF files, which (after splitting into individual images using ImageMagick) should be an ample dataset for now.
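
A scraper of this kind can be sketched roughly as follows. This is an illustrative sketch rather than the exact script: it assumes the directory listing is a plain HTML page of links, and the function names, the output folders, and the 150 dpi density are all hypothetical choices. The splitting step assumes ImageMagick's `convert` command is on the PATH (newer installs may call it `magick`) and that Ghostscript is available for reading PDFs.

```python
import subprocess
import urllib.request
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_pdf_links(listing_html, base_url):
    """Return absolute URLs of all links in the listing that end in .pdf."""
    parser = LinkCollector()
    parser.feed(listing_html)
    links = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).path.lower().endswith(".pdf"):
            links.append(absolute)
    return links


def download_pdfs(base_url, out_dir):
    """Fetch the directory listing and save every PDF it links to."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(base_url) as response:
        listing_html = response.read().decode("utf-8", errors="replace")
    saved = []
    for url in extract_pdf_links(listing_html, base_url):
        target = out / Path(unquote(urlparse(url).path)).name
        if not target.exists():  # skip files already downloaded
            urllib.request.urlretrieve(url, str(target))
        saved.append(target)
    return saved


def split_command(pdf_path, image_dir, density=150):
    """Build an ImageMagick command that writes one PNG per PDF page."""
    pattern = str(Path(image_dir) / f"{Path(pdf_path).stem}-%03d.png")
    return ["convert", "-density", str(density), str(pdf_path), pattern]


def split_pdf(pdf_path, image_dir):
    """Run ImageMagick on one PDF (assumes `convert` is installed)."""
    Path(image_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(split_command(pdf_path, image_dir), check=True)
```

With the listing URL in hand, calling `download_pdfs(listing_url, "pdfs")` and then `split_pdf(path, "images")` for each returned path would produce the page images. Filtering on the URL path rather than the raw link text keeps the parent-directory link and any non-PDF files out of the download list.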

1 comment:

  1. Remember to check if the paper from Portugal has a public dataset you can use.
