UCSD CSE 190: Projects in Vision and Learning: Acquiring Another Dataset

This week I focused on growing my dataset of images.
I looked around online for existing databases of music images and found a few collections of mostly historic music. The main problem with these collections is that they are set up for easy browsing, which makes bulk downloading difficult. I emailed the people in charge of the Digital Scores and Libraries at Harvard to ask about the best way to download their collection, but I have not received a reply yet.

I ended up downloading a collection of public domain music from Cantorion. I was lucky enough to find an open directory listing with a large number of PDFs from this collection. I then wrote a simple web scraper in Python that went through the files in the web directory and downloaded all the PDF files to my hard drive. This yielded a total of 681 PDF files, which (after splitting into individual images using ImageMagick) should be an amply sized dataset for now.
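For anyone who wants to try something similar, here is a rough sketch of what a scraper like this can look like. It is not my exact script: the function names, the 150 DPI setting, and the output file naming are all made up for illustration, and it assumes the directory listing is a plain HTML page of links and that ImageMagick's `convert` command is on the path.

```python
import os
import subprocess
import urllib.request
from html.parser import HTMLParser
from urllib.parse import unquote, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def pdf_links(listing_html, base_url):
    """Return absolute URLs for every link in the listing that ends in .pdf."""
    parser = LinkParser()
    parser.feed(listing_html)
    return [urljoin(base_url, href) for href in parser.hrefs
            if urlparse(href).path.lower().endswith(".pdf")]


def download_all(base_url, out_dir):
    """Fetch the directory listing and save each linked PDF into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    with urllib.request.urlopen(base_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    for url in pdf_links(html, base_url):
        name = os.path.basename(unquote(urlparse(url).path))
        target = os.path.join(out_dir, name)
        if not os.path.exists(target):  # skip files already downloaded
            urllib.request.urlretrieve(url, target)


def split_pdf(pdf_path, out_dir, dpi=150):
    """Rasterize each page of a PDF to a PNG with ImageMagick's convert.

    The dpi default and the "-%03d.png" naming are illustrative choices.
    """
    os.makedirs(out_dir, exist_ok=True)
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    pattern = os.path.join(out_dir, stem + "-%03d.png")
    subprocess.run(["convert", "-density", str(dpi), pdf_path, pattern],
                   check=True)
```

To use it, call something like `download_all("http://example.com/scores/", "pdfs")` (a placeholder URL, not the real listing) and then run `split_pdf` on each file in `pdfs`. Keeping the link extraction in its own function makes it easy to check against a saved copy of the listing before hitting the server.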

19.1.12
