This week I focused on growing my dataset of images.
I looked around online for existing databases of music images and found a few collections, mostly of historic music. The main problem with these collections is that they are set up for easy browsing, which makes downloading in bulk difficult. I emailed the people in charge of the Digital Scores and Libraries over at Harvard to ask about the best way to download their collection, but I have not received a reply yet.
I ended up downloading a collection of public domain music from the Cantorion collection. I was lucky enough to find an open directory listing with a large number of PDFs from this collection. I then wrote a simple web scraper in Python that went through the files in the web directory and downloaded all the PDF files to my hard drive. This yielded a total of 681 PDF files, which (after splitting into individual page images using ImageMagick) should be an ample dataset for now.
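For anyone curious, here is a rough sketch of what a scraper like this can look like. It is not my exact script: the function names, the output paths, and the 300 DPI setting are assumptions made for illustration, and it assumes the directory listing is a plain HTML page of `<a href>` links. The last helper only builds the ImageMagick `convert` command for splitting a PDF into one PNG per page (on ImageMagick 7 the binary is `magick` instead).

```python
import urllib.request
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a directory-listing page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_pdf_links(html, base_url):
    """Return absolute URLs of all links on the page that end in .pdf."""
    parser = LinkCollector()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        url = urljoin(base_url, href)  # resolve relative links
        if urlparse(url).path.lower().endswith(".pdf"):
            links.append(url)
    return links


def download_pdfs(listing_url, out_dir):
    """Download every PDF linked from listing_url into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(listing_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    saved = []
    for url in extract_pdf_links(html, listing_url):
        target = out / Path(unquote(urlparse(url).path)).name
        if not target.exists():  # skip files from an earlier run
            urllib.request.urlretrieve(url, str(target))
        saved.append(target)
    return saved


def split_command(pdf_path, out_dir, density=300):
    """Build an ImageMagick command that writes one PNG per PDF page.

    The density value and the output naming pattern are assumptions.
    """
    stem = Path(pdf_path).stem
    pattern = str(Path(out_dir) / f"{stem}-%03d.png")
    return ["convert", "-density", str(density), str(pdf_path), pattern]
```

To use it, call `download_pdfs(listing_url, "scores")` with the URL of the directory listing, then pass each result of `split_command(pdf, "pages")` to `subprocess.run`. Keeping the link extraction separate from the downloading makes it easy to check the parsing on a saved copy of the page before hitting the server.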
Remember to check if the paper from Portugal has a public dataset you can use.