23.1.12

Acquiring Another Dataset Pt. II

From the Portugal paper: "the test set adopted for the qualitative evaluation of the proposed method is the one presented in (Dalitz et al., 2008) and already described."

In his 2008 paper, Dalitz raises and answers some questions about building such a dataset:

"How do we measure the distance of a given segmentation from a perfect 'ground truth' segmentation, and how do we obtain the ground truthing data?"

"Even though the labeling of the ground-truth data could be done manually, this is very time consuming and has the disadvantage of an ad-hoc classification of dubious pixels belonging both to a staffline and a crossing symbol. Therefore, we generate our music images from postscript images created with music typesetting software, which allows for “perfect” staff removal."

Dalitz's dataset is available here (along with another, handwritten dataset).
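
To make the ground-truthing idea in the quote concrete for myself, here is a minimal sketch. It assumes the same score has been rendered twice as binary images (1 = black), once with staff lines and once without; that setup, the function names `staff_ground_truth` and `pixel_error_rate`, and the simple pixel error rate are my own illustration, not necessarily how Dalitz builds or scores his data.

```python
def staff_ground_truth(with_staff, without_staff):
    """Mark pixels that are black with staff lines but white without them.

    Assumption: both inputs are equally sized lists of rows of 0/1 values.
    Pixels shared by a staff line and a symbol stay black in both renderings,
    so this sketch counts them as symbol pixels, not staff pixels.
    """
    return [
        [1 if a == 1 and b == 0 else 0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(with_staff, without_staff)
    ]


def pixel_error_rate(predicted, truth):
    """Fraction of pixels on which the predicted and true staff masks disagree."""
    total = sum(len(row) for row in truth)
    wrong = sum(
        p != t
        for row_p, row_t in zip(predicted, truth)
        for p, t in zip(row_p, row_t)
    )
    return wrong / total


# Tiny example: a vertical stem crossing one horizontal staff line.
with_staff = [[0, 1, 0],
              [1, 1, 1],
              [0, 1, 0]]
without_staff = [[0, 1, 0],
                 [0, 1, 0],
                 [0, 1, 0]]
truth = staff_ground_truth(with_staff, without_staff)
```

A staff-removal algorithm's output mask can then be compared against `truth` with `pixel_error_rate`; a lower value means the output is closer to the "perfect" removal.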

19.1.12

Acquiring Another Dataset

This week I focused on growing my dataset of images.
I looked around online for existing databases of music images and found a few collections of mostly historic music. The main problem with these collections is that they are set up for easy browsing, but downloading them in bulk is difficult. I emailed the people in charge of the Digital Scores and Libraries over at Harvard to ask about the best way to download their collection, but I have not received a reply yet.

I ended up downloading a collection of public domain music from the Cantorion collection. I was lucky enough to find an open directory listing with a large number of PDFs from this collection. I then wrote a simple web scraper in Python that went through the files in the web directory and downloaded all the PDF files to my hard drive. This yielded a total of 681 PDF files, which (after splitting into individual images using ImageMagick) should be an amply sized dataset for now.
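
For reference, a scraper along these lines can be sketched with only the standard library. This is a reconstruction of the idea, not my original script: the function names, the output layout, and the 150 dpi density are illustrative assumptions, and the `convert` command assumes ImageMagick 6 with Ghostscript installed (ImageMagick 7 uses `magick` instead).

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urljoin, urlparse
from urllib.request import urlopen, urlretrieve


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def pdf_links(listing_html, base_url):
    """Return absolute URLs of all links in the listing whose path ends in .pdf."""
    parser = LinkCollector()
    parser.feed(listing_html)
    urls = [urljoin(base_url, href) for href in parser.hrefs]
    return [u for u in urls if urlparse(u).path.lower().endswith(".pdf")]


def download_all(base_url, out_dir):
    """Fetch the directory listing at base_url and save every PDF into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with urlopen(base_url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    for url in pdf_links(html, base_url):
        name = unquote(Path(urlparse(url).path).name)
        urlretrieve(url, str(out / name))


def split_command(pdf_path, out_dir, density=150):
    """Build an ImageMagick command that writes one PNG per PDF page."""
    stem = Path(pdf_path).stem
    pattern = str(Path(out_dir) / f"{stem}-%03d.png")
    return ["convert", "-density", str(density), str(pdf_path), pattern]
```

Calling `download_all` with the URL of the open directory and a target folder fetches the PDFs; each list returned by `split_command` can then be passed to `subprocess.run` to produce the page images.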