Gathering Data
Context
The metadata for a given song has two components: track-specific data and context data.
Track-specific data:
- track number (position of song in album)
- song title
- duration
Context data:
- contributing artists
- album
- release year
To generate complete metadata, we need sufficient data from both components. Track-specific data alone is too ambiguous to identify a song, because many songs share a title, track number, or duration. Context data alone identifies a release, not any single song on it.
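The two components can be pictured as a small data structure. The sketch below is illustrative only: the class and field names (TrackData, ContextData, SongMetadata, missing_fields) are assumptions made for this example, not names from the project.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TrackData:
    """Track-specific fields (hypothetical names); any of them may be missing."""
    track_number: Optional[int] = None
    title: Optional[str] = None
    duration_seconds: Optional[int] = None


@dataclass
class ContextData:
    """Fields shared by every track on the same release (hypothetical names)."""
    artists: List[str] = field(default_factory=list)
    album: Optional[str] = None
    release_year: Optional[int] = None


@dataclass
class SongMetadata:
    track: TrackData
    context: ContextData

    def missing_fields(self) -> List[str]:
        """Return the names of fields that are still None or an empty list."""
        missing = []
        for part in (self.track, self.context):
            for name, value in vars(part).items():
                if value is None or value == []:
                    missing.append(name)
        return missing
```

A song with only a title and an album would then report its track number, duration, artists, and release year as the gaps to fill.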
Discogs.com API
To fill in the missing metadata, I needed a data source to pull it from. I used the discogs.com API to query their user-built music database.
I used the Python discogs client to send HTTP requests to the Discogs REST API.
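As a rough sketch of what such a query might look like, the example below maps whatever metadata is already known onto Discogs search parameters and hands them to the client. It assumes the third-party discogs_client package and a personal user token. The helper names (build_search_params, search_releases) and the user-agent string are hypothetical and not taken from the project.

```python
from typing import Dict, Optional


def build_search_params(artist: Optional[str] = None,
                        album: Optional[str] = None,
                        track: Optional[str] = None,
                        year: Optional[int] = None) -> Dict[str, str]:
    """Map known metadata onto Discogs search parameters, skipping missing values."""
    candidates = {
        "artist": artist,
        "release_title": album,
        "track": track,
        "year": str(year) if year is not None else None,
    }
    params = {key: value for key, value in candidates.items() if value}
    params["type"] = "release"  # restrict the search to releases
    return params


def search_releases(params: Dict[str, str], user_token: str):
    """Send the query through the Discogs client (requires network access)."""
    # Imported here so build_search_params works without the package installed.
    import discogs_client

    # "MetadataFiller/0.1" is a placeholder user-agent string (assumption).
    client = discogs_client.Client("MetadataFiller/0.1", user_token=user_token)
    return client.search(**params)
```

Leaving unknown fields out of the query, rather than sending empty strings, keeps the search as broad as the available metadata allows.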
Data Source Selection
The API call returns several candidate data sources that may match the provided metadata. For some inputs, the first match was the correct (or best) one. For more ambiguous inputs, the correct data source was sometimes the second or third result.
I took a hybrid approach that includes the user in the selection of the data source. After receiving the response from the API, the algorithm presents the top 3 Discogs releases that best match the query and consults the user about which one to use.
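The selection step could be sketched roughly as follows. This is a minimal illustration, not the project's code: choose_release is a hypothetical name, candidates are represented as plain description strings, and the prompt wording is invented for the example.

```python
from typing import Callable, Optional, Sequence


def choose_release(candidates: Sequence[str],
                   ask: Callable[[str], str] = input,
                   limit: int = 3) -> Optional[str]:
    """Show the first `limit` candidates and return the one the user picks.

    Returns None if there are no candidates or the user answers 0.
    """
    top = list(candidates[:limit])
    if not top:
        return None
    for number, description in enumerate(top, start=1):
        print(f"{number}. {description}")
    valid_answers = {str(n) for n in range(len(top) + 1)}
    while True:
        answer = ask(f"Pick a release [1-{len(top)}], or 0 for none: ").strip()
        if answer in valid_answers:
            choice = int(answer)
            return top[choice - 1] if choice else None
        print("Please enter one of the listed numbers.")
```

Passing the prompt function as a parameter (defaulting to input) makes the step easy to exercise without a real user at the keyboard.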
The goal would be to consult the user only when the algorithm is unsure that the top data source matches the query. This would involve measuring the uncertainty in the returned results.
For most inputs, the algorithm was capable of finding the correct data source on its own. However, I never implemented the uncertainty measure, so the user was always consulted.
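One way such an uncertainty measure might look is sketched below. It is purely hypothetical: it compares the query string with each candidate title using difflib from the standard library, and asks for confirmation when the top match scores low or barely beats its rivals. The function names and both thresholds are assumptions chosen for illustration, not tuned values.

```python
from difflib import SequenceMatcher
from typing import Sequence


def similarity(query: str, candidate: str) -> float:
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, query.lower(), candidate.lower()).ratio()


def needs_user_confirmation(query: str,
                            candidates: Sequence[str],
                            min_score: float = 0.8,
                            min_margin: float = 0.1) -> bool:
    """Decide whether the top candidate is too uncertain to accept silently.

    `candidates` is assumed to be in API order, with the top match first.
    """
    if not candidates:
        return True
    top_score = similarity(query, candidates[0])
    if top_score < min_score:
        return True  # the top match does not resemble the query enough
    # Best score among the remaining candidates (0.0 if there are none).
    rival = max((similarity(query, c) for c in candidates[1:]), default=0.0)
    return top_score - rival < min_margin  # too close to call
```

With a check like this, the user would be consulted only when it returns True, and the top result would be accepted otherwise.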