Identifying the Song
Uncertainty
Using the selected source of truth, the task is then to match the existing files to those in the data. We can use track-specific metadata (see context) to identify the individual song.
For example, if we know the track number, we can trivially match the track as we copy from the data source to the local file. For the song title or duration, however, the program must deal with uncertainty.
The song title may contain spelling mistakes in the existing metadata. As the Discogs database is crowdsourced, the source of truth may also contain mistakes.
An equality comparison could also fail on the song duration. A live recording of a song, for instance, would likely differ in length from the studio release.
Tangent into Fuzzy Logic
In researching solutions to this problem of uncertainty, I first encountered Fuzzy Logic. The topic captured my interest and I spent some time exploring the space of solutions to my problem.
Many of the algorithms I found were over-engineered for my application. Some methods took context or sentiment into consideration when evaluating text similarity. These were not suited to resolving uncertainty in song titles, which have to be compared in the abstract, without any surrounding context.
Chosen Approach
I instead chose a much more reductionist technique. Rather than using a complex model to check similarity, I used an edit-based algorithm.
Such an algorithm calculates the number of edits required to transform one string into another, where an edit is an insertion, a deletion or a substitution.
I chose the Levenshtein distance, which counts edits at the level of single characters.
This works well for the types of error likely to occur such as spelling mistakes, omitted characters etc.
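To make the idea concrete, here is a minimal sketch of the Levenshtein distance in Python. It is an illustration rather than the code used in the project. The function names levenshtein and closest_title are hypothetical, and lowercasing titles before comparison is an assumption of this sketch.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the shorter string in the inner loop
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or a free match)
            ))
        previous = current
    return previous[-1]


def closest_title(local_title: str, candidates: list[str]) -> str:
    """Hypothetical helper: pick the candidate with the smallest distance.

    Assumption: titles are compared case-insensitively.
    """
    return min(candidates, key=lambda c: levenshtein(local_title.lower(), c.lower()))
```

With this sketch, a misspelt local title such as "Bohemian Rapsody" is one insertion away from "Bohemian Rhapsody", so it would still be matched to the correct track on the album.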
Duration Uncertainty
Due to constraints on the project, I never implemented duration comparisons. A similar approach would measure the numerical distance between two durations to inform the decision.
This would expand the scope of the program, as it may allow songs from an album to be identified using only their duration.
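Since this part was never built, the following is only a sketch of how such a comparison might look. The function match_by_duration, the use of seconds as the unit and the five-second tolerance are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def match_by_duration(
    local_seconds: float,
    track_seconds: Sequence[float],
    tolerance: float = 5.0,  # assumed cut-off, not a value from the project
) -> Optional[int]:
    """Return the index of the track whose duration is closest to the local file.

    Returns None when there are no tracks, or when even the closest track
    differs by more than the tolerance.
    """
    if not track_seconds:
        return None
    # Numerical distance between two durations is simply the absolute difference.
    best_index = min(
        range(len(track_seconds)),
        key=lambda i: abs(track_seconds[i] - local_seconds),
    )
    if abs(track_seconds[best_index] - local_seconds) <= tolerance:
        return best_index
    return None
```

A tolerance of this kind would let a file that is a couple of seconds off still match, while a live recording that runs much longer than the studio version would be rejected rather than matched to the wrong track.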