Here is the second part of a three-part series on Music Recommendation and Machine Learning.
Working on music analysis is fascinating. Yet, sometimes, it seems like a thankless job. Not only because it’s a hard, cutting-edge scientific problem, but also because it doesn’t look that hard to many people. When trying to explain the job to the general public, it can easily seem like trying to solve an obvious problem with nuclear tools.
Why? Maybe because – as human beings – we are really good at listening to music. In this song for example, we can identify in a second that there is a girl singing with a piano in the background. We can also say that it’s an acoustic track, laid-back, etc. Based on these impressions, we can compare tracks and build playlists of similar tracks. A no-brainer for a human.
Interestingly, the process in our brain leading to these impressions is unconscious. We immediately know that there’s a girl singing; we don’t have to think about it, it’s right there.
Our first job, if we want to build a content-based recommendation system, is to reproduce this part of the human brain.
What’s the goal?
In order to build a content-based music recommendation system, the first core problem to solve is to compute a “music similarity measure” from the audio. We have to build a system able to compare two tracks. This system should also give a numerical value to the music similarity – i.e., a perceptual distance between tracks. We can split the job into two tasks:
- analyze a track and extract a numerical representation
- compare the music representations to determine a distance
Our music representation will be a vector (i.e. a list of numbers). Once the representation vectors are extracted from the signals, the perceptual distance between tracks will be estimated by computing the distance between their representation vectors.
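To make the second task concrete, here is a minimal sketch of comparing two representation vectors with a Euclidean distance. The vectors and values are purely illustrative, and Euclidean distance is only one possible choice of metric:

```python
import numpy as np

# Hypothetical representation vectors for two tracks
# (here just two dimensions, e.g. bass and treble).
track_a = np.array([0.8, 0.3])
track_b = np.array([0.7, 0.4])

def perceptual_distance(u, v):
    """Estimate the perceptual distance between two tracks
    as the Euclidean distance between their representation vectors."""
    return float(np.linalg.norm(u - v))

perceptual_distance(track_a, track_b)  # ≈ 0.1414
```

With more realistic, higher-dimensional vectors the code is unchanged: the whole recommendation problem then reduces to finding the nearest neighbors of a track in this space.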
Let’s start with a simple audio analyzer: the equalizer
As a first representation, we can measure the quantity of bass and treble of each track.
Let’s try to solve a basic three song problem with our system. Listen to these tracks:
It appears that the first two tracks sound quite similar, as compared to the third. We can now represent these tracks by two-dimensional vectors (bass, treble). Based on these representations we can plot them into a graph: the music representation space.
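A toy version of this “equalizer” analyzer can be sketched with a Fourier transform: measure how much of the signal’s energy sits below versus above a split frequency. The 500 Hz split and the sample rate are illustrative choices, not values from the original system:

```python
import numpy as np

def bass_treble(signal, sample_rate, split_hz=500.0):
    """Crude equalizer representation: the share of spectral energy
    below (bass) and above (treble) a split frequency."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2        # power spectrum
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = spectrum.sum()
    bass = spectrum[freqs < split_hz].sum() / total
    treble = spectrum[freqs >= split_hz].sum() / total
    return np.array([bass, treble])                    # 2-D representation vector

# Sanity check: a pure 100 Hz tone should land almost entirely in the bass bin.
sr = 8000
t = np.arange(sr) / sr
vec = bass_treble(np.sin(2 * np.pi * 100 * t), sr)
```

Each track is thus reduced to a two-dimensional vector that can be placed directly on the (bass, treble) graph described above.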
Track #1 and Track #2 are close relative to Track #3, job done!
But, what if we add more tracks?
Some dissimilar tracks come between Track #1 and Track #2.
To solve this new issue, we have to augment the representation vector with new criteria. We need to extract more information from the music signal if we want a rich, complete representation of all the feelings conveyed by the music. The question now is: what kind of information is relevant to the human perception of similarity?
The musicologist temptations
If we are looking for tracks that sound similar to this one, we will naturally look for songs with a woman singing and a background piano. To do this automatically, we can try to build a content-based woman-voice detector and a piano detector. Then, we can say that two tracks are similar if they both respond strongly to our two tag detectors. Of course, we have to build many detectors to describe the diversity of large music catalogs: rhythm patterns, melody, keys (minor, major, etc.), instruments, type of voice, etc.
But this approach doesn’t work in practice, for many reasons:
- content-based music detectors are difficult to build and usually have error rates around 10–20%, and each error can contaminate the whole representation vector,
- the set of criteria we define may be incompatible with some music genres – dubstep, for example,
- it is not possible to represent all feelings shared by music with a set of objective criteria. Music goes beyond words.
The best way to deal with automatic music similarity is to construct an abstract music representation, independent of any musicological priors.
State of the art
According to the MIREX evaluation campaign, the best academic similarity system in the content-based domain can be summarized as follows.
The extraction of the music representation vector consists of:
- computing the spectrogram (intensity of different frequencies at different instants in time)
- extracting short-term features, i.e. signal properties with a high rate of change (MFCCs, onset coefficients, spectral flatness, fluctuation patterns, etc.)
- modeling the statistical distribution of the short-term features and embedding it in a vector (using a GMM supervector, for example)
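The three steps above can be sketched in a heavily simplified, NumPy-only form. Real systems use MFCCs and GMM supervectors; here, log-spectrogram frames stand in for the short-term features and per-bin mean/standard deviation stand in for the distribution model, and all parameter values are illustrative:

```python
import numpy as np

def representation_vector(signal, frame_len=1024, hop=512):
    """Toy version of the pipeline:
    1. spectrogram, 2. short-term frame features, 3. statistical summary."""
    # 1. Magnitude spectrogram: one FFT per overlapping, windowed frame.
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([signal[i * hop: i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))
    # 2. Short-term features: log-compressed spectra (stand-in for MFCCs etc.).
    feats = np.log1p(spec)
    # 3. Model their distribution: mean and std per frequency bin, concatenated
    #    into one vector (a crude replacement for a GMM supervector).
    return np.concatenate([feats.mean(axis=0), feats.std(axis=0)])

# Two seconds of noise at 8 kHz as a stand-in signal.
vec = representation_vector(np.random.default_rng(0).standard_normal(16000))
```

Even in this toy form, the output has over a thousand dimensions (two statistics per FFT bin), which gives a sense of why the real representations end up so large and so hard to interpret.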
This approach usually leads to a music representation having thousands of dimensions. Each dimension contains some abstract information about the song, in the sense that we can’t interpret it in a musicological way.
And in practice, it works quite well. It’s a great basis for tasks like classifying music into words (instrument, genre, etc.) or detecting relevant audio features. But does this kind of information really translate into a good recommendation? Right now, it’s not enough. Still, it could be great for some purposes, like organizing a playlist: in theory, songs could flow into one another the way a DJ would blend them. However, that would imply a DJ who organizes tracks only by BPM, key, and the like. The emotional and cultural dimensions of music are simply not taken into account. We still have a big A.I. challenge.