This post was written by Christophe (CTO and founder)
Doing research in deep learning is fascinating. Sometimes, you can have an idea when you get up, implement it during the morning, and get the result two days later. Of course, most of the time, the results are not so good, but it’s always a step toward a better understanding of deep neural networks. In practice, to achieve a significant improvement, we need to run a lot of experiments, often hundreds. Long experiments + lots of iterations: we needed a framework to build a successful research team.
We are five people in the research team at Niland. Our job is to create new deep neural network architectures and algorithms capable of analysing and understanding audio signals. Thanks to companies like Amazon, Google or IBM, we have access to an incredible amount of computational resources. We also have a lot of data from our selected partners, with high-quality human annotations. Having been doing this job for about 12 years now, I must say that the Niland lab is, by far, the best place I have experienced for doing research in artificial intelligence.
A Machine Learning First Company
At Niland, we put machine learning at the core of everything we do. When we started the company back in 2013, we wanted to use machine learning to replace every existing recommendation system. We are still working on it, enabling machines to learn what I want to listen to, not just playing with statistical laws to serve me tunes. In our quest for the perfect Music Perception Engine, we needed to position the research team as close as possible to our clients (music companies: labels, production libraries and streaming services).
Understanding our clients’ business problems is key to ensuring the success of our product. Therefore, we needed to create and define clear indicators to measure our performance, and to make sure these indicators answered a clear business need. As a small team with an ambitious goal (teaching machines to understand music is not an easy task), we also needed to make the most of our key asset: Niland’s people. We wanted to design the company so that product and core research work together, iterating through a virtuous loop. Having worked for years in academic labs, we knew which pitfalls we wanted to avoid.
Common Pitfalls in Machine Learning Teams
I spent about 10 years in machine learning labs before we started the company, and I had the chance to exchange with world-class researchers from famous laboratories like MIT or CRIM. I have found that machine learners share a common way of working, even when they are working on very different topics. They also face the same common problems that, in my opinion, significantly slow down their work and could make them miss great discoveries.
Computers do not run at 100%, day and night.
We all have a lot of experiments to run. If you discuss a researcher’s work with them for five minutes, they can usually list dozens of new ideas they want to explore. And as you may know, in machine learning, even great ideas rarely work on the first run. We need to run a lot of experiments with different settings to get the best out of our ideas, or to discard a wrong research path. But if you take the time to watch the activity logs of the lab’s computers, you’ll be surprised to see how low the computing usage is. For me, the main reason for this contradiction is the way people launch experiments. And there is room for optimization here.
Researchers do not share their experiment code.
In a research team, people exchange a lot of ideas during coffee breaks or meetings. But they do not share their experiment code easily. Running a colleague’s experiment means getting their code, their data and their computing environment. That can take several hours. It also rarely happens because people tend to work on different research ideas. They work on their own intuitions, on ideas they believe in. Asking a colleague for their experiment code can be perceived as trying to double-cross them on their own idea. That’s not easy. And that’s why we have to organize this sharing and put it at the center of the research workflow.
Bad results are lost, forever.
Researchers are not enthusiastic about discussing their failed experiments. Bad results are rarely shared with the rest of the team. Here is a classic discussion I have about failed experiments:
(me) Hey John*, what about your last experiment on the X algorithm?
(John) Well… the X algorithm? Hmm, I tried it a week ago but I got bad results.
(me) Are you sure it was not from the settings?
(John) Yes, I’m quite sure.
(me) How many experiments did you run?
(John) I don’t remember, five I guess…
(*This story is purely fictitious, and resemblance to existing persons, living or dead, is coincidental)
Bad results are demotivating. But often, the problem does not come from the algorithm itself but from its implementation or its settings. If we decide to discard an idea just because we got bad results on the first experiments, we risk missing a discovery. And in our job, discoveries are scarce. That’s why it’s really important to archive failed experiments. Maybe a colleague with a different perspective could make it work.
It’s often hard to retrieve the best settings.
Another common pitfall is losing the best experiment settings. Not being able to reproduce the experiment you ran two weeks ago, with such good results, is a super frustrating experience.
Evaluation made on changing datasets.
Automatic evaluation is the keystone of machine learning first companies. It is the link between client satisfaction and the machine learning team. At Niland, we spend a large part of our time improving our evaluation process to stay as close as possible to our clients’ needs. Once the automatic evaluation is defined (in collaboration with the product team), researchers can focus on their job and work to improve the score on the metric.
But in return, the metric often changes, which makes comparisons between systems difficult. If we are not careful, it can become impossible to know whether our current system is better than the one we had six months ago.
Researchers spend a lot of time watching their algorithm running.
I know that a lot of us are used to doing this strange thing (I used to do it a lot!): we start a new experiment and stare at our terminal console to see how the algorithm is converging. We all know that it’s totally useless and we’d be better off waiting for the end of the experiment, but it’s stronger than us: we check the logs every five minutes instead of working on the next experiment.
The Niland lab framework
We worked hard to get the best out of our situation: five smart researchers, almost unlimited computational resources, high quality data. We created a tool to run, monitor and share our deep learning experiments. The framework is composed of a REST API, a dashboard interface, and a set of scripts. The goal is to centralize, distribute and monitor our experiments.
The key features are:
- Researchers submit their experiments
- The platform distributes the experiments on the available work servers
- Results are available through a dashboard
- All experiments are archived, reproducible and documented
- Every researcher can clone any experiment, modify it, and submit it again
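The clone-and-resubmit workflow above can be sketched with a minimal in-memory model. Everything here is hypothetical — `ExperimentStore`, `submit`, `clone` are illustrative names, not Niland’s actual API — but it captures the key idea: every experiment is archived with its settings and lineage, so anyone can start from a colleague’s exact configuration.

```python
import copy
import itertools

class ExperimentStore:
    """Hypothetical sketch of a centralized experiment registry:
    every submitted experiment is archived with its settings, and
    anyone can clone an archived one, tweak it, and submit it again."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.archive = {}  # experiment id -> record (never deleted)

    def submit(self, author, settings, parent=None):
        exp_id = next(self._ids)
        self.archive[exp_id] = {
            "author": author,
            "settings": dict(settings),
            "parent": parent,  # lineage keeps experiments traceable
            "result": None,
        }
        return exp_id

    def record_result(self, exp_id, score):
        self.archive[exp_id]["result"] = score

    def clone(self, exp_id, author, **overrides):
        """Start from a colleague's exact settings; change only what you need."""
        settings = copy.deepcopy(self.archive[exp_id]["settings"])
        settings.update(overrides)
        return self.submit(author, settings, parent=exp_id)

# A colleague's good baseline...
store = ExperimentStore()
base = store.submit("martin", {"lr": 1e-3, "noise": 0.1})
store.record_result(base, 0.82)

# ...cloned with one setting changed, full lineage preserved.
mine = store.clone(base, "christophe", lr=3e-4)
print(store.archive[mine]["parent"])             # -> 1
print(store.archive[mine]["settings"]["noise"])  # -> 0.1
```

Because the clone records its parent, a chain of improvements can always be traced back to the baseline it started from.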
This tool changed the way we work in many ways: by increasing collaboration and interaction within the research team, it multiplied the number of experiments we launch every week and really sped up the research-to-production process. Here’s what we learnt from our work with the Niland lab framework and some best practices we got from it.
Our first need was to centralize our experiments’ results in a dashboard. This is really important for discussing the experiments we’ve conducted, which we do every Friday. And it’s interesting to notice how many different interpretations we can draw from a table of scores. That helps us decide on the best research direction for the next week.
But, to me, the most important feature is the ability to clone anyone’s experiment. When one of us finds something relevant and gets a significant improvement, we can easily clone it and use it as the new baseline. This allows us to test our ideas on our best system and accumulate the improvements of the whole team. As an illustration, the following graph shows our November deep learning research. Aloïs worked on semi-supervised learning, Martin worked on adding noise to the spectrogram during training, while I was trying to add a new reconstruction cost.
Martin and Aloïs got significant improvements, and we decided to merge the systems on the 13th. We were pleased to see that Aloïs’s and Martin’s improvements worked well together (which is not always the case). Then we continued to work on our topics, using the merged best system as the new baseline.
All the systems are archived, which makes it easy to compute statistics on past experiments. It’s also useful when we develop a new metric: we can easily evaluate all past experiments against it. It’s always a pleasure to see our progress since the beginning of the company.
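Re-scoring the archive against a new metric amounts to one pass over the stored records. A hypothetical sketch, assuming each archived experiment kept its raw predictions and labels (the data and the `accuracy` metric below are made up for illustration):

```python
# Hypothetical archive: each experiment kept its raw predictions,
# so any metric defined later can be applied retroactively.
archive = {
    "exp_001": {"predictions": [1, 0, 1, 1], "labels": [1, 0, 0, 1]},
    "exp_002": {"predictions": [1, 1, 1, 1], "labels": [1, 0, 0, 1]},
}

def accuracy(preds, labels):
    """Stand-in for a newly defined evaluation metric."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# Evaluate every past experiment with the new metric in one pass.
rescored = {exp_id: accuracy(r["predictions"], r["labels"])
            for exp_id, r in archive.items()}
print(rescored)  # -> {'exp_001': 0.75, 'exp_002': 0.5}
```

The design choice that makes this possible is archiving raw outputs rather than only the final score: scores computed under an old metric cannot be converted to a new one, but stored predictions can always be re-scored.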
Another big issue for us was finding a way to get the best out of all the computational resources from our partners. We want to warmly thank AWS and IBM, who gave us access to tens of GPUs for years. We designed our framework to distribute experiments across all the available servers. It’s also nice to be able to submit 20 experiments with different settings in a few minutes right before the weekend and get the results on Monday. In May, we ran experiments on 25 GPUs for a month and got results which would not have been achievable without these computational resources.
From research to product
This tool allows us to bring research closer to the product and to our clients. We work with the product team to define new metrics that fit the clients’ needs. Then we can discuss research advances with both our sales and marketing people. The research dashboard is also a great tool for deciding together when it’s time to push a new system into production. We linked this framework to our continuous integration system (based on Jenkins). We are quite glad to be able to push a new system to our production servers in less than an hour.
We recently added a web page dedicated to perceptual audio evaluation of our music recommendation system (listening test sessions). It allows us to check whether an improvement on our metrics significantly improves the perceived quality. We can also share these tests with some of our clients and partners. This is really helpful for getting the research team involved in client satisfaction.
Everything at Niland, from the research lab to the end product, is designed to improve our agility. Now that we have this whole framework for conducting research, what could we improve next? Well, the last missing piece could be a hyper-parameter optimization system, because fine-tuning an experiment is a repetitive task which can be automated with expert-level performance. And it’s always a pleasure to be relieved of such repetitive tasks, to have more time to create new algorithms.
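A first step toward such a system could be as simple as a random search over the settings space, submitting each sampled configuration as a regular experiment. A minimal sketch, assuming a hypothetical search space and a made-up stand-in for the real training run:

```python
import random

random.seed(0)  # make the sketch reproducible

# Hypothetical search space for a training run.
space = {
    "lr": [1e-4, 3e-4, 1e-3, 3e-3],
    "dropout": [0.0, 0.2, 0.5],
    "batch_size": [32, 64, 128],
}

def sample(space):
    """Draw one random configuration from the search space."""
    return {name: random.choice(values) for name, values in space.items()}

def run_experiment(settings):
    # Stand-in for a real training run returning a validation score;
    # here just a made-up smooth function of the settings.
    return 1.0 - abs(settings["lr"] - 1e-3) - settings["dropout"] * 0.1

# Random search: sample N configurations, keep the best-scoring one.
trials = [sample(space) for _ in range(20)]
best = max(trials, key=run_experiment)
print(best)
```

In practice, the sampling loop would submit each configuration to the experiment platform instead of calling a local function, and smarter strategies (e.g. Bayesian optimization) could replace the random sampler without changing the surrounding workflow.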