For this week’s ML practitioner’s series, Analytics India Magazine got in touch with Agnis Liukis from Latvia, a Kaggle Grandmaster ranked 14th on the global leaderboard. In this interview, he shares his journey, along with tips and tricks.
How It All Began
Despite featuring in the top 10 of highly competitive Kaggle contests, Agnis is, surprisingly, neither a data scientist nor a machine learning engineer. He currently works as a Lead Software Architect at TietoEVRY, Latvia, where his main focus is on developing web solutions for the backend and frontend. His foray into machine learning stems from a combination of keeping up with trends and his love for programming and mathematics.
He has a Master’s degree in Information Technology, and all his Kaggle accomplishments are the direct result of his zeal to know more. Agnis is self-taught, and his journey into Machine Learning began five years ago when data science as a career was getting more attention globally. Agnis found this domain to have a promising future and started taking online courses. He started with the course from Andrew Ng, which he also recommends to beginners.
However, he soon realised that theory would only take him so far. He wanted to get hands-on experience of dealing with machine learning models. His search for high-quality machine learning challenges landed him in the coveted Kaggle community.
“At the beginning, there’s a lot of information to learn. One must learn theory to understand different algorithms, their weak and strong sides, and when to use which of them. They must also know things like what’s overfitting and how to deal with it. And many more things. And in parallel, there’s a lot of tools and frameworks available – they will have to be learned as well,” says Agnis, remembering his initial days.
Though he wasn’t new to Python, it took him some time to get used to Pandas, NumPy and the other libraries that are popular in competitions, such as LightGBM, XGBoost, Scikit-learn and Matplotlib. Today, after having participated in more than 100 competitions on Kaggle, Agnis has eight gold medals and is currently ranked 14th out of 139,500 participants.
“One doesn’t have to be an expert to start competing. In my first competition on Kaggle, I got a silver medal based on my knowledge in Math and Probabilities.”
Agnis says that the yearning for competition comes to him naturally. Be it sports or computer games, he likes to compete. “So it was quite natural for me to enter the world of Competitive Data Science once I decided to learn Data Science,” says Agnis.
When approaching a problem on Kaggle, Agnis says he first reads and tries to understand the underlying problem, the data and its purpose. For data exploration, he underlines the importance of looking at raw data, especially in competitions. “Many EDAs in Kaggle are typically approaching data from ‘heights’ – looking at some general statistics, distributions, trends and so on. But sometimes, key insights are hidden in raw data values.”
“For example, some competitions have a special pattern of digits after comma – like many values ending with .33333, .66666 and so on, which give some clues about how this data was obtained and how to use that information to improve the score. And things like this can’t be seen from general statistics, but only from raw data,” explained Agnis.
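The kind of raw-value check described above can be sketched in a few lines of Python. This is only an illustration, not Agnis’s own code: the function name `fractional_pattern_counts` and the sample values are made up for the example.

```python
from collections import Counter

def fractional_pattern_counts(values, digits=3):
    """Count how often each fractional part (rounded to `digits`) occurs."""
    counts = Counter()
    for value in values:
        fraction = abs(value) - int(abs(value))  # keep only the part after the decimal point
        counts[f"{fraction:.{digits}f}"] += 1
    return counts

# Hypothetical column of raw values, invented for this sketch.
raw_values = [1.33333, 2.66666, 5.0, 7.33333, 4.66666, 3.5]
pattern_counts = fractional_pattern_counts(raw_values)

# List the most frequent fractional endings first.
for fraction, count in pattern_counts.most_common():
    print(fraction, count)
```

If a few endings such as 0.333 and 0.667 dominate a column, that may hint that the values were produced by dividing by three, something a histogram of the whole column would not necessarily reveal.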
Once a working pipeline is ready, he usually makes some initial submissions to calibrate the cross-validation. He considers this an essential strategy, as it leads to better generalisation and helps avoid overfitting to the public leaderboard. He reveals that dropping in the rankings during the shake-up, when the private leaderboard is revealed, is one of the things competitors fear most. Hence, good cross-validation is a key factor in avoiding this.
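For readers new to the idea, here is a minimal sketch of k-fold cross-validation using only the Python standard library. It is not the pipeline Agnis uses: the helper names are hypothetical, and the “model” is a placeholder that simply predicts the mean of the training targets.

```python
import random
from statistics import mean

def k_fold_indices(n_rows, n_folds=5, seed=0):
    """Shuffle row indices once and split them into n_folds validation folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def cross_validate(targets, n_folds=5, seed=0):
    """Return the mean absolute error of a placeholder model on each fold."""
    scores = []
    for fold in k_fold_indices(len(targets), n_folds, seed):
        held_out = set(fold)
        train = [t for i, t in enumerate(targets) if i not in held_out]
        prediction = mean(train)  # placeholder model: predict the training mean
        errors = [abs(targets[i] - prediction) for i in fold]
        scores.append(mean(errors))
    return scores

# Toy targets, invented for this sketch.
targets = [float(x) for x in range(1, 21)]
scores = cross_validate(targets, n_folds=5)
cv_score = mean(scores)
print(scores, cv_score)
```

In practice, one would compare `cv_score` with the public leaderboard score of the same model over a few submissions. If the two move together, the local score can be trusted more than the leaderboard when choosing between ideas.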
Achieving the top spot in Kaggle takes time, says Agnis. He stresses the importance of setting aside time to learn, explore data, write code for models, run experiments, and read and explore ideas in the competition forum. Time is a significant differentiator, and he laments that he has little of it right now, with a full-time job and a family. “Basically I’ve exchanged most of my evenings of ‘watching TV’ for evenings of ‘competing on Kaggle’. That’s more fun and also more useful,” quipped Agnis.
Agnis looks at Kaggle as a great way to stay up-to-date with all bleeding-edge technologies and approaches in Data Science and Machine Learning.
Tips, Tricks And Tools
For competitions, Agnis typically works on his home computer (16 GB RAM), which he considers to be enough for most problems, or at least for tabular data and NLP. For computationally intensive contests, he prefers to create a virtual machine on Google Cloud Platform with the required power.
However, he reminds us that many recent Kaggle competitions use the “Code competition” format, which requires submissions to be made through Kaggle kernels. For these, he often uses the resources offered by Kaggle, which have about the same power as his home computer but allow multiple kernels to be launched in parallel.
“Fail fast when testing new ideas. If an idea is not working, simply forget about it and try something new.”
To avoid getting caught up in non-working ideas for weeks and wasting a lot of time, Agnis recommends that participants move on to the next idea as soon as possible. That said, he admits to giving his failed ideas one final shot before scrapping them, just in case there are any bugs.
When it comes to libraries, Agnis expresses his fondness for LightGBM, mainly due to its speed and low memory requirements. LightGBM is also his primary option for all tabular data problems. For neural networks, he prefers the popular Keras library. For the rest of his data science tasks in Python, he finds Pandas, NumPy, Scikit-learn and Matplotlib quite handy.
Talking about what it takes to go from good to great, Agnis cited a competition that he won by teaming up with Evgeny Patekha. The competition, titled ‘Sberbank Russian Housing Market’, involved predicting house prices in Moscow. Agnis recalled how hard the dataset was and how difficult it was to get stable cross-validation because of the small size of the data and the many significant outliers.
The summary of the thought process that fetched him gold:
There were two significantly different types of products, and the critical factor was to notice this and train separate models for each type.
Only one of those product types contained outliers, so using only the other type for CV worked well.
The strongest feature for predicting price was the full area of the house, but it contained many data errors (typos, unrealistic values, etc.). So the raw feature was removed and used in some derived features instead.
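Two of the ideas above, one model per product type and using the area only through a derived feature, can be sketched as follows. This is a toy illustration under stated assumptions, not the winning solution: the column names, the `area_per_room` feature and the one-variable linear model are all hypothetical, and the interview does not say which derived features were actually used.

```python
from collections import defaultdict
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x with one feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    variance = sum((x - x_bar) ** 2 for x in xs)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = covariance / variance if variance else 0.0
    return y_bar - slope * x_bar, slope

def area_per_room(row):
    """Hypothetical derived feature: the raw area is never fed to the model directly."""
    return row["full_area"] / max(row["rooms"], 1)

def fit_per_type(rows):
    """Train one separate model for each product type."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["product_type"]].append(row)
    return {
        product_type: fit_line([area_per_room(r) for r in group],
                               [r["price"] for r in group])
        for product_type, group in groups.items()
    }

def predict(models, row):
    """Route the row to the model trained for its product type."""
    intercept, slope = models[row["product_type"]]
    return intercept + slope * area_per_room(row)

# Toy rows, invented for this sketch.
rows = [
    {"product_type": "A", "full_area": 40, "rooms": 2, "price": 2000},
    {"product_type": "A", "full_area": 60, "rooms": 2, "price": 3000},
    {"product_type": "A", "full_area": 80, "rooms": 2, "price": 4000},
    {"product_type": "B", "full_area": 30, "rooms": 1, "price": 350},
    {"product_type": "B", "full_area": 50, "rooms": 1, "price": 550},
    {"product_type": "B", "full_area": 70, "rooms": 1, "price": 750},
]
models = fit_per_type(rows)
price_a = predict(models, {"product_type": "A", "full_area": 100, "rooms": 2})
price_b = predict(models, {"product_type": "B", "full_area": 90, "rooms": 1})
print(price_a, price_b)
```

Splitting by type lets each model learn its own relationship between the features and the price, instead of forcing one model to average over two very different groups.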
Future Of ML And Why Creativity Triumphs
“I can’t imagine ML disappearing from recommendation systems, risk scoring applications, automated text translation, and many other fields.”
When asked whether the hype around machine learning is real, Agnis stood by his early view of how promising data science would be. He opined that machine learning is here to stay and that its importance will only grow in the coming years as we collect more data, leverage more computing power, tweak algorithms and make them affordable.
He also speculates that there will be tremendous opportunities in the AutoML field, as many companies cannot afford to hire good data scientists to build models from scratch. However, he warns that many people think of machine learning as some kind of super algorithm that can solve problems perfectly. For instance, he explains, computer vision in medicine cannot be assumed to be 100% accurate, as the impact of a failure can be huge. But an ML model can help a doctor assess results more confidently.
“ML helps only if it is used in a smart way; understanding what it can do and what it can’t.”
For machine learning aspirants, he suggests getting hands-on with real ML tasks or competitions as soon as possible. Courses and theory may seem clear and straightforward, but when one is tasked with 1 million rows of raw data, the real training begins!
For participants who are looking to top the Kaggle leaderboards, Agnis cautions that good models alone won’t win competitions. “To stand out and get some real advantage, it is necessary to do something different, find something that others didn’t notice. And creativity is something that really helps in this. Another thing is patience and endurance, as most creative ideas won’t work and it can take many iterations and experiments to get things working,” concluded Agnis.
Source - AIM