HOW RELEVANT IS KAGGLE FOR REAL-LIFE DATA SCIENCE?



Kaggle is the most popular platform for data science competitions, and it has grown enormously over the last decade. The depth of knowledge shared on the platform is what makes Kaggle one of the most valuable resources for aspiring and professional data scientists, and its competitions can be a practical learning experience.


But Kaggle is not everything. It would be a mistake to assume you could take a Kaggle solution and make it part of your production pipeline or build a business around it. At the same time, plenty of people come up with very innovative solutions to their real-world problems, inspired mostly by winning solutions.


One of the biggest concerns is that Kaggle is not a solid place to learn applied data science. Another is that people often pursue success on Kaggle mainly to impress prospective employers. While there is nothing wrong with that approach, it is essential to remember that Kaggle focuses on only one part of the ML pipeline.


In an interview with Analytics India Magazine, Mathurin Aché, a Kaggle Master, said that Kaggle contests mostly focus on the performance aspect of models. Developing an ML product, by contrast, brings up challenges such as access to data, preprocessing, refining models to match customer needs, periodic monitoring to improve models, and a whole range of other issues.


Dealing With Real Life Complexity


Relying on Kaggle alone also means tuning models on datasets that have already been prepared for smooth consumption in competitions. Real-world data is almost always messier than what a competition presents, because a large part of the data science workflow is controlled on Kaggle. Kaggle does not take into account model complexity or real-world issues related to deployability.


Kaggle may consequently lead to the romanticisation of data science, which widens the existing gap between the expectations and the reality of a data science job. The truth is that, at the end of the day, the role of a data scientist is to solve a business problem, and Kaggle may not necessarily teach that.


Darragh Hanley, a Kaggle Grandmaster, told AIM how the Kaggle experience has come in handy in his own professional work. He followed this up, however, by saying that challenges in the real world are more complex, so Kaggle success should not be treated as a substitute for industry-level expertise. According to Darragh, while Kaggle helps one learn how to approach problems, working in the industry helps one learn which questions to answer in the first place. Once a data scientist has the right questions and the right data, simple algorithms are most often sufficient to solve a problem.


The reality is that once you put models into production, their online performance can degrade compared with their performance on validation data. This suggests that the validation stage may be overemphasised and that the model needs to be continuously updated using more recent data. In other words, if the bulk of a data scientist's job consists of maintaining machine learning models in a production environment rather than validating their accuracy, Kaggle may not teach much that applies directly in the enterprise environment.
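The gap between validation and production performance described above can be monitored with a simple check. The following is a minimal sketch, not a method from the interviewees: the function names `auc` and `needs_retraining`, the tolerance of 0.05, and the sample labels and scores are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count pairwise wins; a tie counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def needs_retraining(validation_auc: float, production_auc: float,
                     tolerance: float = 0.05) -> bool:
    """Flag the model when production AUC drops more than `tolerance` below validation AUC.

    The default tolerance is an illustrative assumption, not a recommended value.
    """
    return (validation_auc - production_auc) > tolerance


# Hypothetical data: scores on the held-out validation set and on recent production traffic.
validation_auc = auc([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
production_auc = auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1])
print(validation_auc, production_auc, needs_retraining(validation_auc, production_auc))
```

In practice the production labels usually arrive with a delay, so a check like this would run periodically on the most recent labelled window rather than once.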


Kaggle Was Never Meant To Simulate Real-Life Data Science Challenges


It is clear that data challenges in the real world are more complex than those in online competitions. Online hackathons and Kaggle competitions might not paint the precise picture, and success at these competitions should not be confused with enterprise-level expertise. While Kagglers understand what data science work requires, they may become frustrated when they see that what they learned on Kaggle covers only a part of the real job function.


Kaggle is the biggest platform for data scientists and machine learning practitioners, and it gives aspirants valuable practical exposure to the complex world of data science. Despite its limitations, most experts have great admiration for the Kaggle community and the way it helps amateurs upskill through free courses, forums and kernels.


The value and the learning come not just from marginal improvements in AUC score, but from a useful understanding of business problems, the identification and use of the right data, and the implementation of the model during the contest. This exposure is incredibly relevant, especially because it happens in a highly competitive environment like Kaggle.


After each competition ends, the winners post what they did throughout the competition, and very often they also share their code. Digging deep into these post-competition write-ups can help a data scientist or machine learning professional, provided they already have some real skills and a competitive edge.


However, Kaggle was never intended to replicate machine learning and data science in the real world. If someone is looking for extensive exposure to different types of data and feature engineering techniques, wants to learn how to iterate on model building more quickly, and wants to be connected to a remarkable community of data scientists, then Kaggle is a great place.


Source - AIM


© 2019 by DoThink