Abstract
Data science pipelines are sequences of data processing steps that aim to derive knowledge and insights from raw data. Data science pipeline tools simplify the creation and automation of such pipelines by providing reusable building blocks that users can drag and drop into their pipelines. This graphical, model-driven approach enables users with limited data science expertise to create complex pipelines. However, recent studies show that several data science pitfalls can yield spurious results and, consequently, misleading insights. Yet, none of the popular pipeline tools has built-in quality control measures to detect these pitfalls. Therefore, in this paper, we propose an approach, called Pitfalls Analyzer, to detect common pitfalls in data science pipelines. As a proof of concept, we implemented a prototype of the Pitfalls Analyzer for KNIME, one of the most popular data science pipeline tools. Our prototype is itself model-driven, since the detection of pitfalls is accomplished using pipelines created with KNIME building blocks. To demonstrate the effectiveness of our approach, we ran our prototype on 11 pipelines created by KNIME experts for 3 Internet-of-Things (IoT) projects. The results indicate that our prototype flags all and only the pitfall instances that we identified through manual inspection of the pipelines.