Revealing the Semantics of Data Wrangling Scripts With COMANTICS

Kai Xiong, Zhongsu Luo, Siwei Fu, Yongheng Wang, Mingliang Xu, Yingcai Wu

View presentation: 2022-10-19T14:00:00Z
Exemplar figure: COMANTICS is a three-step pipeline that automatically detects the semantics of data wrangling scripts by inferring the types of data transformations and their parameters. COMANTICS first generates intermediate input and output tables for each line of code and detects the changes between them. It then identifies the transformation type through characteristic-based and CNN-based components. Finally, it infers the parameters of the transformation by employing a “slot filling” strategy.

Prerecorded Talk

The live footage of the talk, including the Q&A, can be viewed on the session page, Transforming Tabular Data and Grammars.

Abstract

Data workers often seek to understand the semantics of data wrangling scripts in scenarios such as debugging, reusing, and maintaining code. However, this understanding is challenging for novice data workers because of the variety of programming languages, functions, and parameters involved. Based on the observation that the differences between input and output tables are closely related to the type of data transformation, we outline a design space of 103 characteristics that describe table differences. We then develop COMANTICS, a three-step pipeline that automatically detects the semantics of data transformation scripts. First, it detects table differences for each line of wrangling code. Second, it combines a characteristic-based component and a Siamese convolutional neural network-based component to detect transformation types. Third, it derives the parameters of each data transformation by employing a "slot filling" strategy. We design experiments to evaluate the performance of COMANTICS and assess its flexibility using three example applications in different domains.
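The three steps described above can be illustrated with a toy sketch. This is not the COMANTICS implementation: the names `Table`, `TableDiff`, `diff_tables`, `classify`, `fill_slots`, and `describe` are hypothetical, the three hand-written rules stand in for the 103 characteristics and the Siamese CNN (neither of which is reproduced here), and the transformation labels are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    """A minimal table: column names plus rows of values (hypothetical type)."""
    columns: tuple
    rows: tuple


@dataclass
class TableDiff:
    """A few table-difference features (assumed; far fewer than the paper's 103)."""
    added_columns: list
    removed_columns: list
    row_delta: int


def diff_tables(before: Table, after: Table) -> TableDiff:
    """Step 1: detect differences between the input and output tables."""
    added = [c for c in after.columns if c not in before.columns]
    removed = [c for c in before.columns if c not in after.columns]
    return TableDiff(added, removed, len(after.rows) - len(before.rows))


def classify(diff: TableDiff) -> str:
    """Step 2: map differences to a transformation type with toy rules."""
    same_columns = not diff.added_columns and not diff.removed_columns
    if diff.removed_columns and not diff.added_columns and diff.row_delta == 0:
        return "delete_columns"
    if diff.added_columns and not diff.removed_columns and diff.row_delta == 0:
        return "create_columns"
    if same_columns and diff.row_delta < 0:
        return "delete_rows"
    return "unknown"


def fill_slots(kind: str, diff: TableDiff) -> dict:
    """Step 3: fill the parameter slots that belong to the detected type."""
    if kind == "delete_columns":
        return {"columns": diff.removed_columns}
    if kind == "create_columns":
        return {"columns": diff.added_columns}
    if kind == "delete_rows":
        return {"n_rows_removed": -diff.row_delta}
    return {}


def describe(before: Table, after: Table) -> tuple:
    """Run the three steps for one line of wrangling code."""
    diff = diff_tables(before, after)
    kind = classify(diff)
    return kind, fill_slots(kind, diff)


# Tables before and after a line such as `df = df.drop(columns=["age"])`.
before = Table(("id", "name", "age"), ((1, "Ann", 31), (2, "Bob", None)))
after = Table(("id", "name"), ((1, "Ann"), (2, "Bob")))
print(describe(before, after))  # ('delete_columns', {'columns': ['age']})
```

The sketch inspects only the tables, never the script text, which mirrors the observation in the abstract that table differences are closely related to the transformation type. A real system would need many more characteristics to separate transformations that leave similar traces.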