Publications

Human-in-the-Loop Feature Discovery for Tabular Data

Published in CIKM Demo, 2024

In this paper, we introduce the first user-driven human-in-the-loop feature discovery method called HILAutoFeat. We demonstrate the capabilities of HILAutoFeat, which effectively combines automated feature discovery with user-driven insights. Our demonstration is centred around two scenarios: (i) an automated feature discovery scenario, in which HILAutoFeat acts as a steward in a large data lake where the user is unaware of the quality and relevance of the data, and (ii) a scenario where HILAutoFeat and the user work together: the user drives the feature discovery process by adding their domain and business knowledge, while HILAutoFeat performs the intensive computations.

Download here

Key Insights from a Feature Discovery User Study

Published in SIGMOD HILDA Workshop, 2024

In this paper, we share key insights into the practices of feature discovery on tabular data performed by real-world data specialists derived from our user study. Our research uncovered differences between the user assumptions reported in the literature and the actual practices, as well as some areas where literature and real-world practices align.

Download here

AutoFeat: Transitive Feature Discovery over Join Paths

Published in ICDE, 2024

This paper proposes a novel ranking-based feature discovery method called AutoFeat. Given a base table with a target label, AutoFeat explores multi-hop, transitive join paths to find relevant features in order to augment the base table with additional features, ultimately leading to increased accuracy of an ML model. AutoFeat is general: it evaluates the predictive power of features without the need to train an ML model, ranking join paths using the concepts of relevance and redundancy.
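
As a rough illustration of ranking features by relevance and redundancy without training a model, the sketch below uses a generic greedy filter: relevance is the absolute Pearson correlation with the target, and redundancy is the mean absolute correlation with features already selected. These scoring choices, the function names, and the toy data are assumptions made for illustration and are not taken from the AutoFeat paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists (0.0 if constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(features, target, k):
    """Greedily pick up to k features, scoring each candidate as
    relevance to the target minus mean redundancy with the selected set."""
    selected = []
    remaining = dict(features)
    while remaining and len(selected) < k:

        def score(name):
            relevance = abs(pearson(remaining[name], target))
            if not selected:
                return relevance
            redundancy = sum(
                abs(pearson(remaining[name], features[s])) for s in selected
            ) / len(selected)
            return relevance - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical toy columns, as if gathered from tables along a join path.
target = [3, 1, -1, -3]
features = {
    "a": [2, 1, -1, -2],   # highly relevant
    "b": [1, 1, -1, -1],   # relevant, but largely redundant with "a"
    "c": [1, -1, 1, -1],   # less relevant, but adds new information
}
chosen = select_features(features, target, k=2)
print(chosen)
```

In this toy setting the second pick favours the feature that overlaps least with the first, even though another feature has higher relevance on its own.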

Download here

Amalur: Data Integration Meets Machine Learning

Published in ICDE, 2023

In this work, we present a vision of how to bridge traditional data integration (DI) techniques with the requirements of modern machine learning. We explore the possibilities of utilizing metadata obtained from data integration processes for improving the effectiveness and efficiency of ML models. To this end, we analyze two common use cases over data silos: feature augmentation and federated learning.

Download here

Topio: An Open-Source Web Platform for Trading Geospatial Data

Published in ICWE, 2023

With this paper, we report on the effort to engineer and develop an open-source modular data market platform to enable both entrepreneurs and researchers to set up and experiment with data marketplaces. To this end, we implemented and extended existing methods for data profiling, dataset search & discovery, and data recommendation.

Download here

Topio Marketplace: Search and Discovery of Geospatial Data

Published in EDBT Demo, 2023

The increasing need for data trading has created a high demand for data marketplaces. These marketplaces require a set of value-added services, such as advanced search and discovery, that have been proposed in the database research community for years, but are yet to be put into practice. In this paper we propose to demonstrate the Topio Marketplace, an open-source data market platform that facilitates the search, exploration, discovery and augmentation of data assets. To support filtering, searching and discovery of data assets, we developed methods to extract and visualise a variety of metadata, as well as methods to discover related assets and mechanisms to augment them. This paper aims at presenting these methods with a real deployment of the Topio Marketplace, comprising hundreds of open and proprietary datasets.

Download here

Join path-based data augmentation for decision trees

Published in ICDE DBML Workshop, 2022

Machine Learning (ML) applications require high-quality datasets. Automated data augmentation techniques can help increase the richness of training data, thus increasing the ML model accuracy. Existing solutions focus on efficiency and ML model accuracy but do not exploit the richness of dataset relationships. With relational data, the challenge lies in identifying join paths that best augment a feature table to increase the performance of a model. In this paper we propose a two-step, automated data augmentation approach for relational data that involves: (i) enumerating join paths of various lengths given a base table and (ii) ranking the join paths using filter methods for feature selection. We show that our approach can improve prediction accuracy and reduce runtime compared to the baseline approach.
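
The two steps described above can be sketched roughly as follows: a breadth-first enumeration of join paths from a base table up to a maximum number of hops, followed by ranking with a score per path. The join graph, the table names, and the per-table scores are hypothetical; in particular, `table_scores` only stands in for whatever a filter-based feature selection method would compute, which this sketch does not implement.

```python
from collections import deque


def enumerate_join_paths(join_graph, base_table, max_hops):
    """Return all cycle-free join paths that start at base_table
    and contain between 1 and max_hops joins."""
    paths = []
    queue = deque([[base_table]])
    while queue:
        path = queue.popleft()
        if len(path) - 1 == max_hops:
            continue  # path already uses the maximum number of joins
        for neighbour in join_graph.get(path[-1], []):
            if neighbour in path:
                continue  # avoid revisiting a table
            extended = path + [neighbour]
            paths.append(extended)
            queue.append(extended)
    return paths


def rank_join_paths(paths, table_scores):
    """Sort paths by the mean score of the joined tables, best first.
    The scores are placeholders for a filter-method relevance score."""

    def score(path):
        joined = path[1:]  # exclude the base table itself
        return sum(table_scores.get(t, 0.0) for t in joined) / len(joined)

    return sorted(paths, key=score, reverse=True)


# Hypothetical join graph: table -> tables it can be joined with.
join_graph = {
    "base": ["orders", "customers"],
    "orders": ["products"],
    "customers": [],
    "products": [],
}
table_scores = {"orders": 0.2, "customers": 0.5, "products": 0.9}

paths = enumerate_join_paths(join_graph, "base", max_hops=2)
ranked = rank_join_paths(paths, table_scores)
for path in ranked:
    print(" -> ".join(path))
```

Capping the number of hops keeps the enumeration bounded, since the number of candidate paths can grow quickly with path length in a densely connected data lake.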

Download here

Amalur: Next-generation Data Integration in Data Lakes

Published in CIDR Abstract, 2022

Data science workflows often require extracting, preparing and integrating data from multiple data sources. This is a cumbersome and slow process: most of the time, data scientists prepare data in a data processing system or a data lake, and export it as a table, in order for it to be consumed by a Machine Learning (ML) algorithm. Recent advances in the area of factorized ML allow us to push down certain linear algebra (LA) operators, executing them closer to the data sources. With this work, we revisit classic data integration (DI) systems and see how these fit into modern data lakes that are meant to support LA as a first-class citizen.

Download here

Interactive Data Discovery in Data Lakes

Published in VLDB PhD Workshop, 2021

As data is produced at an unprecedented rate, the need and expectation to make it easily available for the end-users is growing. Dataset Discovery has become an important subject in the data management community, as it represents the means of providing the data to the user and fulfilling an information need. Although the end-user is the one that needs the information and knows what type of information to look for, little has been done to involve the user in the discovery process.

Download here

Valentine in action: matching tabular data at scale

Published in VLDB Demo, 2021

In this demonstration we present Valentine's functionalities and enhancements: (i) a scalable system, with a user-centric GUI, that enables the fabrication of datasets and the evaluation of matching methods on schema matching scenarios tailored to the scope of tabular dataset discovery, and (ii) a scalable holistic matching system that can receive tabular datasets from heterogeneous sources and provide similarity scores among their columns, in order to facilitate modern procedures in data lakes, such as dataset discovery.

Download here

Valentine: Evaluating matching techniques for dataset discovery

Published in ICDE, 2021

Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use.

Download here