Presented AutoFeat at ICDE
Published:
In May, I attended ICDE in Utrecht, The Netherlands, where I presented my research paper on automated feature discovery - AutoFeat.
Published:
During my summer trip to NYC, I had the incredible opportunity to visit the IBM Thomas J. Watson Research Center, thanks to an invitation from Dr. Horst Samulowitz.
Published:
In June, I attended SIGMOD in Santiago, Chile, where I presented the key insights from a user study with industry data professionals on their workflow for creating training datasets for ML applications.
Published:
In October, I attended CIKM in Boise, Idaho, where I presented my demo, the human-in-the-loop version of AutoFeat.
Published:
During my time at DuckDB Labs Amsterdam, I had the pleasure of working with Gábor Szárnyas on a blog post about DuckDB tricks. Check out the blog post on the DuckDB website.
Published:
🎓 The beginning of 2025 marks the end of my PhD journey 🎓
Published in ICDE, 2021
Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use.
Download here
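As a rough illustration of the column-level schema matching described in the abstract above, the sketch below scores every source/target column pair by the Jaccard overlap of their values. The Jaccard measure, the `match_columns` helper, the threshold, and the toy tables are assumptions made for illustration; they are not the matching methods evaluated in the paper.

```python
def jaccard(a, b):
    """Jaccard similarity between the distinct values of two columns."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_columns(source, target, threshold=0.3):
    """Return (source_col, target_col, score) pairs at or above the threshold.

    `source` and `target` map column names to lists of values.
    The threshold is an arbitrary, illustrative cut-off.
    """
    matches = []
    for s_name, s_values in source.items():
        for t_name, t_values in target.items():
            score = jaccard(s_values, t_values)
            if score >= threshold:
                matches.append((s_name, t_name, score))
    # Best-scoring correspondences first.
    return sorted(matches, key=lambda m: -m[2])


# Hypothetical toy tables.
source = {"cust_id": [1, 2, 3], "city": ["Delft", "Utrecht"]}
target = {"customer": [2, 3, 4], "town": ["Utrecht", "Delft", "Boise"]}

matches = match_columns(source, target)
for s_name, t_name, score in matches:
    print(f"{s_name} <-> {t_name}: {score:.2f}")
```

In a dataset discovery setting, scores like these would typically be computed across many tables in a data lake rather than for a single source/target pair.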
Published in VLDB Demo, 2021
In this demonstration we present its functionalities and enhancements: i) a scalable system, with a user-centric GUI, that enables the fabrication of datasets and the evaluation of matching methods on schema matching scenarios tailored to the scope of tabular dataset discovery, ii) a scalable holistic matching system that can receive tabular datasets from heterogeneous sources and provide similarity scores among their columns, in order to facilitate modern procedures in data lakes, such as dataset discovery.
Download here
Published in VLDB PhD Workshop, 2021
As data is produced at an unprecedented rate, the need and expectation to make it easily available for the end-users is growing. Dataset Discovery has become an important subject in the data management community, as it represents the means of providing the data to the user and fulfilling an information need. Since the end-user is the one that needs the information and knows what type of information to look for, little has been done to involve the user in the discovery process.
Download here
Published in CIDR Abstract, 2022
Data science workflows often require extracting, preparing and integrating data from multiple data sources. This is a cumbersome and slow process: most of the time, data scientists prepare data in a data processing system or a data lake, and export it as a table, in order for it to be consumed by a Machine Learning (ML) algorithm. Recent advances in the area of factorized ML allow us to push down certain linear algebra (LA) operators, executing them closer to the data sources. With this work, we revisit classic data integration (DI) systems and see how these fit into modern data lakes that are meant to support LA as a first-class citizen.
Download here
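To make the idea of pushing a linear algebra operator down to the sources more concrete, here is a minimal sketch of a matrix-vector product over a key/foreign-key join, computed once on the materialized join and once in factorized form. The function names, the two-table layout, and the toy values are assumptions for illustration only; the abstract does not specify which operators or systems are used.

```python
def dot(row, weights):
    """Plain dot product of two equally long lists."""
    return sum(x * w for x, w in zip(row, weights))


def materialized_matvec(S, fk, R, w_s, w_r):
    """Join S with R through the foreign key, then multiply by the weights."""
    joined = [s_row + R[k] for s_row, k in zip(S, fk)]
    weights = w_s + w_r
    return [dot(row, weights) for row in joined]


def factorized_matvec(S, fk, R, w_s, w_r):
    """Push the product down to each table and combine the partial results.

    The partial product over R is computed once per R row, not once per
    joined row, which is where the savings come from when R is reused.
    """
    partial_r = [dot(r_row, w_r) for r_row in R]
    return [dot(s_row, w_s) + partial_r[k] for s_row, k in zip(S, fk)]


# Hypothetical toy data: S references rows of R through `fk`.
S = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
fk = [0, 1, 0]
R = [[10.0], [20.0]]
w_s, w_r = [1.0, 1.0], [0.5]

full = materialized_matvec(S, fk, R, w_s, w_r)
pushed = factorized_matvec(S, fk, R, w_s, w_r)
print(full, pushed)
```

Both functions produce the same vector; the factorized version never builds the joined table.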
Published in ICDE DBML Workshop, 2022
Machine Learning (ML) applications require high-quality datasets. Automated data augmentation techniques can help increase the richness of training data, thus increasing the ML model accuracy. Existing solutions focus on efficiency and ML model accuracy but do not exploit the richness of dataset relationships. With relational data, the challenge lies in identifying join paths that best augment a feature table to increase the performance of a model. In this paper we propose a two-step, automated data augmentation approach for relational data that involves: (i) enumerating join paths of various lengths given a base table and (ii) ranking the join paths using filter methods for feature selection. We show that our approach can improve prediction accuracy and reduce runtime compared to the baseline approach.
Download here
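The two steps named in the abstract above, enumerating join paths from a base table and ranking them with a filter method, can be sketched as follows. This is a simplified sketch under stated assumptions: the join graph, the one-feature-per-table layout (already aligned to the base table rows), and the use of absolute Pearson correlation as the filter score are all hypothetical and are not taken from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def enumerate_join_paths(graph, base, max_hops):
    """Step (i): all simple join paths from `base` with 1..max_hops joins."""
    paths, stack = [], [[base]]
    while stack:
        path = stack.pop()
        if len(path) > 1:
            paths.append(path)
        if len(path) - 1 < max_hops:
            for neighbour in sorted(graph.get(path[-1], [])):
                if neighbour not in path:  # avoid revisiting a table
                    stack.append(path + [neighbour])
    return sorted(paths, key=lambda p: (len(p), p))


def rank_join_paths(paths, feature_of, label):
    """Step (ii): rank paths by a filter score of the feature they add."""
    scored = [(abs(pearson(feature_of[p[-1]], label)), p) for p in paths]
    return sorted(scored, key=lambda item: -item[0])


# Hypothetical join graph and features aligned to the base table rows.
graph = {"base": ["orders", "customers"], "orders": ["products"]}
feature_of = {
    "customers": [1, 2, 1, 2],
    "orders": [1, 2, 3, 4],
    "products": [0, 0, 5, 5],
}
label = [0, 0, 1, 1]

paths = enumerate_join_paths(graph, "base", max_hops=2)
ranked = rank_join_paths(paths, feature_of, label)
for score, path in ranked:
    print(" -> ".join(path), round(score, 3))
```

No model is trained at any point: the filter score is computed directly from the feature and the label, which is what keeps the ranking step cheap.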
Published in EDBT Demo, 2023
The increasing need for data trading has created a high demand for data marketplaces. These marketplaces require a set of value-added services, such as advanced search and discovery, that have been proposed in the database research community for years, but are yet to be put into practice. In this paper we propose to demonstrate the Topio Marketplace, an open-source data market platform that facilitates the search, exploration, discovery and augmentation of data assets. To support filtering, searching and discovery of data assets, we developed methods to extract and visualise a variety of metadata, as well as methods to discover related assets and mechanisms to augment them. This paper aims at presenting these methods with a real deployment of the Topio marketplace, comprising hundreds of open and proprietary datasets.
Download here
Published in ICWE, 2023
With this paper, we report on the effort to engineer and develop an open-source modular data market platform to enable both entrepreneurs and researchers to set up and experiment with data marketplaces. To this end, we implemented and extended existing methods for data profiling, dataset search & discovery, and data recommendation.
Download here
Published in ICDE, 2023
In this work, we present a vision of how to bridge the traditional data integration (DI) techniques with the requirements of modern machine learning. We explore the possibilities of utilizing metadata obtained from data integration processes for improving the effectiveness and efficiency of ML models. Towards this direction, we analyze two common use cases over data silos, feature augmentation and federated learning.
Download here
Published in ICDE, 2024
This paper proposes a novel ranking-based feature discovery method called AutoFeat. Given a base table with a target label, AutoFeat explores multi-hop, transitive join paths to find relevant features in order to augment the base table with additional features, ultimately leading to increased accuracy of an ML model. AutoFeat is general: it evaluates the predictive power of features without the need to train an ML model, ranking join paths using the concepts of relevance and redundancy.
Download here
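To give a feel for ranking by relevance and redundancy without training a model, the sketch below greedily orders candidate features by their relevance to the label minus their average redundancy with the features already chosen. It is only an approximation of the idea in the abstract above: it ranks individual features rather than join paths, and the use of absolute Pearson correlation for both measures, the `rank_features` helper, and the toy data are assumptions, not AutoFeat's actual metrics or API.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_features(features, label):
    """Greedy ranking: high relevance to the label, low redundancy with picks."""
    remaining = dict(features)
    selected = []
    while remaining:
        def score(name):
            relevance = abs(pearson(remaining[name], label))
            if not selected:
                return relevance
            redundancy = sum(
                abs(pearson(remaining[name], features[s])) for s in selected
            ) / len(selected)
            return relevance - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Hypothetical candidate features gathered from joined tables.
features = {
    "a": [1, 2, 3, 4],
    "a_dup": [1, 2, 3, 5],  # nearly a copy of "a"
    "b": [1, 0, 1, 1],
}
label = [0, 0, 1, 1]

order = rank_features(features, label)
print(order)
```

In this toy example `a_dup` is more correlated with the label than `b`, yet it is ranked last because it adds almost nothing beyond `a`.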
Published in SIGMOD HILDA Workshop, 2024
In this paper, we share key insights into the practices of feature discovery on tabular data performed by real-world data specialists derived from our user study. Our research uncovered differences between the user assumptions reported in the literature and the actual practices, as well as some areas where literature and real-world practices align.
Download here
Published in CIKM Demo, 2024
In this paper, we introduce the first user-driven human-in-the-loop feature discovery method called HILAutoFeat. We demonstrate the capabilities of HILAutoFeat, which effectively combines automated feature discovery with user-driven insights. Our demonstration is centred around two scenarios: (i) an automated feature discovery scenario – HILAutoFeat acts as a steward in a large data lake where the user is unaware of the quality and relevance of the data, and (ii) a scenario where HILAutoFeat and the user work together – the user drives the feature discovery process by adding their domain and business knowledge, while HILAutoFeat performs the intensive computations.
Download here