Kitana Agentic Wrapper

March – April 2025


Built an agentic wrapper on top of the Kitana AutoML system to intelligently select and clean joinable tables under a resource budget.
Designed and tested multiple agentic architectures that combine embedding-based pruning with LLM-powered reasoning for table ranking and filtering.
Achieved up to 12.5% higher R² than embedding-only baselines on synthetic table benchmarks, with support for both no-hop and multi-hop joins.
Implemented the full pipeline in Python, using OpenAI and Gemini API calls for LLM reasoning and embedding generation.

Project Summary

As data lakes continue to grow, ML engineers face a painful bottleneck: identifying the right tables to clean and join to improve model performance. This project extends Columbia’s Kitana AutoML system, which searches for useful joins to improve a target column’s prediction accuracy.

We proposed an agentic system that decides which tables to clean under a limited budget, using past Kitana queries, the accuracy improvements they produced, and semantic cues. Our system beats a naive embedding baseline by up to 12.5% in R² and supports both no-hop and multi-hop table selection.
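To make the budgeted selection idea concrete, here is a minimal sketch of one way to pick tables under a cleaning budget: greedily by estimated gain per unit of cleaning cost. This is an illustration under stated assumptions, not the project's actual selection logic. The `Candidate` class, the `est_gain` and `clean_cost` fields, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    est_gain: float    # hypothetical estimate of the R² improvement from this table
    clean_cost: float  # hypothetical cleaning cost in budget units (must be > 0)


def select_under_budget(candidates, budget):
    """Greedily pick tables by estimated gain per unit cost while they fit the budget."""
    ranked = sorted(candidates, key=lambda c: c.est_gain / c.clean_cost, reverse=True)
    chosen, remaining = [], budget
    for cand in ranked:
        if cand.clean_cost <= remaining:
            chosen.append(cand.name)
            remaining -= cand.clean_cost
    return chosen


# Made-up candidates, for illustration only.
candidates = [
    Candidate("weather", est_gain=0.08, clean_cost=4.0),
    Candidate("census", est_gain=0.05, clean_cost=1.0),
    Candidate("zoning", est_gain=0.03, clean_cost=2.0),
]
selected = select_under_budget(candidates, budget=5.0)
```

With these made-up numbers, `census` has the best gain-to-cost ratio and is picked first, `weather` uses the rest of the budget, and `zoning` no longer fits. A greedy ratio rule is only a heuristic; an agentic planner could revise the gain estimates after each Kitana query instead of fixing them up front.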

How can we leverage agentic planning to identify the most impactful tables, including multi-hop joins, while staying within a limited data-cleaning budget?
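As a rough illustration of what considering multi-hop joins involves, the sketch below enumerates the tables reachable from a base table within a bounded number of joins, using breadth-first search over a join graph. The graph, the table names, and the `tables_within_hops` helper are hypothetical. The sketch counts a directly joinable table as one join away, which may not match how the project defines no-hop versus multi-hop.

```python
from collections import deque


def tables_within_hops(join_graph, base, max_hops):
    """Return {table: join count} for tables reachable from `base` in at most `max_hops` joins."""
    hops = {base: 0}
    queue = deque([base])
    while queue:
        table = queue.popleft()
        if hops[table] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in join_graph.get(table, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[table] + 1
                queue.append(neighbor)
    del hops[base]  # the base table is not a candidate for itself
    return hops


# Hypothetical join graph: an edge means the two tables share a joinable key.
join_graph = {
    "listings": ["neighborhoods"],
    "neighborhoods": ["listings", "census"],
    "census": ["neighborhoods"],
}
one_hop = tables_within_hops(join_graph, "listings", max_hops=1)
two_hop = tables_within_hops(join_graph, "listings", max_hops=2)
```

In this toy graph, `census` only becomes a candidate once two joins are allowed, because it connects to `listings` through `neighborhoods`. A search limited to direct joins would never surface it.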

Kitana Agentic System: Problem Formulation

Architecture: Selector Agent

Figure: Selector Agent architecture for table selection (embedding-based pruning plus an LLM-based enrichment pipeline)
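The combination of embedding-based pruning and LLM-based reasoning can be sketched as a two-stage flow: shortlist tables by cosine similarity, then reorder the shortlist with a second scorer. This is a simplified sketch, not the Selector Agent's implementation. The stage order is an assumption, the vectors are toy values rather than real embeddings, and `fake_llm_scores` is a hypothetical stand-in for a real OpenAI or Gemini call.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prune_by_embedding(query_vec, table_vecs, keep):
    """Stage 1: keep the `keep` tables whose embeddings are closest to the query."""
    ranked = sorted(table_vecs, key=lambda name: cosine(query_vec, table_vecs[name]), reverse=True)
    return ranked[:keep]


def rerank(shortlist, score_fn):
    """Stage 2: reorder the shortlist by an external scorer, highest score first."""
    return sorted(shortlist, key=score_fn, reverse=True)


# Toy 2-d "embeddings"; real ones would come from an embedding API.
table_vecs = {"census": [0.9, 0.1], "weather": [0.7, 0.7], "recipes": [0.0, 1.0]}
query_vec = [1.0, 0.0]
shortlist = prune_by_embedding(query_vec, table_vecs, keep=2)

# Hypothetical scores standing in for an LLM's judgment of each table's usefulness.
fake_llm_scores = {"census": 0.4, "weather": 0.9}
ranking = rerank(shortlist, lambda name: fake_llm_scores[name])
```

Pruning first keeps the number of LLM calls proportional to the shortlist rather than to the whole data lake. In this toy run the second stage overturns the embedding order by ranking `weather` above `census`.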

Code & Report

📄 Project Report (PDF)

💻 GitHub Repository

What I Learned

Large language models can extract and reason about joinability far beyond what static embeddings offer.
Agentic systems let us think about data engineering as a dynamic planning problem, not a static one-shot task.
Thinking in multi-hop joins opened my eyes to how limited most current data search systems really are.

Contributions and Acknowledgements

This project was completed as part of COMS 6113: Agentic Systems Made Real, a graduate-level research seminar taught by Professor Eugene Wu at Columbia University. I worked alongside Mateo Juliani (msj2164@columbia.edu) and Kaushal Damani (akd2990@columbia.edu).