Case Study

dataGenie

Active build

Non-technical users shouldn't need to write SQL to explore their own data. dataGenie lets them upload a CSV, ask questions in plain English, and get back query results, charts, and written explanations, so that data analysis feels as easy as having a conversation.

Last curated refresh: 2026-05-11 12:39 PM EDT


Case Study Map

Project framing

dataGenie is about lowering the skill barrier to data analysis without flattening the work into toy answers. I wanted a system where a non-technical person could ask a question in plain English and still get something that feels analytically serious.

Problem / Context

Most internal analytics tools assume one of two extremes: either the user writes SQL or the product gives them a shallow dashboard that cannot answer new questions. I wanted a middle path: conversational access to real data with enough structure and guardrails to stay useful.

Role and Ownership

I built the backend, query-routing logic, LLM provider layer, data profiling flow, and the product framing for how non-technical users ask analytical questions.

Why now

The recent wave of LLM tooling made the interface side easier, but it also made me care more about routing, fallbacks, and quality boundaries. The interesting work was not 'chat with CSVs'; it was deciding when not to use an agent.

Topline

Constraints

Simple questions needed fast direct answers instead of agent overhead.
Complex analytical questions still needed decomposition and synthesis.
The system had to stay usable even when one model provider failed.

Outcomes

A conversational analytics prototype that routes between direct SQL and agentic reasoning.
A reusable provider abstraction with fallback between Claude, OpenAI, and local models.


Architecture

Input

Upload a CSV and ask a question in plain English

The product starts from raw tabular data and the exact question the user wants answered.

Prepare

Profile the dataset before reasoning

Schema, nulls, and shape context are captured so the system knows what kind of data it is handling.
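As a rough illustration of what a profiling pass could capture, here is a minimal sketch using only the Python standard library. The function names (`infer_type`, `profile_csv`), the type labels, and the sample data are all hypothetical; the actual profiler may work differently and collect more (for example, distributions).

```python
import csv
import io


def infer_type(values):
    """Guess a column type from its non-empty string values."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "unknown"
    # Try the strictest numeric type first, then fall back.
    for name, cast in (("integer", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return name
        except ValueError:
            continue
    return "text"


def profile_csv(text):
    """Return row count plus per-column type, null count, and distinct count."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys()) if rows else []
    profile = {"row_count": len(rows), "columns": {}}
    for col in columns:
        # Short rows yield None for missing cells; treat those as empty.
        values = [(row[col] or "") for row in rows]
        profile["columns"][col] = {
            "type": infer_type(values),
            "nulls": sum(1 for v in values if v.strip() == ""),
            "distinct": len({v for v in values if v.strip() != ""}),
        }
    return profile


# Made-up sample data for illustration only.
SAMPLE = "product,revenue,region\nwidget,120.5,east\ngadget,,west\nwidget,80,east\n"
PROFILE = profile_csv(SAMPLE)
```

The resulting dictionary is small enough to hand to a later step as context about the dataset.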

Route

Decide whether the question is simple or complex

Easy asks should go straight to SQL; harder asks earn a multi-step reasoning loop.
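One way to picture the routing decision is a small keyword heuristic. This is only a sketch: the cue lists, the `route` function, and the choice to send unrecognized phrasing to the agent path are assumptions made for illustration, and the real intent layer may use a model or different rules.

```python
# Hypothetical cue lists; not taken from the project.
COMPLEX_CUES = ("why", "compare", "trend", "correlat", "explain", "forecast", "versus", " vs ")
SIMPLE_CUES = ("how many", "count", "total", "sum", "average", "top", "list", "max", "min")


def route(question):
    """Return "agent" for multi-step asks and "sql" for direct ones."""
    q = " " + question.lower().strip() + " "
    # Check complex cues first so "why is the total down" is not treated as simple.
    if any(cue in q for cue in COMPLEX_CUES):
        return "agent"
    if any(cue in q for cue in SIMPLE_CUES):
        return "sql"
    # Assumption: unknown phrasing goes to the heavier path rather than a guessed query.
    return "agent"


SIMPLE_ROUTE = route("How many orders were placed in March?")
COMPLEX_ROUTE = route("Why did revenue drop in the west region?")
```

Substring matching like this is crude (it would match "top" inside "stop"), which is part of why a router of this kind stays a judgment call rather than a solved problem.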

Answer

Return the result with explanation and charts

The user gets an answer that feels analytical rather than like a generic chat response.

Simple path

Direct SQL in DuckDB handles fast counts, filters, and aggregations.
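The direct path can be sketched as a single read-only query against an in-memory table. To keep the example runnable with no extra installs, it uses Python's built-in `sqlite3` as a stand-in for DuckDB; the `sales` table, the helper names, and the SELECT-only guard are assumptions for illustration, not the project's code.

```python
import sqlite3


def load_rows(rows):
    """Create an in-memory table and load the given rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (product TEXT, revenue REAL, region TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    return conn


def run_direct(conn, sql):
    """Run one statement and return all rows; reject anything that is not a SELECT."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("direct path only runs SELECT statements")
    return conn.execute(sql).fetchall()


# Made-up rows for illustration only.
conn = load_rows([
    ("widget", 120.5, "east"),
    ("gadget", None, "west"),
    ("widget", 80.0, "east"),
])
COUNT = run_direct(conn, "SELECT COUNT(*) FROM sales WHERE region = 'east'")
TOTALS = run_direct(
    conn,
    "SELECT product, SUM(revenue) FROM sales GROUP BY product ORDER BY product",
)
```

There is no planning step here at all, which is the point of the simple path: one generated statement, one execution, one result.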

Complex path

A ReAct loop plans, queries, and synthesizes when the question needs more reasoning.
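The plan, query, synthesize cycle can be sketched as a bounded loop. In this illustration the planner is a scripted stand-in for an LLM and the query runner is a dictionary lookup; `react_loop`, the step format, and the step budget are all hypothetical.

```python
def react_loop(question, plan_step, run_query, max_steps=5):
    """Alternate between planning and querying until the planner returns an answer."""
    observations = []
    for _ in range(max_steps):
        step = plan_step(question, observations)
        if step["action"] == "answer":
            return step["text"], observations
        # The planner asked for a query: run it and record what came back.
        observations.append((step["query"], run_query(step["query"])))
    raise RuntimeError("no answer within the step budget")


def scripted_planner(question, observations):
    """Stand-in for an LLM: asks for two totals, then synthesizes an answer."""
    if len(observations) == 0:
        return {"action": "query", "query": "east_total"}
    if len(observations) == 1:
        return {"action": "query", "query": "west_total"}
    east, west = observations[0][1], observations[1][1]
    return {"action": "answer", "text": f"East leads West by {east - west}."}


# Made-up query results keyed by query name, in place of a real engine.
FAKE_RESULTS = {"east_total": 200, "west_total": 150}
ANSWER, TRACE = react_loop(
    "Which region leads and by how much?",
    scripted_planner,
    FAKE_RESULTS.__getitem__,
)
```

The step budget matters: without it, a planner that never commits to an answer would loop forever.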

Outcome: Non-technical users can interrogate real data without writing SQL and without forcing every question through an overbuilt agent loop.

Architecture narrative

The core architecture is hybrid by design. Uploaded data lands in DuckDB after profiling. A lightweight intent layer decides whether the request is simple enough for direct SQL or complex enough to go through an agentic reasoning loop. The LLM layer sits behind a provider abstraction with fallbacks so the product can keep answering even when one provider degrades.

System diagram

This is the secondary view: the system shape behind the flow above. It exists to explain the moving parts, not to substitute for the product story.

Input

User question

plain-English analytics

CSV upload

tabular source data

Preparation

Data profiler

schema, nulls, distributions

Intent router

simple vs complex path

Execution

Direct SQL path

fast answers in DuckDB

ReAct loop

plan, query, synthesize

Provider layer

Claude, OpenAI, Ollama

Output

Answer + charts

query result, explanation, viz

Routing principle

One path for every question either slows simple queries down or underpowers complex ones.

Provider resilience

The model layer is abstracted so failures or rate limits do not collapse the product.
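A fallback chain of this kind can be sketched in a few lines. The provider callables below are stubs rather than real SDK calls, and `ProviderError`, `complete_with_fallback`, and the provider order are assumptions made for illustration.

```python
class ProviderError(Exception):
    """Raised by a provider wrapper on failure, timeout, or rate limiting."""


def complete_with_fallback(prompt, providers):
    """Try each (name, callable) in order; return (name, text) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            # Remember why this provider failed, then move on to the next one.
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


def flaky_claude(prompt):
    """Stub that simulates a rate-limited provider."""
    raise ProviderError("rate limited")


def stub_openai(prompt):
    """Stub that simulates a healthy provider."""
    return f"echo: {prompt}"


def stub_local(prompt):
    """Stub that simulates a local model."""
    return f"local: {prompt}"


USED, TEXT = complete_with_fallback(
    "Summarize revenue by region",
    [("claude", flaky_claude), ("openai", stub_openai), ("ollama", stub_local)],
)
```

Because callers only see `(name, text)`, swapping or reordering vendors does not touch the routing or query code.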

Key Decisions

Hybrid query routing

Do not force all questions through an agent loop.

Simple count, filter, and aggregation queries are faster and more reliable when translated directly.

The routing layer adds complexity, but it keeps the user experience honest.

DuckDB as the analytics core

Use DuckDB instead of a transactional database as the primary query engine.

The product is fundamentally analytical, so columnar execution and local speed matter more than OLTP patterns.

It is a narrower fit, but the fit is much better for ad hoc analytics.

Provider abstraction

Treat LLM vendors as interchangeable infrastructure rather than as the product itself.

Reliability and cost matter too much to tie the experience to one provider.

Slightly more plumbing up front, much better operational control later.

Build Journey

Plain-English analytics MVP

Started from the user problem: helping non-technical people query data without writing SQL.

Routing refinement

The architecture matured when I stopped pretending all questions deserved the same execution path.

Provider resilience

Fallback logic turned the system from a prototype into something that could survive real provider instability.

Struggles

Early agent-heavy designs made easy questions feel slower and more fragile than they should have.

I split the paths so simple questions can stay direct and fast.

LLMs answered better once they understood dataset shape, but that context was missing at first.

I added profiling ahead of query generation so the model sees schema and quality context before reasoning.
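To make the idea concrete, here is a minimal sketch of folding profile output into the text a model sees before it writes a query. The profile shape, the `build_prompt` name, and the wording of the prompt are hypothetical.

```python
def build_prompt(question, profile):
    """Prefix the user's question with schema and data-quality context."""
    lines = [f"The table has {profile['row_count']} rows and these columns:"]
    for name, info in profile["columns"].items():
        lines.append(f"- {name}: {info['type']}, {info['nulls']} null values")
    lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Made-up profile for illustration only.
EXAMPLE_PROFILE = {
    "row_count": 3,
    "columns": {
        "product": {"type": "text", "nulls": 0},
        "revenue": {"type": "float", "nulls": 1},
    },
}
PROMPT = build_prompt("What is the total revenue?", EXAMPLE_PROFILE)
```

With the null count in view, a model has a chance to mention that one revenue value is missing instead of silently summing around it.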

Learnings

The smartest architecture is often the one that knows when not to invoke heavy reasoning.

Data products need quality context before language interfaces can be trusted.

Provider redundancy is part of product design, not just ops hygiene.

Latest Work

Latest evolution

Curated baseline

Canonical project snapshot

dataGenie started from a real frustration: every time a non-technical teammate needed data from a CSV, they'd either ask an engineer to write a query or struggle with Excel pivot tables. I wanted to build something where you could just ask "what were the top 5 products by revenue last quarter?" and get back a proper answer with a chart.…