# Motivation

Why ddlglot exists and when to use it.
## Origin
ddlglot was born from a practical problem in data pipelines built with Ibis. While Ibis excels at transforming and analyzing data across multiple backends, its DDL (Data Definition Language) capabilities are limited.
When creating tables in a data warehouse, you often need features that go beyond simple schema definition:

- Partition columns with specific storage layouts
- File formats and compression settings
- Location paths for external tables
- Table properties for optimization or data lake features
- Dialect-specific syntax for Spark, BigQuery, Hive, etc.
Ibis's `create_table()` requires data to insert and does not expose a complete DDL API for these features. The common workaround is to write SQL as raw strings:
```python
# Common workaround: raw SQL strings
ddl = f"""
CREATE TABLE IF NOT EXISTS {schema}.{table} (
    id INT PRIMARY KEY,
    name VARCHAR({length}),
    created_at TIMESTAMP
) USING DELTA
PARTITIONED BY (created_at)
LOCATION '{location}'
"""
con.raw_sql(ddl)
```
This approach works, but introduces several problems.
## The Problem with Raw SQL
Writing DDL as raw strings has significant drawbacks:
- **Security risk.** Interpolating values into SQL strings can lead to SQL injection vulnerabilities if not handled carefully. Even with careful escaping, it is easy to make mistakes.
- **No validation.** Syntax errors are only discovered at runtime, when the SQL is executed against the database. Typos and dialect-specific mistakes can cause pipelines to fail.
- **No type checking.** Column names, types, and constraints are just strings. There is no autocompletion, no IDE support, and no compile-time checking.
- **Poor maintainability.** As schemas evolve, raw SQL strings become scattered across the codebase. Refactoring is error-prone and tedious.
- **Dialect lock-in.** SQL that works for Spark may not work for BigQuery or PostgreSQL. Migrating to a different backend requires rewriting all DDL strings manually.
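The injection risk is easy to reproduce: plain string interpolation embeds whatever the value contains, statement separators included. A minimal sketch (the table name here is a hypothetical attacker-controlled input):

```python
# Hypothetical attacker-controlled value reaching an f-string.
table = "events; DROP TABLE users; --"

# Naive interpolation embeds the payload verbatim into the DDL.
ddl = f"CREATE TABLE {table} (id INT)"

# The resulting string now contains a second, destructive statement
# followed by a comment that hides the rest of the original DDL.
print(ddl)
```

A builder API avoids this class of bug by construction, because identifiers and values are never spliced into a template string.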
## Why Not Use an ORM Instead?
Several ORM and migration tools exist for Python. Here is why they may not be the right fit for data engineering use cases:
| Tool | Limitation |
|---|---|
| SQLAlchemy / Alembic | Designed for application models and incremental migrations. Auto-generated migrations require constant manual editing. Partitioned tables are problematic (see SQLAlchemy #539, open since 2019). Conditional DDL per dialect is cumbersome. |
| Django ORM | Only supports PostgreSQL, SQLite, and MySQL out of the box. No support for cloud-specific features like BigQuery partitioning or Spark table formats. |
| Ibis | `create_table()` requires data to insert; no complete DDL API for partitions, locations, file formats, or table properties. |
| Raw SQL strings | Security vulnerabilities from SQL injection. No validation, type checking, or IDE support. Difficult to migrate between dialects. |
ORMs are excellent for defining application models where Python classes map to database tables. However, data engineering often requires schemas that do not map directly to Python objects:
- Staging tables for ELT pipelines
- Aggregated tables with specific partitioning schemes
- External tables pointing to data lake storage
- Temporary tables for incremental processing
In these cases, you need full control over the DDL, and ORMs fall short.
## When Not to Use ddlglot
ddlglot is not a replacement for ORM tools in application development. Do not use ddlglot when:
- You are defining domain models that map to application entities
- Your team uses Django, SQLAlchemy, or similar for application data access
- You need database introspection to compare models against existing schemas
Use ddlglot when you need complete control over DDL for data engineering use cases where ORMs do not provide the necessary features.
## The Solution: ddlglot
ddlglot provides a fluent, type-safe builder API for generating DDL statements. It uses SQLGlot under the hood to translate the generic AST to dialect-specific SQL, supporting Spark, BigQuery, Hive, PostgreSQL, DuckDB, SQLite, and more.
```python
from ddlglot import create

ddl = (
    create("table")
    .if_not_exists()
    .name("analytics.events_enriched")
    .column("event_id", "BIGINT", pk=True, not_null=True)
    .column("user_id", "BIGINT", not_null=True)
    .column("event_type", "VARCHAR(50)")
    .column("created_at", "TIMESTAMP")
    .partitioned_by("event_year", "event_month")
    .location("s3://warehouse/analytics/events_enriched/")
    .using("delta")
    .tblproperties({
        "delta.autoOptimize.optimizeWrite": "true",
    })
    .sql(dialect="spark")
)
```
### Benefits
| Aspect | Raw SQL | ORM migrations | ddlglot |
|---|---|---|---|
| Security | Injection risk | Safe | Safe (no interpolation) |
| Validation | Runtime only | Limited | Type hints + tests |
| Dialect support | Manual | Partial | Automatic via SQLGlot |
| Advanced features | Depends on you | Basic | Complete (partitions, props, etc.) |
| Testing | Difficult | Migration tests | Unit tests |
| Maintainability | Hardcoded strings | Models + migrations | Fluent builder |
## Practical Example: Ibis + ddlglot for Spark
This example shows how ddlglot complements Ibis in a Spark data pipeline. Ibis handles data transformation, while ddlglot defines the complete DDL for the destination table with features that Ibis does not support.
```python
import ibis
import ddlglot

# 1. Connect to Spark and load raw data
con = ibis.spark.connect(...)
raw = con.table("raw_events")

# 2. Define transformations with Ibis
enriched = (
    raw.select(
        raw.event_id,
        raw.user_id,
        raw.event_type,
        raw.payload,
        raw.created_at,
    )
    .filter(raw.event_type.isin(["purchase", "refund"]))
    .mutate(
        event_year=raw.created_at.year(),
        event_month=raw.created_at.month(),
    )
)

# 3. Define DDL with ddlglot
# Features that Ibis does not support: partitioning, location, format,
# table properties, etc.
ddl = (
    ddlglot.create("table")
    .name("analytics.events_enriched")
    .column("event_id", "BIGINT", pk=True, not_null=True)
    .column("user_id", "BIGINT", not_null=True)
    .column("event_type", "VARCHAR(50)")
    .column("payload", "STRING")
    .column("created_at", "TIMESTAMP")
    .column("event_year", "INT")
    .column("event_month", "INT")
    .partitioned_by("event_year", "event_month")
    .location("s3://warehouse/analytics/events_enriched/")
    .using("delta")
    .tblproperties({
        "delta.autoOptimize.optimizeWrite": "true",
        "delta.enableChangeDataFeed": "true",
    })
    .sql(dialect="spark")
)

# 4. Execute DDL to create the table
con.raw_sql(ddl)

# 5. Insert transformed data with Ibis
# (the table already exists, so insert rather than create_table)
con.insert("analytics.events_enriched", enriched)
```
This approach gives you the best of both worlds:

- **Ibis**: expressive data transformation with a unified API across backends
- **ddlglot**: complete DDL control with dialect-specific features and type safety
## Comparison with Similar Tools
Other Python tools exist for generating SQL. Here is how ddlglot compares:
| Tool | DDL focus | Multi-dialect | Fluent API |
|---|---|---|---|
| sqlglot (raw) | Yes | Yes | No (imperative AST) |
| SQLAlchemy (DDL) | Partial | Yes | Yes |
| ddlglot | Yes | Yes | Yes |
sqlglot is the engine behind ddlglot. While sqlglot provides powerful AST manipulation, its API is imperative and requires understanding the internal expression tree. ddlglot wraps sqlglot with a fluent builder interface that makes common DDL patterns simple and readable.
## What ddlglot Does NOT Do
ddlglot is a DDL generator, not an executor. It does not:
- Execute SQL against a database
- Manage database connections
- Handle migrations or schema evolution
- Provide database introspection
This is by design. Separation of concerns keeps ddlglot lightweight and backend-agnostic. You can use ddlglot alongside any database client:
```python
# Generate DDL
ddl = create("table").name("t").column("id", "INT").sql(dialect="postgres")

# Execute with your preferred client
con.raw_sql(ddl)                  # Ibis
cursor.execute(ddl)               # DBAPI2 (psycopg2, sqlite3, etc.)
with engine.begin() as conn:      # SQLAlchemy
    conn.execute(text(ddl))
# Or via a CLI tool: psql -c "$ddl"
```
This flexibility means you can integrate ddlglot into existing workflows without changing your database stack.