ducklake is an R package that brings versioned data lake infrastructure to data-intensive workflows. Built on DuckDB and DuckLake, it provides ACID transactions, automatic versioning, time travel queries, and complete audit trails.
Why DuckLake?
Many industries rely on flat-file workflows (CSV, XPT, Excel, etc.) that create significant data management challenges:
- Disconnected flat files: Related datasets stored as separate files despite being inherently relational
- Lost audit trails: No automatic tracking of who changed what and when
-
Version control gaps: Multiple dataset versions scattered across folders with unclear provenance
- Reproducibility issues: Inability to recreate analyses from specific time points
- Collaboration friction: Multiple analysts working with different versions of the same data
- Compliance challenges: Difficulty demonstrating data integrity and audit trails for regulated industries
DuckLake solves these problems by implementing a versioned data lake architecture that:
- Preserves relational structure between related datasets
- Automatically versions every data change with timestamps and metadata
- Enables time travel to recreate analyses exactly as they were run
- Provides complete audit trails with author attribution and commit messages
- Supports layered architecture (bronze/silver/gold) for data lineage from raw to analysis-ready
- Allows multiple team members to collaborate safely with shared data
Quick example: Layered data workflow
library(ducklake)
library(dplyr)
# Install the ducklake extension (requires DuckDB v1.3.0 or higher)
install_ducklake()
# Create a data lake in a temporary directory
attach_ducklake("my_data_lake", lake_path = tempdir())
# Bronze layer: Load raw data exactly as received
with_transaction(
create_table(mtcars, "vehicles_raw"),
author = "Data Engineer",
commit_message = "Initial load of raw vehicle data"
)
# Silver layer: Apply cleaning transformations
with_transaction(
get_ducklake_table("vehicles_raw") |>
mutate(cyl = as.character(cyl)) |>
create_table("vehicles_clean"),
author = "Data Engineer",
commit_message = "Clean and standardize vehicle data"
)
# Gold layer: Create analysis dataset with business logic
with_transaction(
get_ducklake_table("vehicles_clean") |>
mutate(
efficiency = case_when(
mpg < 15 ~ "Low",
mpg < 25 ~ "Medium",
TRUE ~ "High"
)
) |>
create_table("vehicles_analysis"),
author = "Data Analyst",
commit_message = "Create analysis-ready dataset with efficiency categories"
)
# Update the silver layer with additional transformations
with_transaction(
get_ducklake_table("vehicles_clean") |>
mutate(gear = as.integer(gear)) |>
replace_table("vehicles_clean"),
author = "Data Engineer",
commit_message = "Add gear type conversion to silver layer"
)
# View the analysis dataset
get_ducklake_table("vehicles_analysis") |>
select(mpg, cyl, efficiency) |>
head(3)
#> # Source: SQL [?? x 3]
#> # Database: DuckDB 1.4.4 [tgerke@Darwin 23.6.0:R 4.5.2//private/var/folders/b7/664jmq55319dcb7y4jdb39zr0000gq/T/RtmpOkTzSg/duckplyr/duckplyr2e58131c620d.duckdb]
#> mpg cyl efficiency
#> <dbl> <chr> <chr>
#> 1 21 6.0 Medium
#> 2 21 6.0 Medium
#> 3 22.8 4.0 Medium
# View complete audit trail across all layers with author and commit messages
list_table_snapshots()
#> snapshot_id snapshot_time schema_version
#> 1 0 2026-02-09 21:12:29 0
#> 2 1 2026-02-09 21:12:29 1
#> 3 2 2026-02-09 21:12:29 2
#> 4 3 2026-02-09 21:12:29 3
#> 5 4 2026-02-09 21:12:29 4
#> changes
#> 1 schemas_created, main
#> 2 tables_created, tables_inserted_into, main.vehicles_raw, 1
#> 3 tables_created, tables_inserted_into, main.vehicles_clean, 2
#> 4 tables_created, tables_inserted_into, main.vehicles_analysis, 3
#> 5 tables_created, tables_dropped, tables_inserted_into, main.vehicles_clean, 2, 4
#> author commit_message
#> 1 <NA> <NA>
#> 2 Data Engineer Initial load of raw vehicle data
#> 3 Data Engineer Clean and standardize vehicle data
#> 4 Data Analyst Create analysis-ready dataset with efficiency categories
#> 5 Data Engineer Add gear type conversion to silver layer
#> commit_extra_info
#> 1 <NA>
#> 2 <NA>
#> 3 <NA>
#> 4 <NA>
#> 5 <NA>
# Time travel: Query the silver layer as it existed at snapshot 2 (before updates)
get_ducklake_table_version("vehicles_clean", version = 2) |>
select(mpg, cyl, gear) |>
head(3)
#> # Source: SQL [?? x 3]
#> # Database: DuckDB 1.4.4 [tgerke@Darwin 23.6.0:R 4.5.2//private/var/folders/b7/664jmq55319dcb7y4jdb39zr0000gq/T/RtmpOkTzSg/duckplyr/duckplyr2e58131c620d.duckdb]
#> mpg cyl gear
#> <dbl> <chr> <dbl>
#> 1 21 6.0 4
#> 2 21 6.0 4
#> 3 22.8 4.0 4
# Clean up
detach_ducklake("my_data_lake")Medallion architecture
ducklake implements a layered data architecture (medallion pattern) that ensures data quality and traceability:
- Bronze layer (raw): Data exactly as received from source systems—preserves original data for audit trails
- Silver layer (cleaned): Standardized, cleaned data with transformations and validations—the trusted source for analysis
- Gold layer (analytics): Business-logic datasets optimized for specific analyses, dashboards, or reports
Each layer is automatically versioned, providing complete data lineage from raw source through to analysis-ready datasets. This approach enables:
- Complete audit trail: Original data preserved alongside all transformations
- Reprocessability: Reprocess from bronze if cleaning logic changes without re-extracting from source
- Data lineage: Clear progression from raw → cleaned → analysis-ready
- Validation: Compare layers to verify transformations
- Quality assurance: Separate concerns between ingestion, cleaning, and analysis
Learn more
Check out the pkgdown site for detailed vignettes:
- Clinical Trial Data Lake - Start here: Complete workflow from SDTM to ADaM with regulatory artifacts
- Cookbook - Quick recipes for common operations
- Modifying Tables - Two approaches for table modifications
- Transactions - ACID transaction support
- Time Travel - Query historical data
Key features
- Versioned data lake: Every data change automatically tracked with timestamps and metadata
- Lightweight snapshots: Create unlimited snapshots without frequent compacting steps
- Medallion architecture: Bronze/silver/gold layers for data lineage and quality
- ACID transactions: Atomic updates with concurrent access and transactional guarantees over multi-table operations
- Time travel: Query data exactly as it existed at any point in time—essential for reproducibility
- Performance-oriented: Uses Parquet columnar storage with statistics for filter pushdown, enabling fast queries on large datasets
- Schema evolution: Adapt table schemas over time as requirements change
- Tidyverse interface: Familiar dplyr syntax for data manipulation
-
Two complementary approaches:
rows_*functions for data.frames and pipeline functions for dplyr workflows - Complete audit trails: Who changed what, when, and why—suitable for regulated industries
- Seamless integration: Works with duckdb, duckplyr, and the broader R ecosystem
