Introduction
Understanding how DuckLake stores and manages your data is crucial for maintaining a robust data lake. This vignette explains:
- The two-component architecture of DuckLake (catalog and storage)
- What types of files are created and how to inspect them
- Best practices for choosing storage locations
- How to implement backup and recovery strategies
DuckLake’s Two-Component Architecture
DuckLake separates data management into two distinct components:
Catalog (Metadata): A database file (DuckDB, SQLite, or PostgreSQL) that stores all metadata about your tables, snapshots, transactions, and data file locations. This is typically small but critically important.
Storage (Data Files): A directory containing immutable Parquet files that hold your actual data. DuckLake never modifies existing files—it only creates new ones.
This separation provides several benefits:
- Simplified consistency: Since files are never modified, caching and replication are straightforward
- Flexible storage options: Store metadata locally and data in the cloud, or vice versa
- Independent backup strategies: Each component can be backed up differently based on your needs
Storage Options
DuckLake works with any filesystem backend that DuckDB supports, including:
- Local files and folders: Fast access, ideal for single-machine workflows
- Cloud object storage:
- AWS S3 (and S3-compatible services like Cloudflare R2, MinIO)
- Google Cloud Storage
- Azure Blob Storage
- Network-attached storage: NFS, SMB, FUSE-based filesystems
Storage Patterns
Here is how some common storage patterns might look:
# Local storage - fastest, but not shared
attach_ducklake(
ducklake_name = "local_lake",
lake_path = "~/data/my_ducklake"
)
# Cloud storage - scalable, accessible from anywhere
attach_ducklake(
ducklake_name = "cloud_lake",
lake_path = "s3://my-bucket/ducklake",
data_path = "s3://my-bucket/ducklake/data"
)
# Hybrid approach - metadata local, data in cloud
attach_ducklake(
ducklake_name = "hybrid_lake",
lake_path = "~/data/metadata",
data_path = "s3://my-bucket/data"
)
Key considerations:
- Latency vs. accessibility: Local storage is fast but not shareable; cloud storage is accessible but has higher latency
- Scalability vs. cost: Object stores scale easily but may charge for data transfer
- Security: Consider using DuckLake’s encryption features for cloud storage
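To make the security point concrete, here is a minimal sketch of the underlying DuckDB SQL for registering S3 credentials and attaching an encrypted lake. It uses a throwaway DBI connection purely for illustration; attach_ducklake() manages its own connection, and the ENCRYPTED attach option is taken from DuckLake's documentation, so verify it against the DuckLake version you have installed.
library(DBI)

# Throwaway connection, used only to show the SQL that runs under the hood
con <- dbConnect(duckdb::duckdb())

# Extensions needed for S3 access and for DuckLake itself
dbExecute(con, "INSTALL httpfs")
dbExecute(con, "LOAD httpfs")
dbExecute(con, "INSTALL ducklake")
dbExecute(con, "LOAD ducklake")

# Register S3 credentials with DuckDB's secrets manager
dbExecute(con, "
  CREATE SECRET my_s3 (
    TYPE S3,
    KEY_ID 'AKIA...',
    SECRET '...',
    REGION 'us-east-1'
  )
")

# Attach a lake whose data files are written encrypted to S3
# (the ENCRYPTED option is described in DuckLake's documentation)
dbExecute(con, "
  ATTACH 'ducklake:metadata.ducklake' AS secure_lake
    (DATA_PATH 's3://my-bucket/data/', ENCRYPTED)
")

dbDisconnect(con, shutdown = TRUE)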
Inspecting DuckLake Files
Let’s create a sample DuckLake and explore what files it generates:
# Create a temporary directory for our demo
lake_dir <- file.path(vignette_temp_dir, "storage_demo")
dir.create(lake_dir, showWarnings = FALSE, recursive = TRUE)
# Install ducklake extension
install_ducklake()
# Create and populate a DuckLake
attach_ducklake(
ducklake_name = "demo_lake",
lake_path = lake_dir
)
# Add some data with transactions
with_transaction(
create_table(mtcars[1:15, ], "cars"),
author = "Demo User",
commit_message = "Initial load"
)
#> Transaction started
#> Transaction committed
#> Snapshot metadata updated
with_transaction(
get_ducklake_table("cars") |>
mutate(hp_per_cyl = hp / cyl) |>
replace_table("cars"),
author = "Demo User",
commit_message = "Add hp_per_cyl metric"
)
#> Transaction started
#> Transaction committed
#> Snapshot metadata updated
with_transaction(
get_ducklake_table("cars") |>
mutate(mpg_adjusted = if_else(cyl == 4, mpg * 1.1, mpg)) |>
replace_table("cars"),
author = "Demo User",
commit_message = "Add adjusted MPG for 4-cylinder cars"
)
#> Transaction started
#> Transaction committed
#> Snapshot metadata updated
Catalog Files
The catalog is a single database file containing all metadata:
dir_tree(lake_dir)
#> /tmp/RtmpY4o8rb/storage_backups_vignette/storage_demo
#> ├── demo_lake.ducklake
#> ├── demo_lake.ducklake.wal
#> └── main
#> └── cars
#> ├── ducklake-019c4446-cd59-7c85-9621-ed0cb76d34a3.parquet
#> ├── ducklake-019c4446-ce3c-7a33-9c21-a15ed1de2dfb.parquet
#> └── ducklake-019c4446-ce8c-7120-ab58-bc4351e1a2f2.parquet
The catalog files (demo_lake.ducklake and its .wal write-ahead log) contain all metadata about tables, snapshots, and transactions.
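Because the catalog is an ordinary DuckDB database file, you can open it directly with DBI if you ever need to inspect it. A minimal sketch: open it read-only, and only when no other process has the lake attached; the metadata table names you will see are DuckLake internals and may change between versions.
# Open the catalog read-only and list DuckLake's internal metadata tables.
# Only do this when the lake is not attached elsewhere, to avoid lock conflicts.
con <- DBI::dbConnect(
  duckdb::duckdb(),
  dbdir = file.path(lake_dir, "demo_lake.ducklake"),
  read_only = TRUE
)
DBI::dbListTables(con)
DBI::dbDisconnect(con, shutdown = TRUE)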
Storage (Data) Files
Data files are stored in Parquet format in a structured directory:
# Data files are organized by schema and table
main_dir <- file.path(lake_dir, "main")
dir_tree(main_dir, recurse = 2)
#> /tmp/RtmpY4o8rb/storage_backups_vignette/storage_demo/main
#> └── cars
#> ├── ducklake-019c4446-cd59-7c85-9621-ed0cb76d34a3.parquet
#> ├── ducklake-019c4446-ce3c-7a33-9c21-a15ed1de2dfb.parquet
#> └── ducklake-019c4446-ce8c-7120-ab58-bc4351e1a2f2.parquet
# Get details about parquet files
parquet_files <- dir_ls(main_dir, recurse = TRUE, regexp = "\\.parquet$")
for (f in parquet_files) {
cat(sprintf(" %s (%s bytes)\n",
path_file(f),
file.size(f)))
}
#> ducklake-019c4446-cd59-7c85-9621-ed0cb76d34a3.parquet (2271 bytes)
#> ducklake-019c4446-ce3c-7a33-9c21-a15ed1de2dfb.parquet (2462 bytes)
#> ducklake-019c4446-ce8c-7120-ab58-bc4351e1a2f2.parquet (2682 bytes)
Understanding File Organization
Each table’s data is organized by schema and table, with each transaction creating new Parquet files:
# List all snapshots to see the version history
snapshots <- list_table_snapshots("cars")
snapshots |>
select(snapshot_id, author, commit_message)
#> snapshot_id author commit_message
#> 2 1 Demo User Initial load
#> 3 2 Demo User Add hp_per_cyl metric
#> 4 3 Demo User Add adjusted MPG for 4-cylinder cars
The key insight is that DuckLake never modifies or deletes existing Parquet files. Each change creates new files, preserving the complete history for time travel queries.
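Those retained snapshots are what make time travel possible. As a minimal sketch, assuming get_ducklake_table() accepts a snapshot identifier (the snapshot_id argument shown here is hypothetical; check the function's documentation for the actual interface):
# Hypothetical: read the cars table as it looked at the first snapshot
# (the snapshot_id argument name may differ in your version)
get_ducklake_table("cars", snapshot_id = 1) |>
  head()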
Backup Strategies
Backing Up the Catalog
The catalog is the most critical component—it maps snapshots to data files. Regular backups are essential.
Simple File Copy
For local databases, the simplest backup is a file copy:
# Create backup directory
backup_dir <- file.path(lake_dir, "backups")
dir.create(backup_dir, showWarnings = FALSE)
# Copy the catalog file to create a backup
file.copy(
from = file.path(lake_dir, "demo_lake.ducklake"),
to = file.path(backup_dir, "demo_lake.ducklake")
)
#> [1] TRUE
# Copy the data directory as well
dir_copy(
path = file.path(lake_dir, "main"),
new_path = file.path(backup_dir, "main")
)
# Verify the backup was created
dir_tree(backup_dir)
#> /tmp/RtmpY4o8rb/storage_backups_vignette/storage_demo/backups
#> ├── demo_lake.ducklake
#> └── main
#> └── cars
#> ├── ducklake-019c4446-cd59-7c85-9621-ed0cb76d34a3.parquet
#> ├── ducklake-019c4446-ce3c-7a33-9c21-a15ed1de2dfb.parquet
#> └── ducklake-019c4446-ce8c-7120-ab58-bc4351e1a2f2.parquet
# To use the backup, detach the current lake and attach to the backup
# First detach the original
detach_ducklake("demo_lake")
# Attach to the backup location
attach_ducklake(
ducklake_name = "demo_lake",
lake_path = backup_dir
)
# Verify you're working with the backup
list_table_snapshots("cars")
#> snapshot_id snapshot_time schema_version
#> 2 1 2026-02-09 21:20:16 1
#> 3 2 2026-02-09 21:20:16 2
#> 4 3 2026-02-09 21:20:16 3
#> changes
#> 2 tables_created, tables_inserted_into, main.cars, 1
#> 3 tables_created, tables_dropped, tables_inserted_into, main.cars, 1, 2
#> 4 tables_created, tables_dropped, tables_inserted_into, main.cars, 2, 3
#> author commit_message commit_extra_info
#> 2 Demo User Initial load <NA>
#> 3 Demo User Add hp_per_cyl metric <NA>
#> 4 Demo User Add adjusted MPG for 4-cylinder cars <NA>
# You can switch back to the original by detaching and reattaching
detach_ducklake("demo_lake")
attach_ducklake("demo_lake", lake_path = lake_dir)Important: Transactions committed after a backup won’t be tracked when recovering. The data will exist in the Parquet files, but the backup will point to an earlier snapshot.
Best practices:
- Back up after batch jobs complete
- For streaming/continuous updates, schedule periodic backups
- Consider using cronR or taskscheduleR for automated backups
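As a sketch of that last point, here is how a nightly backup could be scheduled with cronR on Linux/macOS. The backup.R script is assumed to attach the lake and call backup_ducklake() with your paths; taskscheduleR offers an equivalent interface on Windows.
# Schedule a nightly catalog + data backup at 02:00 via cron
# (assumes backup.R attaches the lake and calls backup_ducklake())
library(cronR)
cmd <- cron_rscript("/path/to/backup.R")
cron_add(
  command = cmd,
  frequency = "daily",
  at = "02:00",
  id = "ducklake_backup",
  description = "Nightly DuckLake backup"
)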
Backing Up Storage (Data Files)
Since Parquet files are immutable, backing up storage is straightforward.
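Immutability also means incremental backups are cheap: a data file that has already been copied never needs to be copied again. A minimal sketch using fs (the destination directory name is illustrative):
library(fs)

# Copy only parquet files that are not yet present in the backup,
# preserving the schema/table directory layout.
src <- file.path(lake_dir, "main")
dst <- file.path(lake_dir, "backups", "main_incremental")
dir_create(dst)

src_files <- dir_ls(src, recurse = TRUE, regexp = "\\.parquet$")
already_backed_up <- path_file(dir_ls(dst, recurse = TRUE, regexp = "\\.parquet$"))
new_files <- src_files[!path_file(src_files) %in% already_backed_up]

for (f in new_files) {
  rel <- path_rel(f, start = src)
  dir_create(path_dir(path(dst, rel)))
  file_copy(f, path(dst, rel))
}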
Cloud Storage Backup
For cloud storage, use provider-specific mechanisms:
AWS S3:
- Cross-bucket replication (copies objects to a different bucket automatically)
- AWS Backup service (centralized, scheduled backups)
- S3 versioning (keeps previous versions of objects)
Google Cloud Storage:
- Cross-bucket replication
- Backup and DR service
- Object versioning with soft deletes
When recovering from a replicated bucket, you only need to point data_path at it when re-attaching:
# Original
attach_ducklake(
ducklake_name = "prod_lake",
data_path = "s3://original-bucket/data"
)
# After recovery from replicated bucket
attach_ducklake(
ducklake_name = "prod_lake",
data_path = "s3://backup-bucket/data"
)
Recovery Procedures
Recovering from Catalog Backup
If your catalog is corrupted or lost:
# Restore from backup by copying the backup file back into place
# (adjust the path to match your backup naming scheme)
file.copy(
  from = file.path(lake_dir, "backups", "demo_lake.ducklake"),
  to = file.path(lake_dir, "demo_lake.ducklake"),
  overwrite = TRUE
)
# Reattach to the restored database
attach_ducklake("demo_lake", lake_path = lake_dir)
# Verify recovery by listing snapshots
list_table_snapshots("cars")Maintenance Considerations
When planning backups, coordinate with maintenance operations:
- Compaction (merging adjacent files): Run before backups to ensure consistent file layout
- Cleanup (removing obsolete files): Run before backups to avoid backing up unnecessary files
# Recommended backup sequence
# 1. Run maintenance operations (if needed)
# See maintenance vignettes for details
# 2. Ensure all transactions are committed
# (no pending work)
# 3. Back up catalog
dir.create(file.path(lake_dir, "backups"), showWarnings = FALSE)
file.copy(
from = file.path(lake_dir, "demo_lake.ducklake"),
to = file.path(lake_dir, "backups",
paste0("backup_", format(Sys.time(), "%Y%m%d_%H%M%S"), ".ducklake"))
)
# 4. Back up data files
dir_copy(
path = file.path(lake_dir, "main"),
new_path = file.path(lake_dir, "backups", "main_latest")
)
Complete Backup Example
DuckLake provides a convenient backup_ducklake()
function for creating timestamped backups:
# Create a complete backup with timestamp
backup_dir <- backup_ducklake(
ducklake_name = "demo_lake",
lake_path = lake_dir,
backup_path = file.path(lake_dir, "backups")
)
#> Catalog backed up successfully
#> Data files backed up successfully
#> Backup completed: /tmp/RtmpY4o8rb/storage_backups_vignette/storage_demo/backups/backup_20260209_212017
# The function returns the backup directory path
print(backup_dir)
#> [1] "/tmp/RtmpY4o8rb/storage_backups_vignette/storage_demo/backups/backup_20260209_212017"The backup_ducklake() function: - Creates a timestamped
backup directory - Copies the catalog database file - Copies all data
files from the main/ directory - Returns the backup directory path for
reference
Cleanup
# Detach the demo lake
detach_ducklake("demo_lake")
# Clean up temporary files
unlink(lake_dir, recursive = TRUE)
Summary
Key takeaways for managing DuckLake storage and backups:
- Understand the architecture: Catalog (metadata) and storage (data) are separate components
- Choose storage wisely: Balance latency, scalability, cost, and accessibility
- Files are immutable: DuckLake never modifies existing Parquet files
- Back up regularly: Catalog backups are critical; back up after batch jobs
- Coordinate with maintenance: Run compaction and cleanup before backups
- Test recovery procedures: Ensure you can actually restore from backups
For production systems, consider:
- Automated backup scheduling (using cronR or taskscheduleR)
- Multiple backup locations (local and cloud)
- Testing recovery procedures regularly
- Monitoring backup success and storage usage (a quick check is sketched below)
- Version control for catalog schema changes
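For the monitoring point above, a lightweight post-backup check can catch missing data files early. A minimal sketch, reusing lake_dir and the backup_dir returned by backup_ducklake() purely for illustration:
library(fs)

# Compare data files in the live lake against the most recent backup
live_files <- path_file(dir_ls(file.path(lake_dir, "main"), recurse = TRUE, regexp = "\\.parquet$"))
backed_up <- path_file(dir_ls(backup_dir, recurse = TRUE, regexp = "\\.parquet$"))
missing <- setdiff(live_files, backed_up)

if (length(missing) > 0) {
  warning("Backup is missing ", length(missing), " data file(s)")
} else {
  message("All ", length(live_files), " data files are present in the backup")
}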
