Simplifying Database Extraction
If you've ever needed to regularly extract data from production databases and load it into a data warehouse or analytics platform, you know the pain. You're juggling database connections, managing extraction windows, formatting data, handling cloud storage APIs, and somehow keeping everything running smoothly on a schedule. It's tedious, error-prone, and takes time away from actually analyzing data.
That's exactly why I built the Flexible Database Extractor.
https://github.com/steveoliai/flexible_extract
The Problem: Batch Data Extraction Is Harder Than It Should Be
Most organizations need to regularly extract operational data from their production databases. Maybe you're:
Loading daily transaction data into BigQuery or Snowflake
Syncing customer records to your analytics platform
Creating data snapshots for reporting and compliance
Moving data from legacy systems to modern cloud data warehouses
The typical approach involves writing custom scripts for each database, manually handling time windows, wrestling with different cloud storage SDKs, and hoping nothing breaks when you schedule it. Each new data source means more custom code to maintain.
The Solution: A Unified Extraction Tool That Just Works
The Flexible Database Extractor eliminates this complexity with a single, configuration-driven tool that handles the entire extraction pipeline:
1. Works With Your Existing Databases
Whether your data lives in PostgreSQL, MySQL, Oracle, or SQL Server, the tool connects the same way: you declare the database type and connection details in configuration instead of writing separate extraction logic for each engine.
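As a quick illustration, switching databases should mostly be a matter of changing the connection block. The sketch below mirrors the PostgreSQL example later in this post; the mysql type value and the host are placeholders of mine, so check the project README for the exact supported values.

source_database:
  type: mysql                        # placeholder; the README lists the supported database types
  host: mysql.internal.example.com   # hypothetical host
  database: ecommerce
  user: readonly_user
  password: ${DB_PASSWORD}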
2. Intelligent Time-Based Extraction
The tool is designed for ongoing batch extracts. Configure it once with a "look-back window" (like "last 3 days"), and it automatically extracts data through yesterday every time it runs. No manual date calculations, no missed data, no duplicates.
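Here is a sketch of how a look-back window might be declared, using the source fields from the full example later in this post. The date arithmetic in the comment assumes the window means "the N full days ending yesterday," as described above.

sources:
  - name: recent_orders
    schema: sales
    table: orders
    timestamp_column: created_at   # column the tool filters on to bound each run
    last_n_days: 3                 # e.g. a run on 2025-06-10 would cover 2025-06-07 through 2025-06-09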
3. Cloud-Native Storage Support
Export directly to Google Cloud Storage, AWS S3, or Azure Blob Storage - or save locally if you prefer. The tool handles all the cloud provider APIs, authentication, and upload logic for you.
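For illustration only, an S3 target would presumably look analogous to the gcs block in the example below, with provider-specific settings nested under their own key. The s3 key names here are my assumptions rather than the tool's documented schema.

target:
  type: s3                          # assumed value, by analogy with the gcs target shown later
  object_name: "exports/{{schema}}/{{table}}/{{yesterday}}/run={{run_ts}}.parquet"
  s3:
    bucket: your-s3-bucket          # hypothetical keys; confirm against the README
    region: us-east-1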
4. Optimized Parquet Output
Data is automatically exported in Parquet format with configurable compression. This means smaller files, faster queries, and seamless integration with modern data warehouses. You can either load these files directly or add them as external tables - your choice.
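The post mentions configurable compression but doesn't show the setting, so the keys below are only a guess at what it might look like; snappy and zstd are the Parquet codecs most warehouses handle well.

output:
  format: parquet        # hypothetical keys illustrating the idea of a configurable codec
  compression: snappy    # zstd trades a little CPU for smaller files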
5. Flexible Enough for Real-World Needs
Beyond simple table dumps, you can use custom SQL queries to extract exactly what you need. Need to join tables, filter specific statuses, or pre-aggregate data? Just write the query and let the tool handle the rest.
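Here is a sketch of what a query-based source could look like. The query key is my assumption about the config schema; the other fields mirror the documented example later in the post.

sources:
  - name: paid_orders_daily
    query: >                        # hypothetical key for a custom SQL source
      SELECT o.id, o.customer_id, o.total, o.created_at
      FROM sales.orders o
      WHERE o.status = 'paid'
    timestamp_column: created_at    # still used to bound the look-back window
    last_n_days: 1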
Who Should Use This?
This tool is perfect for:
Data Engineers building ELT pipelines who need reliable, repeatable data extraction
Analytics Teams who want to get operational data into their warehouse without writing custom code
DevOps Teams managing data migrations or creating backup workflows
Anyone tired of maintaining multiple extraction scripts across different databases and cloud platforms
A Real-World Example
Let's say you need to extract daily orders from your PostgreSQL database and load them into BigQuery. Your configuration file looks like this:
source_database:
  type: postgresql
  host: prod-db.example.com
  database: ecommerce
  user: readonly_user
  password: ${DB_PASSWORD}   # supplied via an environment variable rather than hard-coded

sources:
  - name: daily_orders
    schema: sales
    table: orders
    timestamp_column: created_at
    last_n_days: 1

target:
  type: gcs
  object_name: "exports/{{schema}}/{{table}}/{{yesterday}}/run={{run_ts}}.parquet"   # {{...}} placeholders are filled in per run
  gcs:
    bucket: your-gcs-bucket
    service_account_key_file: path/to/service-account-key.json
That's it. Run it daily via cron or your orchestration tool of choice, and you'll have fresh order data in GCS every morning, automatically partitioned by date and ready to load into BigQuery.
Design Philosophy: Configuration Over Code
The key insight behind this tool is simple: data extraction patterns are remarkably similar across organizations. You're almost always extracting data based on time windows, you need it in a columnar format, and you want it in cloud storage.
Rather than writing this logic repeatedly, the Flexible Database Extractor lets you declare what you want to extract via configuration, while it handles how to extract it. This means:
Less code to write and maintain
Consistent extraction patterns across all your data sources
Easy onboarding for new team members
Quick changes when requirements evolve
Getting Started
The tool is straightforward to set up:
Install via pip
Create a YAML configuration file for your extraction (a local-output variant is sketched after this list)
Run it manually or schedule it
Point your data warehouse at the Parquet files
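If you'd rather start locally before wiring up cloud credentials, only the target block needs to change relative to the GCS example above. The local keys here are assumptions on my part, so confirm the real option names in the README.

target:
  type: local                       # assumed value for the "save locally" option
  object_name: "exports/{{schema}}/{{table}}/{{yesterday}}/run={{run_ts}}.parquet"
  local:
    path: ./exports                 # hypothetical key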
Pro tip: Always create a read-only database user specifically for extractions. It's a simple security best practice that prevents accidental data modifications.
The Bottom Line
If you're manually cobbling together database extraction scripts, or if you're paying for an expensive ETL platform just to move data from databases to cloud storage, give the Flexible Database Extractor a try. It's designed to do one thing really well: get your data out of operational databases and into formats and locations where you can actually use it for analytics.
Sometimes the best tool is the one that takes a common, repetitive task and makes it boring - in a good way. Your time is better spent analyzing data than wrestling with extraction pipelines.