When Your Pipeline Breaks Because the File Keeps Changing

Some time ago, I was working with a client who needed a lightweight data pipeline. They provided a sample data file that I needed to load into a CloudSQL for Postgres database hosted on GCP. The design was straightforward:

  • A file would be uploaded to a GCS bucket

  • This would trigger a Cloud Function

  • The Cloud Function would parse the file and load the data into a PostgreSQL table

Simple enough.

When the Unexpected Happens

The pipeline ran without issue for about a week - then I got an alert. A file had failed to load.

After some investigation, I found the problem: The structure of the file had changed.

The number of columns, their names, and their order were all different from the sample I was originally given. I flagged the issue with the team providing the files and was assured it wouldn't happen again.

A few days later: it happened again.

And again.

Each time, the incoming file would have different columns, and the pipeline would fail with a schema mismatch or a parsing error.

Fighting Schema Drift with JSONB

Rather than trying to chase a moving target or rewrite parsing logic every time the file changed, I decided to flip the model:

New Approach:

  1. Treat the file as semi-structured data

  2. Parse each row into a JSON object

  3. Load the resulting JSON into a jsonb column in a PostgreSQL table (the landing table is sketched below)

  4. Use Postgres stored procedures to extract only the fields I care about

This allowed me to:

  • Accept whatever they sent

  • Store it without error

  • Extract the required elements after the fact

  • Handle schema evolution gracefully without breaking the pipeline
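
In practice, the landing side of this can be a single table with a jsonb column. Here is a minimal sketch (the table and column names match the query examples later in this post, not the client's actual schema):

-- Landing table: one row per input record, no assumptions about shape
CREATE TABLE raw_data (
    id          bigserial PRIMARY KEY,
    jsonb_data  jsonb NOT NULL
);

-- Ingest: each parsed row goes in as a JSON object, whatever its columns were
INSERT INTO raw_data (jsonb_data)
VALUES ('{"customer_id": "C-1001", "status": "completed", "order_details": {"total": "49.99"}}');

Because the payload column has no fixed shape, an extra field or a reordered file never causes the INSERT itself to fail.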

PostgreSQL as Your NoSQL Solution

PostgreSQL is incredibly powerful as a NoSQL database. The jsonb type isn't just a backup plan for messy data - it's a first-class citizen that gives you:

  • Schema flexibility without sacrificing SQL power

  • Rich JSON querying with operators like ->, ->>, @>, ?, and ?&

  • Indexing on JSON fields using GIN indexes for fast queries (example below)

  • JSON aggregation functions for complex analytics

  • The best of both worlds - relational structure where you need it, document flexibility where you don't

You can query JSON data as naturally as traditional columns:

SELECT
    jsonb_data->>'customer_id' AS customer_id,
    jsonb_data->'order_details'->>'total' AS order_total
FROM raw_data
WHERE jsonb_data @> '{"status": "completed"}';
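
When those containment and key lookups need to be fast, a GIN index on the jsonb column covers the @>, ?, ?| and ?& operators, and the usual aggregate functions work on extracted values too. A quick sketch against the same illustrative raw_data table:

-- GIN index speeds up containment (@>) and key-existence (?) queries
CREATE INDEX idx_raw_data_jsonb ON raw_data USING GIN (jsonb_data);

-- Aggregation example: completed-order totals per customer
SELECT
    jsonb_data->>'customer_id'                            AS customer_id,
    SUM((jsonb_data->'order_details'->>'total')::numeric) AS total_spent
FROM raw_data
WHERE jsonb_data @> '{"status": "completed"}'
GROUP BY jsonb_data->>'customer_id';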

Monitoring Data Quality at Scale

Of course, with great flexibility comes the responsibility to monitor data quality more carefully. When you're accepting "whatever they send," you need robust monitoring to catch issues downstream.

I covered building a comprehensive data quality monitoring system in a previous post, but the key principles here are:

  • Profile your JSON data regularly - track which fields are present, their data types, and value distributions (see the sample query below)

  • Set up alerts for missing critical fields - just because you can accept anything doesn't mean you should ignore when important data disappears

  • Monitor schema evolution - track when new fields appear or existing ones change format

  • Implement data freshness checks - ensure your extraction logic stays current with the actual data structure

The beauty of the JSONB approach is that your monitoring can evolve alongside your data without breaking your ingestion pipeline.
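
A simple starting point, and only a sketch of the much fuller monitoring described in that post, is to profile which top-level keys actually arrive and how often, so new or disappearing fields stand out:

-- Count how often each top-level key appears across ingested rows
SELECT
    key,
    COUNT(*) AS occurrences
FROM raw_data,
     LATERAL jsonb_object_keys(jsonb_data) AS key
GROUP BY key
ORDER BY occurrences DESC;

Comparing these counts over time is often enough to notice schema drift before an extraction step silently starts returning NULLs.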

Why This Worked

  • PostgreSQL's jsonb type is perfect for this kind of flexibility

  • I no longer had to enforce or validate column order/structure during ingest

  • If the structure of the file changed, the pipeline would still succeed

  • Any required fields could be extracted, versioned, or even defaulted downstream

  • PostgreSQL performs as well as dedicated NoSQL solutions for most use cases

In short: I stopped expecting structure at ingest and pushed structure downstream.
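
Concretely, "pushing structure downstream" meant a small extraction step run after load. In the real pipeline this lived in a stored procedure; the simplified sketch below (with an illustrative orders target table) shows the idea: pull out only the fields the business needs and default the ones that are missing, rather than failing on them.

-- Typed downstream table: only the fields the business cares about
CREATE TABLE orders (
    customer_id  text NOT NULL,
    order_total  numeric NOT NULL DEFAULT 0,
    status       text NOT NULL DEFAULT 'unknown'
);

-- Extraction: tolerate missing fields instead of failing on them
INSERT INTO orders (customer_id, order_total, status)
SELECT
    jsonb_data->>'customer_id',
    COALESCE((jsonb_data->'order_details'->>'total')::numeric, 0),
    COALESCE(jsonb_data->>'status', 'unknown')
FROM raw_data
WHERE jsonb_data ? 'customer_id';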

Final Thoughts

If you're dealing with third-party files or semi-structured data where the schema is unstable or outside your control, store it as-is. Validate and extract later.

PostgreSQL's jsonb makes this approach not just viable, but often superior to traditional NoSQL solutions. You get schema flexibility with the reliability, ACID compliance, and rich querying capabilities that PostgreSQL provides.

Don't overlook PostgreSQL when considering NoSQL use cases - it might be the only database you need.

Bonus Tip

I also added a logging step that would store the original file name and a timestamp, just in case I needed to trace or reprocess specific loads. This metadata becomes even more valuable when you're storing flexible JSON data.
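
If you fold that metadata into the landing table itself, tracing or reprocessing a specific load becomes a simple filter. A sketch, extending the illustrative raw_data table from earlier (the file name here is just an example):

-- Track where each row came from and when it arrived
ALTER TABLE raw_data
    ADD COLUMN source_file text,
    ADD COLUMN loaded_at   timestamptz NOT NULL DEFAULT now();

-- Audit or reprocess a specific load by file name
SELECT id, jsonb_data
FROM raw_data
WHERE source_file = 'orders_2024-05-01.csv';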
