Metadata-Driven Merge: A Declarative Approach to Data Integration

Building a lightweight alternative to GraphQL for hierarchical data merging using Go, with concurrent fetching and configurable merge strategies.

Krishna C

October 18, 2020

TL;DR I built a Go-based system that merges data from multiple sources using declarative templates instead of custom code. Define relationships in JSON (as metadata), request what you need at runtime, and the engine handles concurrent fetching and hierarchical merging automatically.

I kept running into the same problem: applications needing to combine data from multiple sources into a unified, hierarchical structure. Consider an employee directory that needs to merge:

Associate records from one database
Email addresses (personal and business) from another system
Phone numbers from a third source
Position history linked to companies with office addresses

Each of these has complex business logic for fetching. Different auth, different pagination, different error handling. The traditional approach? Write custom code for every permutation of data sources. But what if users want different combinations at runtime? What if relationships change? You're stuck maintaining a tangled web of join logic.

The Solution: Metadata-Driven Merge

I built a Go-based solution that takes a fundamentally different approach: define relationships declaratively, let the engine figure out the rest.

Instead of writing imperative code like:

// Don't do this for every combination...
associates := fetchAssociates()
for _, a := range associates {
    a.Emails = fetchEmails(a.ID)
    a.Positions = fetchPositions(a.ID)
    for _, p := range a.Positions {
        p.Company = fetchCompany(p.CompanyID)
    }
}

You define a merge template that describes the data hierarchy:

{
  "type": "group",
  "alias": "associate",
  "mergeType": "nestedMerge",
  "children": [
    { "type": "data_contract", "alias": "associate" },
    { "type": "group", "alias": "email", "mergeType": "nestedMerge", "children": [...] },
    { "type": "group", "alias": "position", "mergeType": "nestedMerge",
      "children": [
        { "type": "data_contract", "alias": "position" },
        { "type": "group", "alias": "company", "children": [...] }
      ]
    }
  ]
}

Then simply request what you need:

GET /associates?datasources=associate,email,position,company

The engine handles everything else.
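As a rough illustration of the first step on the server side, the requested aliases have to be pulled out of the query string before anything else can happen. The sketch below is an assumption about how that could look, not the engine's actual handler; the function name parseDatasources is hypothetical.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseDatasources (hypothetical helper) turns the comma-separated
// "datasources" query parameter into a set of requested aliases.
func parseDatasources(rawURL string) (map[string]bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}
	requested := make(map[string]bool)
	for _, alias := range strings.Split(u.Query().Get("datasources"), ",") {
		// Ignore empty entries such as a trailing comma.
		if alias = strings.TrimSpace(alias); alias != "" {
			requested[alias] = true
		}
	}
	return requested, nil
}

func main() {
	requested, err := parseDatasources("/associates?datasources=associate,email,position,company")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(requested), requested["company"])
}
```

A set makes the later "was this alias requested?" checks cheap, which is why the sketch returns a map rather than a slice.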

How It Works

1. Template Shrinking

The first clever trick: the template is an AST (Abstract Syntax Tree) that gets pruned at runtime.

If you only request associate and email, the engine recursively removes the position and company branches. This means:

No wasted fetches for unrequested data
Validation that requested sources have valid lineage (you can't request company without position)
Optimal query planning
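The pruning and lineage check described above can be sketched as a small recursive function. This is a minimal illustration under assumptions: the node type, the shrink and shrinkTemplate names, and the rule that a branch survives only if its own alias was requested are all hypothetical, chosen to match the behaviour the post describes rather than taken from the real engine.

```go
package main

import "fmt"

// node is a hypothetical in-memory form of one template entry.
type node struct {
	Type     string // "group" or "data_contract"
	Alias    string
	Children []*node
}

// shrink returns a pruned copy of n, or nil if n's alias was not requested.
// Dropping a group drops everything beneath it.
func shrink(n *node, requested map[string]bool) *node {
	if !requested[n.Alias] {
		return nil
	}
	out := &node{Type: n.Type, Alias: n.Alias}
	for _, c := range n.Children {
		if kept := shrink(c, requested); kept != nil {
			out.Children = append(out.Children, kept)
		}
	}
	return out
}

// collect records every alias that survived pruning.
func collect(n *node, seen map[string]bool) {
	if n == nil {
		return
	}
	seen[n.Alias] = true
	for _, c := range n.Children {
		collect(c, seen)
	}
}

// shrinkTemplate prunes the template and rejects requests whose aliases
// are unreachable because an ancestor was not requested.
func shrinkTemplate(root *node, requested []string) (*node, error) {
	set := make(map[string]bool, len(requested))
	for _, a := range requested {
		set[a] = true
	}
	pruned := shrink(root, set)
	kept := make(map[string]bool)
	collect(pruned, kept)
	for _, a := range requested {
		if !kept[a] {
			return nil, fmt.Errorf("datasource %q has no valid lineage in the request", a)
		}
	}
	return pruned, nil
}

// sampleTemplate mirrors the shape of the template shown earlier; the
// children elided there are filled with placeholder contracts.
func sampleTemplate() *node {
	dc := func(a string) *node { return &node{Type: "data_contract", Alias: a} }
	group := func(a string, children ...*node) *node {
		return &node{Type: "group", Alias: a, Children: children}
	}
	return group("associate",
		dc("associate"),
		group("email", dc("email")),
		group("position", dc("position"), group("company", dc("company"))),
	)
}

func main() {
	pruned, err := shrinkTemplate(sampleTemplate(), []string{"associate", "email"})
	fmt.Println(len(pruned.Children), err)

	_, err = shrinkTemplate(sampleTemplate(), []string{"associate", "company"})
	fmt.Println(err)
}
```

Requesting company without position fails here because the position group is dropped first, so the company branch is never reached.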

2. Parallel Data Fetching

Once the template is shrunk, data contracts are resolved concurrently:

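// Launched once per data contract, after a matching wg.Add(1).
go func(dc *dataContract) {
    defer wg.Done()
    data := processDataContract(dc)
    channel <- contractStoreItem{dc.Alias, data}
}(contract) // loop variable, named so it does not shadow the dataContract type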

Seven data sources? Seven goroutines. The system only waits as long as the slowest source.
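For context, here is a self-contained sketch of the full fan-out and collect pattern around that goroutine. The type and function names follow the snippet above, but their fields, the fetchAll wrapper, and the stubbed processDataContract are assumptions made so the example runs on its own.

```go
package main

import (
	"fmt"
	"sync"
)

type record = map[string]interface{}

// dataContract and contractStoreItem are simplified, assumed shapes.
type dataContract struct{ Alias string }

type contractStoreItem struct {
	Alias string
	Data  []record
}

// processDataContract stands in for the real fetch logic (auth,
// pagination, error handling), which is not shown in the post.
func processDataContract(dc *dataContract) []record {
	return []record{{"source": dc.Alias}}
}

// fetchAll resolves every contract in its own goroutine and gathers
// the results into a store keyed by alias.
func fetchAll(contracts []*dataContract) map[string][]record {
	var wg sync.WaitGroup
	// Buffered to len(contracts) so no sender blocks before wg.Wait returns.
	channel := make(chan contractStoreItem, len(contracts))

	for _, contract := range contracts {
		wg.Add(1)
		go func(dc *dataContract) {
			defer wg.Done()
			data := processDataContract(dc)
			channel <- contractStoreItem{dc.Alias, data}
		}(contract)
	}

	wg.Wait()
	close(channel)

	store := make(map[string][]record, len(contracts))
	for item := range channel {
		store[item.Alias] = item.Data
	}
	return store
}

func main() {
	store := fetchAll([]*dataContract{{Alias: "associate"}, {Alias: "email"}, {Alias: "position"}})
	fmt.Println(len(store))
}
```

The buffered channel is one simple way to avoid a deadlock between wg.Wait and the senders; an unbuffered channel drained by a separate collector goroutine would work as well.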

3. Four Merge Strategies

The real power is in how data gets combined. The system supports four merge types that handle different cardinalities:

Merge Type	Cardinality	Result
flatMerge	1-to-1	Child fields added directly to parent
objectMerge	1-to-1	Child nested as single object property
nestedMerge	1-to-Many	Children nested as array
arrayMerge	N/A	Combines data from multiple contracts at same level

This means you can express:

"Each associate has one username" → flatMerge
"Each associate has one current address" → objectMerge
"Each associate has many positions" → nestedMerge
"Emails come from personal AND business systems" → arrayMerge
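To make the four result shapes concrete, here is a small sketch of each merge type operating on generic records. The function signatures and the collision rule in flatMerge (parent fields win, the child's join key is dropped) are assumptions inferred from the table above, not the engine's actual code, and the email values are placeholders.

```go
package main

import "fmt"

type record = map[string]interface{}

// flatMerge copies child fields directly onto the parent (1-to-1).
// Assumption: the child's join key is skipped and parent fields win.
func flatMerge(parent, child record, childKey string) {
	for k, v := range child {
		if k == childKey {
			continue
		}
		if _, exists := parent[k]; !exists {
			parent[k] = v
		}
	}
}

// objectMerge nests one child as a single object property (1-to-1).
func objectMerge(parent record, alias string, child record) {
	parent[alias] = child
}

// nestedMerge nests many children as an array property (1-to-many).
func nestedMerge(parent record, alias string, children []record) {
	parent[alias] = children
}

// arrayMerge concatenates records from several contracts at the same level.
func arrayMerge(sets ...[]record) []record {
	var out []record
	for _, s := range sets {
		out = append(out, s...)
	}
	return out
}

func main() {
	associate := record{"_id": "001", "firstname": "Krishna"}

	flatMerge(associate, record{"associate_id": "001", "username": "inventivepotter"}, "associate_id")
	objectMerge(associate, "address", record{"city": "placeholder city"})

	// Placeholder email values, for illustration only.
	personal := []record{{"email": "personal@example.invalid"}}
	business := []record{{"email": "business@example.invalid"}}
	nestedMerge(associate, "email", arrayMerge(personal, business))

	fmt.Println(associate["username"], len(associate["email"].([]record)))
}
```

Note that arrayMerge runs before nestedMerge here: the two email sources are first combined into one list, and that list is then nested under the associate.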

4. Bottom-Up Tree Traversal

The merge happens from the leaves up. Company merges into position, position merges into associate. Each level uses join keys defined in the template:

{
  "parentKey": "company_id",
  "currentKey": "_id"
}
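A sketch of how such keys could drive a leaves-first join follows. It assumes parentKey names the field on the parent record and currentKey the field on the child being merged in, which matches the company example above; the associate-to-position key pair and the helper names indexBy, joinOne, and joinMany are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type record = map[string]interface{}

// indexBy groups rows by the string form of their join key so each
// parent lookup is a single map access.
func indexBy(rows []record, key string) map[string][]record {
	idx := make(map[string][]record)
	for _, r := range rows {
		if v, ok := r[key]; ok {
			k := fmt.Sprint(v)
			idx[k] = append(idx[k], r)
		}
	}
	return idx
}

// joinOne attaches the first matching child as a single object.
func joinOne(parents, children []record, parentKey, currentKey, alias string) {
	idx := indexBy(children, currentKey)
	for _, p := range parents {
		if v, ok := p[parentKey]; ok {
			if m := idx[fmt.Sprint(v)]; len(m) > 0 {
				p[alias] = m[0]
			}
		}
	}
}

// joinMany attaches all matching children as an array.
func joinMany(parents, children []record, parentKey, currentKey, alias string) {
	idx := indexBy(children, currentKey)
	for _, p := range parents {
		if v, ok := p[parentKey]; ok {
			if m := idx[fmt.Sprint(v)]; len(m) > 0 {
				p[alias] = m
			}
		}
	}
}

func main() {
	associates := []record{{"_id": "001", "firstname": "Krishna"}}
	positions := []record{{"associate_id": "001", "company_id": "c01", "name": "Sr. Engineer"}}
	companies := []record{{"_id": "c01", "name": "Acme Corp"}}

	// Leaves first: company into position, then position into associate.
	joinOne(positions, companies, "company_id", "_id", "company")
	joinMany(associates, positions, "_id", "associate_id", "position")

	out, err := json.MarshalIndent(associates, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

Because the position records already carry their company by the time they are attached to the associate, one pass per level is enough. Unlike the sample output in this post, this sketch leaves the join-key fields on the nested records.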

The Result

A single API call transforms scattered data:

Input (separate data sources; six of the seven are shown):

associate.json      → [{_id: "001", firstname: "Krishna"}]
username.json       → [{associate_id: "001", username: "inventivepotter"}]
personal-email.json → [{associate_id: "001", email: "[email protected]"}]
business-email.json → [{associate_id: "001", email: "[email protected]"}]
position.json       → [{associate_id: "001", company_id: "c01", name: "Sr. Engineer"}]
company.json        → [{_id: "c01", name: "Acme Corp"}]

Output (unified hierarchy):

{
  "associates": [{
    "_id": "001",
    "firstname": "Krishna",
    "username": "inventivepotter",
    "email": [
      {"email": "[email protected]"},
      {"email": "[email protected]"}
    ],
    "position": [{
      "name": "Sr. Engineer",
      "company": {
        "name": "Acme Corp"
      }
    }]
  }]
}

Why This Matters

Flexibility Without Code Changes

Need to add a new data source? Add it to the template. Need a new relationship? Define the merge type and keys. No recompilation, no new endpoints.

GraphQL Vibes, Simpler Implementation

This achieves similar goals to GraphQL (client-driven data selection, hierarchical responses) but with a fraction of the complexity. No schema definitions, no resolvers, no query parsing.

Production-Ready Patterns

The architecture demonstrates patterns that scale:

Concurrent processing with goroutines and channels
Template-based configuration for operations teams
Hash maps for O(1) lookups during merge operations
Memory cleanup after merge phases

Production Additions

We ran into memory issues pretty quickly. The merge hash map grows fast when you're combining millions of records. We added Redis caching to offload the intermediate merge state. Keys expire after the merge completes, so we're not paying for storage we don't need.

We also added streaming output where clients support it. Instead of building the entire response in memory, we stream merged records as they complete. Works great for large result sets and gives users faster time-to-first-byte.

The README hints at other production features not included in this sample:

Queryable datasources with parameterized inputs
Batched data fetching for high-volume scenarios
Data transformation rules and multiple output formats
Scheduling and event-driven triggers

Summary

Metadata-Driven Merge demonstrates that complex data integration doesn't require complex code. By treating relationships as configuration rather than implementation, you get:

Maintainability: Change behavior without changing code
Performance: Parallel fetching, optimal pruning
Flexibility: Any combination of data sources at runtime

Sometimes the best abstraction isn't a new query language. It's a well-designed metadata template and a smart engine to interpret it.

Thoughts? Hit me up at [email protected]
