Scaling HR & Payroll APIs: Four Patterns for Enterprise Data Distribution
A system design approach for building API platforms that serve HR and Payroll data at enterprise scale, supporting 70,000+ employees per client with four distinct access patterns.
December 7, 2022 · 9 min read
TL;DR: Building an HR & Payroll API platform for 70,000+ employees per client requires four distinct patterns: event-driven delta sync for daily changes (14MB instead of 1.4GB), metadata-driven in-memory merge for flexible snapshots, async jobs for historical point-in-time queries, and file-based integrations for legacy partners. Match your architecture to partner access patterns instead of building one-size-fits-all solutions.
The Challenge
Building an HR & Payroll System of Record (SOR) that serves as a reliable API platform for client partners is fundamentally a data distribution problem at scale. When you're managing 70,000+ employees per client with full HR records (~20KB per employee, totaling ~1.4GB per client), partners like benefits providers, payroll processors, and third-party integrations need efficient, reliable access to this data.
The Context: This isn't just a traditional HR system. It's a Platform-as-a-Service (PaaS) where clients can configure, customize, and build their own features on top of our HR Operating System (HROS). Clients define custom workflows, add new fields, create business rules, and integrate unique processes without writing code. This requirement for extreme flexibility drives our metadata-driven architecture across the entire platform, from data models to business logic to API responses.
The core challenge: How do you serve massive volumes of HR data to diverse partners with different access patterns, while maintaining system reliability, performance, and data consistency, all while supporting a highly configurable platform where each client's HR system can be uniquely customized?
Understanding the Scale
Before diving into solutions, here's the data scale:
- Per Employee HR Record: ~20KB (demographics, employment details, compensation, benefits, tax info, direct deposit)
- 70,000 employees: ~1.4GB total payload for full export
- Daily changes (1% change rate): ~14MB (700 employees × 20KB)
- Payroll data per pay period: ~525MB (70,000 employees × 7.5KB)
With multiple client partners requesting data concurrently, the API platform must support 300+ concurrent requests while maintaining sub-second response times for real-time queries and efficient bulk data delivery for large exports.
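These numbers are simple enough to sanity-check; here's the back-of-envelope math as a quick sketch (the per-record sizes are the rough averages quoted above):

```typescript
// Back-of-envelope sizing using the rough per-record averages quoted above.
const EMPLOYEES = 70_000;
const HR_RECORD_KB = 20;        // full HR record per employee
const PAYROLL_RECORD_KB = 7.5;  // payroll data per employee per pay period
const DAILY_CHANGE_RATE = 0.01; // ~1% of employees change on a typical day

const fullExportGB = (EMPLOYEES * HR_RECORD_KB) / 1_000_000;                 // ~1.4 GB
const dailyDeltaMB = (EMPLOYEES * DAILY_CHANGE_RATE * HR_RECORD_KB) / 1_000; // ~14 MB
const payPeriodMB = (EMPLOYEES * PAYROLL_RECORD_KB) / 1_000;                 // ~525 MB

console.log({ fullExportGB, dailyDeltaMB, payPeriodMB });
```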
Four Core Use Cases
I analyzed partner needs across benefits providers, payroll systems, and third-party integrations. Four distinct access patterns emerged:
Use Case 1: Changes Only (Event-Driven Delta)
The Need: Partners don't want to pull 1.4GB every time they sync. They only need what changed since their last sync.
The Solution: Metadata-Driven Business Events Store
Rather than relying on database-level change data capture (CDC), we use metadata-driven business events published to Kafka.
Why Business Events Beat Database CDC:
- Business Context: Events capture the "why" (termination: voluntary resignation), not just the "what" (status = inactive)
- Logical Grouping: One business event (new hire) might touch 10 tables, but partners see one coherent change
- Partner-Ready: Events already structured for API consumption, no transformation needed
- Configuration-Driven: New event types added via metadata
For a 1% daily change rate (700 employees), the payload is only ~14MB instead of 1.4GB.
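To make the shape of this concrete, here's a rough sketch of a business event and a partner-side delta pull. The field names, endpoint path, and `since` parameter are illustrative, not our actual schema:

```typescript
// Illustrative shape of a business event published to Kafka and consumed via the Changes API.
interface BusinessEvent {
  eventId: string;
  eventType: 'employee.hired' | 'employee.terminated' | 'employee.compensation_changed';
  occurredAt: string;                                      // ISO-8601 timestamp
  employeeId: string;
  reason?: string;                                         // business context, e.g. "voluntary_resignation"
  changes: Record<string, { old: unknown; new: unknown }>; // field-level deltas
}

// A partner's delta sync only needs events since its last checkpoint (~14MB on a typical day).
async function fetchChanges(baseUrl: string, token: string, since: string): Promise<BusinessEvent[]> {
  const res = await fetch(`${baseUrl}/v1/changes?since=${encodeURIComponent(since)}`, {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) throw new Error(`Changes API failed: ${res.status}`);
  return res.json();
}
```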
Use Case 2: Daily Snapshot (Periodic Full Sync)
The Need: Some partners need complete HR records on a regular schedule (daily, weekly, monthly) for reconciliation, reporting, or their own data warehouse refresh. Critically, different partners need different data combinations. A benefits provider needs employee + benefits + dependents (~15KB/record), while a payroll partner needs employee + position + compensation (~12KB/record).
The Solution: Metadata-Driven In-Memory Merge Algorithm
Rather than hardcoding joins in SQL or creating separate endpoints for each partner's needs, we use a metadata-driven merge engine that allows partners to select exactly which data sources they need via query parameters. This provides GraphQL-like flexibility without GraphQL's complexity or N+1 query problems.
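As a minimal illustration of what drives the engine, a source descriptor might look roughly like this. The shape and names are hypothetical; the real AST and merge strategies are covered in the deep-dive linked below:

```typescript
// Hypothetical metadata describing the data sources for one partner profile.
// In the real system this lives in configuration, not code.
type MergeStrategy = 'flatMerge' | 'objectMerge' | 'nestedMerge' | 'arrayMerge';

interface SourceDescriptor {
  name: string;            // logical data source (employee, benefits, dependents, ...)
  fetcher: string;         // which backing service or store to call
  joinKey: string;         // key used to line records up in memory
  strategy: MergeStrategy; // how the fetched data folds into the employee record
}

// e.g. a benefits provider selecting ?sources=employee,benefits,dependents
const benefitsProviderProfile: SourceDescriptor[] = [
  { name: 'employee',   fetcher: 'hr-core',      joinKey: 'employeeId', strategy: 'flatMerge' },
  { name: 'benefits',   fetcher: 'benefits-svc', joinKey: 'employeeId', strategy: 'objectMerge' },
  { name: 'dependents', fetcher: 'benefits-svc', joinKey: 'employeeId', strategy: 'arrayMerge' },
];
```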
Why In-Memory Merge Instead of SQL Joins?
In a metadata-driven PaaS where clients customize their HR system, data complexity multiplies:
- Data spans multiple systems: HR core (RDBMS), benefits (separate service), position hierarchy (graph database), company structure (NoSQL)
- Client-specific customizations: Each client may have custom fields, data sources, and relationships defined in metadata
- Business logic at merge time: Calculating eligibility, applying privacy rules, formatting based on partner permissions
- Partner-specific data combinations: One endpoint serves 20+ partner types with different needs
- Schema evolution: New data sources added via configuration without code changes
For a detailed technical deep-dive on the metadata-driven merge algorithm, including AST structure, merge strategies (flatMerge, objectMerge, nestedMerge, arrayMerge), and implementation patterns, read: Metadata-Driven Merge: A Declarative Approach to Data Integration
Key Design Decisions:
- Read Replicas: Route heavy queries to dedicated read replicas to avoid impacting transactional workload
- Redis Cache: Merged results cached with TTL of 1-6 hours (24-hour staleness acceptable for bulk reconciliation)
- Horizontal Scaling: The merge engine scales horizontally; each instance can run merges independently
- Metadata-Driven: New data sources or relationships added via configuration, not code
With proper indexing, parallel fetching, and read replicas, partners can pull complete datasets in 70 paginated requests (1000 records each), completing in 2-5 minutes.
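From the partner's side, the pull loop is straightforward. Here's a sketch assuming a cursor-based pagination scheme; the endpoint and parameter names are illustrative:

```typescript
// Sketch of a partner pulling a full snapshot in pages of 1000 records.
// Endpoint, cursor parameter, and response shape are illustrative.
async function pullFullSnapshot(baseUrl: string, token: string): Promise<unknown[]> {
  const records: unknown[] = [];
  let cursor: string | undefined;

  do {
    const url = new URL(`${baseUrl}/v1/employees/snapshot`);
    url.searchParams.set('pageSize', '1000');
    if (cursor) url.searchParams.set('cursor', cursor);

    const res = await fetch(url, { headers: { Authorization: `Bearer ${token}` } });
    if (!res.ok) throw new Error(`Snapshot page failed: ${res.status}`);

    const body = await res.json();
    records.push(...body.records);
    cursor = body.nextCursor;       // undefined once the last page is reached
  } while (cursor);

  return records;                   // ~70 pages for a 70K-employee client
}
```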
Use Case 3: Advanced Snapshot As-Of (Point-in-Time Reconstruction)
The Need: Advanced partners (especially compliance, audit, or benefits reconciliation teams) need to see data as it existed on a specific historical date. For example, "Show me all employees and their benefit elections as of October 1, 2022."
The Solution: Async Job with Webhook Notification
This is the most resource-intensive request pattern. We process it asynchronously, generate the data extract, store it in S3, and notify the partner via webhook when ready for download. Our database uses effective dating, so we query all tables with the asOfDate parameter to retrieve historical snapshots without event replay.
Why Async + Webhook?
- Resource Intensive: Querying effective-dated data across multiple tables and merging 70K employees takes 2-5 minutes
- Prevents Timeout: HTTP timeouts would kill long-running requests
- Partner Flexibility: Partners can download the complete file via signed URL or request paginated batches if they prefer streaming
- Retry Logic: Failed webhook deliveries are automatically retried with exponential backoff
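A minimal sketch of that retry behavior on the delivery side; the attempt count and backoff schedule here are illustrative, not our production values:

```typescript
// Illustrative webhook delivery with exponential backoff.
async function deliverWebhook(url: string, payload: unknown, maxAttempts = 5): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await fetch(url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
    }).catch(() => undefined);                    // network failure => retry

    if (res?.ok) return;                          // delivered
    if (attempt === maxAttempts) throw new Error('webhook delivery exhausted retries');

    const backoffMs = 1000 * 2 ** (attempt - 1);  // 1s, 2s, 4s, 8s, ...
    await new Promise((resolve) => setTimeout(resolve, backoffMs));
  }
}
```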
Implementation Details:
- Effective Dating: All tables use effective_from and effective_to columns for temporal queries
- Worker Pool: Kubernetes-based worker pool scales based on queue depth (10-50 workers)
- Batch Processing: Workers process in batches of 1000 to optimize memory usage
- Delivery Options: Partners can download via S3 signed URLs (expires in 24 hours) or request data in paginated batches via API
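As a simplified sketch of the as-of read against a single effective-dated table: the effective_from/effective_to columns match the pattern above, while the table name, other columns, and use of the pg client are illustrative.

```typescript
// Simplified as-of query against one effective-dated table, routed to a read replica.
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.READ_REPLICA_URL }); // hypothetical env var

async function employeesAsOf(clientId: string, asOfDate: string) {
  const sql = `
    SELECT employee_id, status, compensation, benefits_plan
    FROM employee_record
    WHERE client_id = $1
      AND effective_from <= $2
      AND (effective_to IS NULL OR effective_to > $2)
  `;
  const { rows } = await pool.query(sql, [clientId, asOfDate]);
  return rows; // the row versions that were in effect on asOfDate
}
```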
Use Case 4: Lightweight Integrations (File-Based Exchange)
The Need: About 40% of our partners are legacy systems or smaller vendors that don't have API capabilities. They need data in traditional file formats (CSV, XML, XLSX) delivered via SFTP on a schedule.
The Solution: Client-Formatted File Delivery System
We built what we call "Lightweight Integrations," a sophisticated system that generates custom-formatted files for each partner. The system handles format transformations, client-specific field mappings, and scheduled deliveries without requiring partners to build API integrations.
Key Features:
- Format Support: CSV, XML, XLSX, fixed-width, and custom delimited formats
- Client Configurations: Each partner defines their field mappings, data filters, and transformation rules via metadata
- Scheduled Delivery: Daily, weekly, bi-weekly, or monthly automated file generation and SFTP delivery
- Custom Transformations: Partner-specific data formatting (date formats, currency codes, enum mappings)
- Reconciliation Files: Automated generation of control files with record counts and checksums
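To give a feel for it, a partner's file configuration might be expressed roughly like this; all names, formats, and values below are hypothetical:

```typescript
// Hypothetical partner file configuration; in the real system this is metadata, not code.
interface FileIntegrationConfig {
  partnerId: string;
  format: 'csv' | 'xml' | 'xlsx' | 'fixed-width';
  schedule: 'daily' | 'weekly' | 'bi-weekly' | 'monthly';
  delivery: { protocol: 'sftp'; host: string; path: string };
  fieldMappings: Array<{ source: string; target: string; transform?: string }>;
  controlFile: boolean; // emit record counts + checksum alongside the data file
}

const examplePartner: FileIntegrationConfig = {
  partnerId: 'acme-benefits',
  format: 'csv',
  schedule: 'weekly',
  delivery: { protocol: 'sftp', host: 'sftp.acme.example', path: '/inbound/hr' },
  fieldMappings: [
    { source: 'employee.hireDate', target: 'HIRE_DT', transform: 'date:MM/DD/YYYY' },
    { source: 'employee.salary.currency', target: 'CURR_CD', transform: 'enum:iso4217' },
  ],
  controlFile: true,
};
```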
Why This Matters: Not every partner can or wants to build API integrations. By supporting file-based exchange alongside our modern API patterns, we serve 100% of partner needs without forcing technology choices on them. The metadata-driven approach means adding a new file format or partner configuration is just a config change, not a code deployment.
This pattern serves the roughly 40% of partners described above, proving that modern architecture doesn't mean abandoning proven patterns. Sometimes a well-executed CSV file delivered via SFTP is exactly what the partner needs.
System Design Overview
Here's how these four patterns come together in a unified architecture, starting with the concerns all four share:
Cross-Cutting Concerns:
1. Authentication and Authorization (OAuth 2.0)
- Partners authenticate via OAuth 2.0 client credentials flow
- JWT tokens include scopes limiting access to specific data types and operations
- Token validation at API Gateway layer with Redis-cached token introspection
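For reference, the partner side of this is the standard client credentials grant; the token endpoint and scope names below are illustrative:

```typescript
// Standard OAuth 2.0 client credentials grant; endpoint and scope names are illustrative.
async function getPartnerToken(clientId: string, clientSecret: string): Promise<string> {
  const res = await fetch('https://auth.example.com/oauth/token', {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body: new URLSearchParams({
      grant_type: 'client_credentials',
      client_id: clientId,
      client_secret: clientSecret,
      scope: 'employees:read benefits:read', // scopes limit data types and operations
    }),
  });
  if (!res.ok) throw new Error(`Token request failed: ${res.status}`);
  const { access_token } = await res.json();
  return access_token;
}
```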
2. Rate Limiting (Token Bucket Algorithm)
- 300 concurrent requests supported per client
- Per-partner rate limits: Tier 1 (100 req/sec), Tier 2 (500 req/sec), Tier 3 (2000 req/sec)
- Burst allowance for legitimate traffic spikes
- Expensive operations (bulk exports) have separate quota limits
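A minimal token bucket sketch, just to show how the tier numbers map to capacity and refill rate; the real limiter sits at the gateway and has to be distributed, which this in-memory version ignores:

```typescript
// Minimal in-memory token bucket; capacity above the refill rate provides burst headroom.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillPerSec: number) {
    this.tokens = capacity;
  }

  tryConsume(cost = 1): boolean {
    const now = Date.now();
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((now - this.lastRefill) / 1000) * this.refillPerSec,
    );
    this.lastRefill = now;
    if (this.tokens < cost) return false; // reject (HTTP 429)
    this.tokens -= cost;
    return true;
  }
}

// e.g. a Tier 2 partner: 500 req/sec steady rate with burst headroom (illustrative capacity)
const tier2 = new TokenBucket(1000, 500);
```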
3. Horizontal Scaling
- API Gateway: Auto-scales based on request count (Kubernetes HPA)
- API Services: Scale on CPU/memory metrics
- Worker Pool: Scales based on job queue depth
- Database: Read replicas distribute query load
4. Redis Cache Strategy
- Real-time Lookups: 5-15 minute TTL, event-driven invalidation
- Delta Queries: 1-6 hour TTL (24-hour staleness acceptable)
- Bulk Snapshots: 24-hour TTL, regenerated daily
- Distributed Redis Cluster with sharding for high availability
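A sketch of the read-through pattern with per-query-class TTLs; the key format and the employee-level invalidation hook are illustrative:

```typescript
// Read-through cache with per-query-class TTLs (node-redis v4).
import { createClient } from 'redis';

const redis = createClient({ url: process.env.REDIS_URL }); // call `await redis.connect()` at startup

const TTL_SECONDS = {
  realtime: 15 * 60,      // real-time lookups: 5-15 min class
  delta: 6 * 60 * 60,     // delta queries: 1-6 hour class
  snapshot: 24 * 60 * 60, // bulk snapshots: regenerated daily
} as const;

async function cachedFetch<T>(
  kind: keyof typeof TTL_SECONDS,
  key: string,
  loader: () => Promise<T>,
): Promise<T> {
  const hit = await redis.get(`${kind}:${key}`);
  if (hit) return JSON.parse(hit);

  const value = await loader();
  await redis.set(`${kind}:${key}`, JSON.stringify(value), { EX: TTL_SECONDS[kind] });
  return value;
}

// Event-driven invalidation: a business event for an employee evicts their real-time entries.
async function invalidateEmployee(employeeId: string): Promise<void> {
  await redis.del(`realtime:employee:${employeeId}`);
}
```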
5. Multi-Tenant Data Isolation and Sharding
With each client potentially customizing their HROS differently and managing 70K+ employees, data isolation is critical for performance, security, and compliance.
Tiered Isolation Strategy:
- Large Clients (10K+ employees): Dedicated database shards. Separate physical databases ensure noisy neighbor isolation and allow client-specific optimization
- Medium Clients (1K-10K employees): Schema-per-tenant within shared databases. Provides logical isolation with resource efficiency
- Small Clients (< 1K employees): Row-level security with tenant_id filtering. Maximum resource sharing for cost efficiency
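A simplified sketch of how tenant size maps to a connection target under this tiered strategy; the thresholds match the tiers above, everything else is illustrative:

```typescript
// Illustrative tenant routing for the tiered isolation strategy above.
type IsolationTier = 'dedicated-shard' | 'schema-per-tenant' | 'row-level-security';

interface TenantRoute {
  tier: IsolationTier;
  connection: string; // which database/shard to use
  schema?: string;    // set for schema-per-tenant
}

function routeTenant(clientId: string, employeeCount: number): TenantRoute {
  if (employeeCount >= 10_000) {
    return { tier: 'dedicated-shard', connection: `shard-${clientId}` };
  }
  if (employeeCount >= 1_000) {
    return { tier: 'schema-per-tenant', connection: 'shared-cluster-1', schema: `tenant_${clientId}` };
  }
  // small tenants share tables; every query must filter on tenant_id
  return { tier: 'row-level-security', connection: 'shared-cluster-1' };
}
```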
Why Client Sharding Matters:
- Performance Isolation: One client's 70K employee export doesn't impact another client's real-time queries
- Custom Schema Support: Large clients with extensive customizations can have optimized database schemas
- Data Sovereignty: Clients with regulatory requirements can have data in specific geographic regions
- Independent Scaling: High-growth clients scale independently without affecting others
Key Metrics & Monitoring
Performance Targets:
- Changes API (Use Case 1): < 500ms P95 latency, handles 14MB typical daily payload
- Snapshot API (Use Case 2): < 2 seconds per paginated page (1000 records), 2-5 minutes for a complete 70K export
- Async Export API (Use Case 3): Job queued < 100ms, complete export ready in 2-5 minutes, 99.9% webhook delivery success
- File-Based Integrations (Use Case 4): Files generated and delivered within scheduled window, 99.9% delivery success
Infrastructure Metrics:
- API Gateway: 300+ concurrent requests, < 1% error rate
- Rate Limiter: Per-partner quota enforcement, burst handling
- Redis Cache: > 80% hit rate for snapshot queries
- Worker Pool: Queue depth < 1000 jobs, auto-scaling 10-50 workers
- Database: Read replica lag < 5 seconds
What I Learned
Building an enterprise-grade HR & Payroll API platform taught me that partner access patterns should drive your architecture. The four patterns we built work because each solves a specific problem:
- Changes Only for efficient delta sync via business events
- Daily Snapshot for periodic full reconciliation via paginated APIs
- Advanced Snapshot As-Of for historical point-in-time reconstruction via async jobs
- Lightweight Integrations for file-based exchange with legacy partners
This approach lets us serve 70,000+ employee records efficiently while maintaining sub-second response times for real-time queries and reliable delivery for bulk exports. Supporting both modern APIs and traditional file-based integrations means we serve 100% of partner needs without forcing technology choices.
The key decisions (OAuth for security, rate limiting for stability, Redis caching for performance, horizontal scaling for growth, and event-driven design for flexibility) create a platform that meets enterprise SLAs while remaining cost-effective and maintainable.
If you're building a similar data distribution platform, start by deeply understanding your partner access patterns. Then design your architecture to optimize for those specific use cases rather than trying to build a one-size-fits-all solution. Don't be afraid to support traditional patterns alongside modern ones when that's what your partners need.
Thoughts? Hit me up at [email protected]