Scaling HR & Payroll APIs: Four Patterns for Enterprise Data Distribution
A system design approach for building API platforms that serve HR and Payroll data at enterprise scale, supporting 70,000+ employees per client with four distinct access patterns.
December 7, 2022 · 9 min read
TL;DR: Building an HR & Payroll API platform for 70,000+ employees per client requires four distinct patterns: event-driven delta sync for daily changes (14MB instead of 1.4GB), metadata-driven in-memory merge for flexible snapshots, async jobs for historical point-in-time queries, and file-based integrations for legacy partners. Match your architecture to partner access patterns instead of building one-size-fits-all solutions.
The Challenge
Building an HR & Payroll System of Record (SOR) that serves as a reliable API platform for client partners is fundamentally a data distribution problem at scale. When you're managing 70,000+ employees per client with full HR records (~20KB per employee, totaling ~1.4GB per client), partners like benefits providers, payroll processors, and third-party integrations need efficient, reliable access to this data.
The Context: This isn't just a traditional HR system. It's a Platform-as-a-Service (PaaS) where clients can configure, customize, and build their own features on top of our HR Operating System (HROS). Clients define custom workflows, add new fields, create business rules, and integrate unique processes without writing code. This requirement for extreme flexibility drives our metadata-driven architecture across the entire platform, from data models to business logic to API responses.
The core challenge: How do you serve massive volumes of HR data to diverse partners with different access patterns, while maintaining system reliability, performance, and data consistency, all while supporting a highly configurable platform where each client's HR system can be uniquely customized?
Understanding the Scale
Before diving into solutions, here's the data scale:
- Per Employee HR Record: ~20KB (demographics, employment details, compensation, benefits, tax info, direct deposit)
- 70,000 employees: ~1.4GB total payload for full export
- Daily changes (1% change rate): ~14MB (700 employees × 20KB)
- Payroll data per pay period: ~525MB (70,000 employees × 7.5KB)
With multiple client partners requesting data concurrently, the API platform must support 300+ concurrent requests while maintaining sub-second response times for real-time queries and efficient bulk data delivery for large exports.
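These numbers are simple enough to sanity-check; here's the back-of-envelope math as a quick sketch (the per-record sizes are the rough averages quoted above):

```typescript
// Back-of-envelope sizing using the rough per-record averages quoted above.
const EMPLOYEES = 70_000;
const HR_RECORD_KB = 20;        // full HR record per employee
const PAYROLL_RECORD_KB = 7.5;  // payroll data per employee per pay period
const DAILY_CHANGE_RATE = 0.01; // ~1% of employees change on a typical day

const fullExportGB = (EMPLOYEES * HR_RECORD_KB) / 1_000_000;                 // ~1.4 GB
const dailyDeltaMB = (EMPLOYEES * DAILY_CHANGE_RATE * HR_RECORD_KB) / 1_000; // ~14 MB
const payPeriodMB = (EMPLOYEES * PAYROLL_RECORD_KB) / 1_000;                 // ~525 MB

console.log({ fullExportGB, dailyDeltaMB, payPeriodMB });
```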
Four Core Use Cases
I analyzed partner needs across benefits providers, payroll systems, and third-party integrations. Four distinct access patterns emerged:
Use Case 1: Changes Only (Event-Driven Delta)
The Need: Partners don't want to pull 1.4GB every time they sync. They only need what changed since their last sync.
The Solution: Metadata-Driven Business Events Store
Rather than relying on database-level change data capture (CDC), we use metadata-driven business events published to Kafka.
Why Business Events Beat Database CDC:
- Business Context: Events capture the "why" (termination: voluntary resignation), not just the "what" (status = inactive)
- Logical Grouping: One business event (new hire) might touch 10 tables, but partners see one coherent change
- Partner-Ready: Events already structured for API consumption, no transformation needed
- Configuration-Driven: New event types added via metadata
For a 1% daily change rate (700 employees), the payload is only ~14MB instead of 1.4GB.
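To make the shape of this concrete, here's a rough sketch of a business event and a partner-side delta pull. The field names, endpoint path, and `since` parameter are illustrative, not our actual schema:

```typescript
// Illustrative shape of a business event published to Kafka and consumed via the Changes API.
interface BusinessEvent {
  eventId: string;
  eventType: 'employee.hired' | 'employee.terminated' | 'employee.compensation_changed';
  occurredAt: string;                                      // ISO-8601 timestamp
  employeeId: string;
  reason?: string;                                         // business context, e.g. "voluntary_resignation"
  changes: Record<string, { old: unknown; new: unknown }>; // field-level deltas
}

// A partner's delta sync only needs events since its last checkpoint (~14MB on a typical day).
async function fetchChanges(baseUrl: string, token: string, since: string): Promise<BusinessEvent[]> {
  const res = await fetch(`${baseUrl}/v1/changes?since=${encodeURIComponent(since)}`, {
    headers: { Authorization: `Bearer ${token}` },
  });
  if (!res.ok) throw new Error(`Changes API failed: ${res.status}`);
  return res.json();
}
```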
Use Case 2: Daily Snapshot (Periodic Full Sync)
The Need: Some partners need complete HR records on a regular schedule (daily, weekly, monthly) for reconciliation, reporting, or their own data warehouse refresh. Critically, different partners need different data combinations. A benefits provider needs employee + benefits + dependents (~15KB/record), while a payroll partner needs employee + position + compensation (~12KB/record).
The Solution: Metadata-Driven In-Memory Merge Algorithm
Rather than hardcoding joins in SQL or creating separate endpoints for each partner's needs, we use a metadata-driven merge engine that allows partners to select exactly which data sources they need via query parameters. This provides GraphQL-like flexibility without GraphQL's complexity or N+1 query problems.
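As a minimal illustration of what drives the engine, a source descriptor might look roughly like this. The shape and names are hypothetical; the real AST and merge strategies are covered in the deep-dive linked below:

```typescript
// Hypothetical metadata describing the data sources for one partner profile.
// In the real system this lives in configuration, not code.
type MergeStrategy = 'flatMerge' | 'objectMerge' | 'nestedMerge' | 'arrayMerge';

interface SourceDescriptor {
  name: string;            // logical data source (employee, benefits, dependents, ...)
  fetcher: string;         // which backing service or store to call
  joinKey: string;         // key used to line records up in memory
  strategy: MergeStrategy; // how the fetched data folds into the employee record
}

// e.g. a benefits provider selecting ?sources=employee,benefits,dependents
const benefitsProviderProfile: SourceDescriptor[] = [
  { name: 'employee',   fetcher: 'hr-core',      joinKey: 'employeeId', strategy: 'flatMerge' },
  { name: 'benefits',   fetcher: 'benefits-svc', joinKey: 'employeeId', strategy: 'objectMerge' },
  { name: 'dependents', fetcher: 'benefits-svc', joinKey: 'employeeId', strategy: 'arrayMerge' },
];
```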
Why In-Memory Merge Instead of SQL Joins?
In a metadata-driven PaaS where clients customize their HR system, data complexity multiplies:
- Data spans multiple systems: HR core (RDBMS), benefits (separate service), position hierarchy (graph database), company structure (NoSQL)
- Client-specific customizations: Each client may have custom fields, data sources, and relationships defined in metadata
- Business logic at merge time: Calculating eligibility, applying privacy rules, formatting based on partner permissions
- Partner-specific data combinations: One endpoint serves 20+ partner types with different needs
- Schema evolution: New data sources added via configuration without code changes
For a detailed technical deep-dive on the metadata-driven merge algorithm, including AST structure, merge strategies (flatMerge, objectMerge, nestedMerge, arrayMerge), and implementation patterns, read: Metadata-Driven Merge: A Declarative Approach to Data Integration
Key Design Decisions:
- Read Replicas: Route heavy queries to dedicated read replicas to avoid impacting transactional workload
- Redis Cache: Merged results cached with TTL of 1-6 hours (24-hour staleness acceptable for bulk reconciliation)
- Horizontal Scaling: The merge engine scales horizontally; each instance can run merges independently
- Metadata-Driven: New data sources or relationships added via configuration, not code
With proper indexing, parallel fetching, and read replicas, partners can pull complete datasets in 70 paginated requests (1000 records each), completing in 2-5 minutes.
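From the partner's side, the pull loop is straightforward. Here's a sketch assuming a cursor-based pagination scheme; the endpoint and parameter names are illustrative:

```typescript
// Sketch of a partner pulling a full snapshot in pages of 1000 records.
// Endpoint, cursor parameter, and response shape are illustrative.
async function pullFullSnapshot(baseUrl: string, token: string): Promise<unknown[]> {
  const records: unknown[] = [];
  let cursor: string | undefined;

  do {
    const url = new URL(`${baseUrl}/v1/employees/snapshot`);
    url.searchParams.set('pageSize', '1000');
    if (cursor) url.searchParams.set('cursor', cursor);

    const res = await fetch(url, { headers: { Authorization: `Bearer ${token}` } });
    if (!res.ok) throw new Error(`Snapshot page failed: ${res.status}`);

    const body = await res.json();
    records.push(...body.records);
    cursor = body.nextCursor;       // undefined once the last page is reached
  } while (cursor);

  return records;                   // ~70 pages for a 70K-employee client
}
```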
Use Case 3: Advanced Snapshot As-Of (Point-in-Time Reconstruction)
The Need: Advanced partners (especially compliance, audit, or benefits reconciliation teams) need to see data as it existed on a specific historical date. For example, "Show me all employees and their benefit elections as of October 1, 2022."
The Solution: Async Job with Webhook Notification
This is the most resource-intensive request pattern. We process it asynchronously, generate the data extract, store it in S3, and notify the partner via webhook when ready for download. Our database uses effective dating, so we query all tables with the asOfDate parameter to retrieve historical snapshots without event replay.
Why Async + Webhook?
- Resource Intensive: Querying effective-dated data across multiple tables and merging 70K employees takes 2-5 minutes
- Prevents Timeout: HTTP timeouts would kill long-running requests
- Partner Flexibility: Partners can download the complete file via signed URL or request paginated batches if they prefer streaming
- Retry Logic: Failed webhook deliveries are automatically retried with exponential backoff
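A minimal sketch of that retry behavior on the delivery side; the attempt count and backoff schedule here are illustrative, not our production values:

```typescript
// Illustrative webhook delivery with exponential backoff.
async function deliverWebhook(url: string, payload: unknown, maxAttempts = 5): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await fetch(url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
    }).catch(() => undefined);                    // network failure => retry

    if (res?.ok) return;                          // delivered
    if (attempt === maxAttempts) throw new Error('webhook delivery exhausted retries');

    const backoffMs = 1000 * 2 ** (attempt - 1);  // 1s, 2s, 4s, 8s, ...
    await new Promise((resolve) => setTimeout(resolve, backoffMs));
  }
}
```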
Implementation Details:
- Effective Dating: All tables use effective_from and effective_to columns for temporal queries
- Worker Pool: Kubernetes-based worker pool scales based on queue depth (10-50 workers)
- Batch Processing: Workers process in batches of 1000 to optimize memory usage
- Delivery Options: Partners can download via S3 signed URLs (expires in 24 hours) or request data in paginated batches via API
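As a simplified sketch of the as-of read against a single effective-dated table: the effective_from/effective_to columns match the pattern above, while the table name, other columns, and use of the pg client are illustrative.

```typescript
// Simplified as-of query against one effective-dated table, routed to a read replica.
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.READ_REPLICA_URL }); // hypothetical env var

async function employeesAsOf(clientId: string, asOfDate: string) {
  const sql = `
    SELECT employee_id, status, compensation, benefits_plan
    FROM employee_record
    WHERE client_id = $1
      AND effective_from <= $2
      AND (effective_to IS NULL OR effective_to > $2)
  `;
  const { rows } = await pool.query(sql, [clientId, asOfDate]);
  return rows; // the row versions that were in effect on asOfDate
}
```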
Use Case 4: Lightweight Integrations (File-Based Exchange)
The Need: About 40% of our partners are legacy systems or smaller vendors that don't have API capabilities. They need data in traditional file formats (CSV, XML, XLSX) delivered via SFTP on a schedule.
The Solution: Client-Formatted File Delivery System
We built what we call "Lightweight Integrations," a sophisticated system that generates custom-formatted files for each partner. The system handles format transformations, client-specific field mappings, and scheduled deliveries without requiring partners to build API integrations.
Key Features:
- Format Support: CSV, XML, XLSX, fixed-width, and custom delimited formats
- Client Configurations: Each partner defines their field mappings, data filters, and transformation rules via metadata
- Scheduled Delivery: Daily, weekly, bi-weekly, or monthly automated file generation and SFTP delivery
- Custom Transformations: Partner-specific data formatting (date formats, currency codes, enum mappings)
- Reconciliation Files: Automated generation of control files with record counts and checksums
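To give a feel for it, a partner's file configuration might be expressed roughly like this; all names, formats, and values below are hypothetical:

```typescript
// Hypothetical partner file configuration; in the real system this is metadata, not code.
interface FileIntegrationConfig {
  partnerId: string;
  format: 'csv' | 'xml' | 'xlsx' | 'fixed-width';
  schedule: 'daily' | 'weekly' | 'bi-weekly' | 'monthly';
  delivery: { protocol: 'sftp'; host: string; path: string };
  fieldMappings: Array<{ source: string; target: string; transform?: string }>;
  controlFile: boolean; // emit record counts + checksum alongside the data file
}

const examplePartner: FileIntegrationConfig = {
  partnerId: 'acme-benefits',
  format: 'csv',
  schedule: 'weekly',
  delivery: { protocol: 'sftp', host: 'sftp.acme.example', path: '/inbound/hr' },
  fieldMappings: [
    { source: 'employee.hireDate', target: 'HIRE_DT', transform: 'date:MM/DD/YYYY' },
    { source: 'employee.salary.currency', target: 'CURR_CD', transform: 'enum:iso4217' },
  ],
  controlFile: true,
};
```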
Why This Matters: Not every partner can or wants to build API integrations. By supporting file-based exchange alongside our modern API patterns, we serve 100% of partner needs without forcing technology choices on them. The metadata-driven approach means adding a new file format or partner configuration is just a config change, not a code deployment.
This pattern serves the roughly 40% of partners described above, proving that modern architecture doesn't mean abandoning proven patterns. Sometimes a well-executed CSV file delivered via SFTP is exactly what the partner needs.
System Design Overview
Here's how these four patterns come together in a unified architecture, starting with the concerns all four share:
Cross-Cutting Concerns:
1. Authentication and Authorization (OAuth 2.0)
- Partners authenticate via OAuth 2.0 client credentials flow
- JWT tokens include scopes limiting access to specific data types and operations
- Token validation at API Gateway layer with Redis-cached token introspection
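For reference, the partner side of this is the standard client credentials grant; the token endpoint and scope names below are illustrative:

```typescript
// Standard OAuth 2.0 client credentials grant; endpoint and scope names are illustrative.
async function getPartnerToken(clientId: string, clientSecret: string): Promise<string> {
  const res = await fetch('https://auth.example.com/oauth/token', {
    method: 'POST',
    headers: { 'Content-Type': 'application/x-www-form-urlencoded' },
    body: new URLSearchParams({
      grant_type: 'client_credentials',
      client_id: clientId,
      client_secret: clientSecret,
      scope: 'employees:read benefits:read', // scopes limit data types and operations
    }),
  });
  if (!res.ok) throw new Error(`Token request failed: ${res.status}`);
  const { access_token } = await res.json();
  return access_token;
}
```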
2. Rate Limiting (Token Bucket Algorithm)
- 300 concurrent requests supported per client
- Per-partner rate limits: Tier 1 (100 req/sec), Tier 2 (500 req/sec), Tier 3 (2000 req/sec)
- Burst allowance for legitimate traffic spikes
- Expensive operations (bulk exports) have separate quota limits
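A minimal token bucket sketch, just to show how the tier numbers map to capacity and refill rate; the real limiter sits at the gateway and has to be distributed, which this in-memory version ignores:

```typescript
// Minimal in-memory token bucket; capacity above the refill rate provides burst headroom.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private capacity: number, private refillPerSec: number) {
    this.tokens = capacity;
  }

  tryConsume(cost = 1): boolean {
    const now = Date.now();
    this.tokens = Math.min(
      this.capacity,
      this.tokens + ((now - this.lastRefill) / 1000) * this.refillPerSec,
    );
    this.lastRefill = now;
    if (this.tokens < cost) return false; // reject (HTTP 429)
    this.tokens -= cost;
    return true;
  }
}

// e.g. a Tier 2 partner: 500 req/sec steady rate with burst headroom (illustrative capacity)
const tier2 = new TokenBucket(1000, 500);
```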
3. Horizontal Scaling
- API Gateway: Auto-scales based on request count (Kubernetes HPA)
- API Services: Scale on CPU/memory metrics
- Worker Pool: Scales based on job queue depth
- Database: Read replicas distribute query load
4. Redis Cache Strategy
- Real-time Lookups: 5-15 minute TTL, event-driven invalidation
- Delta Queries: 1-6 hour TTL (24-hour staleness acceptable)
- Bulk Snapshots: 24-hour TTL, regenerated daily
- Distributed Redis Cluster with sharding for high availability
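A sketch of the read-through pattern with per-query-class TTLs; the key format and the employee-level invalidation hook are illustrative:

```typescript
// Read-through cache with per-query-class TTLs (node-redis v4).
import { createClient } from 'redis';

const redis = createClient({ url: process.env.REDIS_URL }); // call `await redis.connect()` at startup

const TTL_SECONDS = {
  realtime: 15 * 60,      // real-time lookups: 5-15 min class
  delta: 6 * 60 * 60,     // delta queries: 1-6 hour class
  snapshot: 24 * 60 * 60, // bulk snapshots: regenerated daily
} as const;

async function cachedFetch<T>(
  kind: keyof typeof TTL_SECONDS,
  key: string,
  loader: () => Promise<T>,
): Promise<T> {
  const hit = await redis.get(`${kind}:${key}`);
  if (hit) return JSON.parse(hit);

  const value = await loader();
  await redis.set(`${kind}:${key}`, JSON.stringify(value), { EX: TTL_SECONDS[kind] });
  return value;
}

// Event-driven invalidation: a business event for an employee evicts their real-time entries.
async function invalidateEmployee(employeeId: string): Promise<void> {
  await redis.del(`realtime:employee:${employeeId}`);
}
```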
5. Multi-Tenant Data Isolation and Sharding
With each client potentially customizing their HROS differently and managing 70K+ employees, data isolation is critical for performance, security, and compliance.
Tiered Isolation Strategy:
- Large Clients (10K+ employees): Dedicated database shards. Separate physical databases ensure noisy neighbor isolation and allow client-specific optimization
- Medium Clients (1K-10K employees): Schema-per-tenant within shared databases. Provides logical isolation with resource efficiency
- Small Clients (< 1K employees): Row-level security with tenant_id filtering. Maximum resource sharing for cost efficiency
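A simplified sketch of how tenant size maps to a connection target under this tiered strategy; the thresholds match the tiers above, everything else is illustrative:

```typescript
// Illustrative tenant routing for the tiered isolation strategy above.
type IsolationTier = 'dedicated-shard' | 'schema-per-tenant' | 'row-level-security';

interface TenantRoute {
  tier: IsolationTier;
  connection: string; // which database/shard to use
  schema?: string;    // set for schema-per-tenant
}

function routeTenant(clientId: string, employeeCount: number): TenantRoute {
  if (employeeCount >= 10_000) {
    return { tier: 'dedicated-shard', connection: `shard-${clientId}` };
  }
  if (employeeCount >= 1_000) {
    return { tier: 'schema-per-tenant', connection: 'shared-cluster-1', schema: `tenant_${clientId}` };
  }
  // small tenants share tables; every query must filter on tenant_id
  return { tier: 'row-level-security', connection: 'shared-cluster-1' };
}
```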
Why Client Sharding Matters:
- Performance Isolation: One client's 70K employee export doesn't impact another client's real-time queries
- Custom Schema Support: Large clients with extensive customizations can have optimized database schemas
- Data Sovereignty: Clients with regulatory requirements can have data in specific geographic regions
- Independent Scaling: High-growth clients scale independently without affecting others
Key Metrics & Monitoring
Performance Targets:
- Changes API (Use Case 1): < 500ms P95 latency, handles 14MB typical daily payload
- Snapshot API (Use Case 2): < 2 seconds per paginated page (1000 records), 2-5 minutes for a complete 70K export
- Async Export API (Use Case 3): Job queued < 100ms, complete export ready in 2-5 minutes, 99.9% webhook delivery success
- File-Based Integrations (Use Case 4): Files generated and delivered within scheduled window, 99.9% delivery success
Infrastructure Metrics:
- API Gateway: 300+ concurrent requests, < 1% error rate
- Rate Limiter: Per-partner quota enforcement, burst handling
- Redis Cache: > 80% hit rate for snapshot queries
- Worker Pool: Queue depth < 1000 jobs, auto-scaling 10-50 workers
- Database: Read replica lag < 5 seconds
What I Learned
Building an enterprise-grade HR & Payroll API platform taught me that partner access patterns should drive your architecture. The four patterns we built work because each solves a specific problem:
- Changes Only for efficient delta sync via business events
- Daily Snapshot for periodic full reconciliation via paginated APIs
- Advanced Snapshot As-Of for historical point-in-time reconstruction via async jobs
- Lightweight Integrations for file-based exchange with legacy partners
This approach lets us serve 70,000+ employee records efficiently while maintaining sub-second response times for real-time queries and reliable delivery for bulk exports. Supporting both modern APIs and traditional file-based integrations means we serve 100% of partner needs without forcing technology choices.
The key decisions (OAuth for security, rate limiting for stability, Redis caching for performance, horizontal scaling for growth, and event-driven design for flexibility) create a platform that meets enterprise SLAs while remaining cost-effective and maintainable.
If you're building a similar data distribution platform, start by deeply understanding your partner access patterns. Then design your architecture to optimize for those specific use cases rather than trying to build a one-size-fits-all solution. Don't be afraid to support traditional patterns alongside modern ones when that's what your partners need.
Thoughts? Hit me up at [email protected]