~/work/pvr-rearchitecture

case study / payments infrastructure

PVR: rearchitecting the payment vault's data layer for exponential growth

A task to tune a purge batch job turned into a multi-quarter migration from SQL Server to DynamoDB, saving over $1.4M per year and permanently eliminating a recurring class of compliance risk.

2019–2021 · Senior Software Engineer · 500M+ payment instruments · PCI-regulated

Context

Expedia Group’s payment vault stored over 500 million payment instruments across 20+ brands. PCI DSS required purging instrument and verification data within strict time windows. The vault ran on SQL Server with data split across multiple partitions, and purge jobs ran as scheduled batch processes during off-peak hours.

As the platform onboarded more brands and the CIT/MIT (Customer Initiated Transaction / Merchant Initiated Transaction) framework increased the volume of stored instrument records, the purge jobs began falling behind. During high-traffic weeks, the backlog grew faster than we could process it.

I was tasked with tuning the batch jobs and stored procedures to scale for the upcoming peak season without impacting API SLAs.

Constraints

The purge jobs and the production API shared the same database infrastructure. Running purges harder degraded booking-path latency. PCI compliance had a hard time window: fall behind and the company was out of compliance, with real regulatory consequences. The platform was growing exponentially year-over-year as new brands integrated, meaning any fix had to scale with that growth trajectory, not just handle the current load.

Peak season traffic was approaching, which would compound the problem. And the solution could not require downtime or changes to the API contract, since dozens of client teams depended on the vault’s interfaces.

Architecture

The project evolved through three phases, each building on the one before.

Phase 1: Tactical tuning. I analyzed data growth rates to determine the minimum purge throughput needed to keep pace with projected growth, then made targeted changes to batch job scheduling, stored procedure efficiency, and batch sizing to raise throughput without crossing the latency-impact threshold for production APIs.
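
To make the batch-sizing idea concrete, here is a minimal sketch, not the actual stored procedures: purge in bounded batches and back off whenever production latency creeps toward the budget. The thresholds and helpers (purge_batch, api_p99_latency_ms, the backlog stand-in) are all illustrative placeholders, not the real vault internals.

```python
import time

LATENCY_BUDGET_MS = 150            # assumed booking-path latency ceiling
MIN_BATCH, MAX_BATCH = 500, 20_000
_backlog = 100_000                 # stand-in for the expired-row backlog

def purge_batch(size: int) -> int:
    """Placeholder for a 'DELETE TOP (n)' pass over expired rows."""
    global _backlog
    purged = min(size, _backlog)
    _backlog -= purged
    return purged

def api_p99_latency_ms() -> float:
    """Placeholder for a live read of production API latency."""
    return 90.0

def purge_loop() -> None:
    batch = MIN_BATCH
    while True:
        if purge_batch(batch) == 0:
            break                                # backlog drained
        if api_p99_latency_ms() > LATENCY_BUDGET_MS:
            batch = max(MIN_BATCH, batch // 2)   # ease off the shared database
            time.sleep(1.0)                      # let the API recover
        else:
            batch = min(MAX_BATCH, batch * 2)    # ramp up while headroom exists
```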

Phase 2: Partition toggling. When a SQL Server patching incident created a multi-day purge backlog that threatened compliance, I designed a partition toggling strategy. By redirecting new registrations to an alternate partition each week, we could run aggressive purge jobs on the idle partition without impacting live traffic. Over 95% of instrument reads happened within 24 hours of creation, so the active partition served nearly all production traffic while the idle partition was purged at 4x normal throughput. This cleared the backlog within a week and became a documented, reusable operational procedure.
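
A simplified sketch of the toggle logic, assuming two partitions and a weekly rotation; the real routing lived in the vault’s data-access layer, and the partition names here are invented.

```python
from datetime import date

PARTITIONS = ("vault_a", "vault_b")  # illustrative partition names

def active_partition(today: date) -> str:
    """New registrations land on the partition chosen by ISO-week parity."""
    return PARTITIONS[today.isocalendar()[1] % 2]

def idle_partition(today: date) -> str:
    """The other partition is safe to purge aggressively: with >95% of reads
    hitting instruments created in the last 24 hours, almost no live traffic
    touches it."""
    return PARTITIONS[(today.isocalendar()[1] + 1) % 2]
```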

Phase 3: DynamoDB rearchitecture. The tactical fixes bought time, but the fundamental problem remained: purging at scale on a relational database is an inherently adversarial workload against production reads. I proposed migrating the vault’s data layer to DynamoDB with TTL-based expiration, eliminating purge jobs entirely. Records would expire automatically based on their TTL attribute, with no batch processing, no database contention, and no compliance risk from processing backlogs.
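
A minimal sketch of what TTL-based expiration looks like with boto3; the table name, attribute names, and retention window are assumptions, not the production schema.

```python
import time
import boto3

# Hypothetical table; TTL must be enabled on it with "expires_at" declared
# as the TTL attribute (an epoch-seconds number).
table = boto3.resource("dynamodb").Table("payment_instruments")

RETENTION_SECONDS = 90 * 24 * 3600  # assumed retention window, not the real one

def store_instrument(token: str, attributes: dict) -> None:
    table.put_item(Item={
        "instrument_token": token,
        **attributes,
        "expires_at": int(time.time()) + RETENTION_SECONDS,  # DynamoDB deletes past this
    })
```

One caveat worth noting: DynamoDB’s TTL deletion is asynchronous (expired items are typically removed within a few days), so a strict compliance guarantee also means filtering out items whose expires_at has already passed at read time.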

The migration design preserved the existing API contract so client teams required zero changes. Data was dual-written during the transition period, with reads gradually shifted to DynamoDB as each client segment was validated.
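
The dual-write shape, reduced to a sketch with hypothetical store interfaces: writes fan out to both stores while SQL Server remains the system of record, and reads route to DynamoDB per validated client segment.

```python
from typing import Protocol

class InstrumentStore(Protocol):
    def get(self, token: str) -> dict: ...
    def put(self, token: str, record: dict) -> None: ...

class DualWriteVault:
    """Writes go to both stores; reads route per validated client segment."""

    def __init__(self, sql: InstrumentStore, dynamo: InstrumentStore,
                 migrated_segments: set):
        self.sql, self.dynamo = sql, dynamo
        self.migrated = migrated_segments

    def put(self, token: str, record: dict) -> None:
        self.sql.put(token, record)     # system of record during transition
        self.dynamo.put(token, record)  # shadow write

    def get(self, token: str, segment: str) -> dict:
        store = self.dynamo if segment in self.migrated else self.sql
        return store.get(token)
```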

Hard Problems

Designing the DynamoDB data model to support the vault’s access patterns was the core technical challenge. The SQL Server schema had been optimized over years for relational queries, cross-partition token lookups, and reporting workloads. DynamoDB required rethinking how instruments were indexed and accessed, particularly for the executive-assistant use case where a single user might access hundreds of instruments across organizational groups.
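
For illustration only, a single-table key layout in the spirit of that redesign (not the production schema): token lookup on the primary key, plus a global secondary index keyed by accessor so one user’s instruments across many groups resolve in a single query.

```python
# Not the production schema -- an illustrative single-table layout.
example_item = {
    "pk": "INSTRUMENT#tok_abc123",      # primary lookup by vault token
    "sk": "META",
    "gsi1_pk": "USER#u_789",            # GSI: "instruments this user can access"
    "gsi1_sk": "GROUP#finance#INSTRUMENT#tok_abc123",
    "expires_at": 1735689600,           # TTL attribute (epoch seconds)
}
# Querying gsi1_pk = "USER#u_789" returns every instrument the user can reach
# across groups in one call -- the executive-assistant access pattern.
```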

The dual-write migration had to be provably correct. Any inconsistency between SQL Server and DynamoDB during the transition could result in payment failures or data loss. We built reconciliation tooling that continuously compared records across both stores and flagged discrepancies.
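
A stripped-down version of the reconciliation loop; the field names and report sink are invented stand-ins for the real tooling, which streamed recent tokens and compared continuously.

```python
IGNORED_FIELDS = {"last_modified", "storage_version"}  # illustrative metadata fields

def normalize(record: dict) -> dict:
    """Strip store-specific metadata so only business fields are compared."""
    return {k: v for k, v in record.items() if k not in IGNORED_FIELDS}

def reconcile(tokens, sql_store, dynamo_store, report) -> int:
    """Compare each token across both stores; report and count mismatches."""
    mismatches = 0
    for token in tokens:
        if normalize(sql_store.get(token)) != normalize(dynamo_store.get(token)):
            report(token)     # flag for investigation
            mismatches += 1
    return mismatches
```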

Convincing stakeholders to approve a multi-quarter rearchitecture when the tactical fixes were “working” required building a clear financial and risk model. The tuning and partition toggling bought quarters, not years. I projected the growth curve against the purge capacity ceiling and showed exactly when we would hit structural failure again.
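
The shape of that projection, with invented numbers: compound the expiration rate quarter over quarter and find where it crosses the fixed purge ceiling.

```python
def quarters_until_ceiling(rate: float, growth: float, ceiling: float) -> int:
    """rate and ceiling in expirations/day; growth is a per-quarter multiplier."""
    quarters = 0
    while rate <= ceiling:
        rate *= growth
        quarters += 1
    return quarters

# With made-up numbers: 2M expirations/day growing 35% per quarter against a
# tuned ceiling of 5M/day leaves about four quarters of runway.
print(quarters_until_ceiling(2e6, 1.35, 5e6))  # -> 4
```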

Outcome

Phases 1 and 2 delivered immediate relief: purge backlogs cleared, compliance was maintained, and the procedures became reusable operational tools. Phase 3 fundamentally changed the economics. Even with only a subset of clients migrated, the new system met SLOs 50% tighter than the original service’s.

The projected cost savings from decommissioning the RDS infrastructure were over $1.4M per year: a db.m5.24xlarge deployment at roughly $129k/month (about $1.55M/year) replaced by DynamoDB at roughly $82k/year. TTL-based purging eliminated the compliance risk entirely: there were no batch jobs to fail, no backlogs to accumulate, no contention with production traffic.

What I’d Do Differently

I would have started the DynamoDB evaluation earlier. The tactical tuning was necessary to buy time, but I spent several sprints optimizing a system I already suspected was structurally inadequate. Starting the rearchitecture design in parallel with the tactical fixes, rather than sequentially, would have compressed the overall timeline.

The partition toggling approach, while clever under pressure, introduced operational complexity that lasted longer than it should have. It became a crutch that reduced the urgency for the real fix. If I had framed the partition toggling explicitly as a stopgap with a defined expiration date from the beginning, the transition to DynamoDB would have faced less organizational inertia.

On the DynamoDB migration itself, I would invest more in automated access-pattern validation before the dual-write phase. Only after traffic was flowing did we discover a few access patterns that behaved differently under DynamoDB’s consistency model, forcing quick adjustments; more thorough synthetic testing would have caught them earlier.
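
As one example of the kind of check I mean, assuming the illustrative table from earlier: a synthetic read-your-write probe per access pattern. Base-table reads can opt into strong consistency, but GSI queries are always eventually consistent, which is exactly the class of difference that surfaced late.

```python
import boto3

table = boto3.resource("dynamodb").Table("payment_instruments")  # hypothetical

def read_your_write_probe(token: str, record: dict) -> bool:
    """Write, then immediately read back with strong consistency."""
    table.put_item(Item={"instrument_token": token, **record})
    resp = table.get_item(
        Key={"instrument_token": token},
        ConsistentRead=True,  # available on base tables, never on GSIs
    )
    return "Item" in resp
```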