
essay / payments

Partition toggling: a compliance hack under pressure

When the purge backlog grew faster than we could process it, the fix was not a faster purge job. It was using the database topology we already had.

PCI compliance requires purging payment instruments and verification data within a set time window. As Expedia’s payment platform onboarded more brands and the CIT/MIT (cardholder- and merchant-initiated transaction) framework increased the volume of stored instrument records, our purge jobs were constantly racing to keep up.

Then SQL Server patching took the jobs offline for a few days. When they came back, the backlog was growing faster than we could process it. We were days away from falling out of compliance.

The constraint

The purge jobs competed for database resources with our production API. Running them harder meant degrading the booking path. Restricting them to off-peak hours was not enough to close the gap. We needed a way to dramatically increase purge throughput without impacting live traffic.

Seeing the topology differently

I pulled the numbers and noticed three things: the purge job’s performance impact fell primarily on GET calls; over 95% of GET requests for a saved instrument happened within 24 hours of the POST; and our older database partitions still had capacity to accept new registrations.

This meant we could toggle where new registrations landed: redirect them to a different partition each week, then run aggressive purge jobs on the partition that was no longer receiving new writes. The 95% of reads that happened within 24 hours would hit the active partition. Only the remaining 5%, from reporting applications outside the booking path, would cross partitions.
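To make the mechanism concrete, here is a minimal Python sketch under assumed names (PartitionRouter, insert, lookup are illustrative, not the real service): writes always go to the currently active partition, reads try it first, and only the small tail of lookups falls back across partitions.

    # Illustrative partition-toggling router; names and interfaces are hypothetical,
    # not our actual service code. Each "partition" is assumed to expose
    # insert(token, record) and lookup(token) against its own database.
    class PartitionRouter:
        def __init__(self, partitions):
            self.partitions = partitions   # e.g. [partition_a, partition_b]
            self.active_index = 0          # flipped on a weekly schedule

        def toggle_active(self):
            # Flip which partition receives new registrations (run weekly).
            self.active_index = (self.active_index + 1) % len(self.partitions)

        def save_instrument(self, token, record):
            # New registrations land only on the active partition, so the purge
            # job can run aggressively against the other one.
            self.partitions[self.active_index].insert(token, record)

        def get_instrument(self, token):
            # Over 95% of reads arrive within 24 hours of the write, so the
            # active partition answers almost all of them on the first try.
            record = self.partitions[self.active_index].lookup(token)
            if record is not None:
                return record
            # The small remainder (reporting traffic, outside the booking path)
            # falls back to the partition currently being purged.
            for i, partition in enumerate(self.partitions):
                if i != self.active_index:
                    record = partition.lookup(token)
                    if record is not None:
                        return record
            return None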

The risk

This had never been done before. Moving registrations between partitions meant the cross-partition token master dependency would be exercised in a way it had not been tested at scale. I documented the risk, wrote the rollback procedure, and ran a pre-mortem review with the team.

The outcome

The deployment went smoothly. Registration latency had a slight bump from the cross-partition dependency, but stayed within SLOs. Within the first week, running the purge job at 4x throughput for longer windows cleared the backlog entirely. We moved registrations back to the primary partition and documented the entire procedure.

The procedure was used once more when we needed to reindex certain tables, proving the pattern was reusable beyond the original emergency.

The longer arc

This experience also made it clear that our purge architecture was not going to scale with the platform’s growth trajectory. I proposed migrating to DynamoDB with TTL-based purging, eliminating the purge jobs entirely. That became a multi-quarter project. Even with only a subset of clients onboarded to the new system, we hit SLOs that were 50% tighter than the original service’s. The projected RDS cost savings alone were over $1.4M per year.
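As a rough illustration of how TTL-based purging removes the need for purge jobs, here is a Python sketch using boto3; the table name, attribute names, and retention window are assumptions, not the actual schema. Each record carries an expiry timestamp, and DynamoDB deletes it in the background (typically within a day or two of expiry), so the TTL has to sit comfortably inside the compliance deadline.

    import time
    import boto3

    # Hypothetical table and attribute names, for illustration only.
    TABLE_NAME = "payment_instruments"
    TTL_ATTRIBUTE = "expires_at"
    RETENTION_SECONDS = 30 * 24 * 3600   # assumed retention window, not the real PCI figure

    dynamodb = boto3.client("dynamodb")

    # One-time setup: tell DynamoDB which attribute holds the expiry timestamp.
    dynamodb.update_time_to_live(
        TableName=TABLE_NAME,
        TimeToLiveSpecification={"Enabled": True, "AttributeName": TTL_ATTRIBUTE},
    )

    def save_instrument(token, payload):
        # Write a record that DynamoDB will expire on its own; no purge job needed.
        dynamodb.put_item(
            TableName=TABLE_NAME,
            Item={
                "token": {"S": token},
                "payload": {"S": payload},
                TTL_ATTRIBUTE: {"N": str(int(time.time()) + RETENTION_SECONDS)},
            },
        )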

Sometimes the right response to an emergency is a clever workaround. But if the emergency reveals a structural problem, the workaround should come with a proposal for the real fix.