Upgrade Payment Polling Loop Causing CPU Spike
A customer’s production node spiked to 78% CPU — no deployment, no traffic surge, nothing in the changelogs. We traced the load to a single user’s browser. A failed upgrade payment had silently triggered a reactive loop in the frontend, flooding two API endpoints at 218 requests per second. Left unchecked, this would have degraded the service for every user on that node. We identified the root cause down to the exact line of code and delivered a remediation plan within five minutes.
Timeline
- ~15:00 UTC — Increased CPU usage observed on production node
- 15:03 UTC — Investigation began. Node at 78% CPU (1492m/1920m)
- 15:03 UTC — api-server pod identified as top consumer at 511m CPU
- 15:03 UTC — Nginx ingress logs analyzed: ~218 RPS to dynamic endpoints
- 15:03 UTC — Two endpoints identified as 87% of all traffic
- 15:03 UTC — Single Opera GX user-agent responsible for 84% of requests
- 15:05 UTC — App-backend logs confirmed a single user in Portugal
Investigation Details
Node Status
| Node | CPU | Memory |
|---|---|---|
| prod-node-k8s-07 | 78% (1492m) | 68% (4616Mi) |
| prod-node-k8s-08 | 12% (238m) | 66% (4422Mi) |
Pod CPU on Affected Node
| Pod | CPU |
|---|---|
| api-server | 511m |
| queue-worker | 364m |
| landing-service | 44m |
| web-app | 22m |
Note: api-server runs as a single replica with no redundancy.
Traffic Analysis
Total dynamic RPS (excluding frontend static assets): ~258 avg, ~411 peak
| Endpoint | Requests/min | % of total |
|---|---|---|
| /api/user/preferences | 7,152 (~119/s) | 47.5% |
| /api/billing/upgrade/check-status | 5,951 (~99/s) | 39.5% |
| /api/telemetry/events/log | 631 (~11/s) | 4.2% |
| Everything else | 1,335 | 8.8% |
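As a rough illustration of how per-endpoint counts like those above can be derived, here is a small TypeScript sketch that aggregates request paths from nginx access-log lines. The combined log format is assumed, and the sample lines are fabricated for the demo; the actual ingress log format and analysis tooling may differ.

```typescript
// Aggregate per-endpoint request counts from nginx access-log lines.
// Assumes the combined log format, where the request appears as a
// quoted `"METHOD /path HTTP/x.y"` field.
function countEndpoints(lines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    // Capture the path (query string stripped) from the request field.
    const m = line.match(/"(?:GET|POST|PUT|DELETE|PATCH) ([^ ?"]+)/);
    if (!m) continue;
    counts.set(m[1], (counts.get(m[1]) ?? 0) + 1);
  }
  return counts;
}

// Fabricated sample lines for illustration only.
const sampleLines = [
  '172.71.0.77 - - [05/Nov/2025:15:03:01 +0000] "GET /api/user/preferences HTTP/1.1" 200 312',
  '172.71.0.77 - - [05/Nov/2025:15:03:01 +0000] "GET /api/billing/upgrade/check-status HTTP/1.1" 200 98',
  '172.71.0.78 - - [05/Nov/2025:15:03:02 +0000] "GET /api/user/preferences HTTP/1.1" 200 312',
];

const counts = countEndpoints(sampleLines);
// counts.get("/api/user/preferences") is 2; check-status is 1
```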
User-Agent Breakdown
| User-Agent | Requests | % |
|---|---|---|
| Chrome/142 Opera GX/126 (Windows 10) | 16,191 | 83.8% |
| All other UAs combined | 3,125 | 16.2% |
The Opera GX user-agent sent requests to only two endpoints (/api/user/preferences and /api/billing/upgrade/check-status), with zero requests to frontend/static resources — confirming this is not normal browser behavior.
Cloudflare Analysis
Only 4 Cloudflare proxy IPs observed for the Opera UA, indicating traffic from 1–2 Cloudflare edge locations (likely a single client):
| Cloudflare IP | Requests |
|---|---|
| 172.71.xx.77 | 4,879 |
| 172.71.xx.78 | 4,483 |
| 172.71.xx.40 | 3,556 |
| 172.71.xx.39 | 3,273 |
Identified User
| Field | Value |
|---|---|
| Country | Portugal (CF-Ipcountry: PT) |
| Cloudflare Edge | AMS (Amsterdam) |
| Browser | Opera GX 126 (Chromium 142), Windows 10 |
| Page | /workspace |
| Failed Attempts Counter | 5,392 (and rising) |
Payment Status
The user has two upgrade products with auth_failed payment status:
| Product ID | Order Status | Transaction Status |
|---|---|---|
| 2c34****-7f23-****-abc2-****e0266e8b | auth_failed | none |
| b481****-7a78-****-b211-****2b08f351 | auth_failed | none |
Root Cause Analysis
The Bug
The frontend uses Effector (a reactive state-management library). A reactive cascade creates a tight loop with no delay:
- Route opened (/workspace) triggers checkUpgradePurchasedFx
- Response returns auth_failed status for both products
- checkUpgradeStatus() correctly maps auth_failed to "Fail"
- The 5-second polling correctly stops on "Fail"
- However, the failure detection logic fires upgradeFailedAttemptRegistered
- This triggers two metadata writes (upgrade_failed_attempts_count + upgrade_failed_payment_ids)
- Metadata updates propagate through Effector stores, triggering reactive samples that re-check purchase status
- This creates a synchronous reactive cascade — no interval, no delay — generating hundreds of requests per second
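The cascade above can be reproduced in miniature. The sketch below is a dependency-free model in plain TypeScript: the real code wires Effector `sample` calls, and all names here are hypothetical stand-ins. A store subscriber re-runs the status check on every metadata write, so a single failed check recurses synchronously.

```typescript
// Simplified model of the cascade; plain TypeScript, no Effector.
// All names are hypothetical stand-ins for the real effects and stores.
let requestCount = 0;   // total API calls across both endpoints
let failedAttempts = 0; // mirrors upgrade_failed_attempts_count
const subscribers: Array<() => void> = [];

// Stand-in for checkUpgradePurchasedFx: the API keeps returning auth_failed.
function checkUpgradeStatus(): "Success" | "Fail" {
  requestCount++; // GET /api/billing/upgrade/check-status
  return "Fail";
}

// Stand-in for the metadata write; the resulting store update
// notifies every reactive subscriber synchronously.
function writeFailedAttemptMetadata(): void {
  requestCount++; // POST /api/user/preferences
  failedAttempts++;
  subscribers.forEach((fn) => fn());
}

// The buggy wiring: any store change re-runs the status check.
// Capped at 50 iterations here only so the demo terminates; the
// production loop has no cap and stops only when the tab closes.
subscribers.push(() => {
  if (failedAttempts >= 50) return;
  if (checkUpgradeStatus() === "Fail") writeFailedAttemptMetadata();
});

// Opening /workspace fires one check, and the whole loop unwinds from it.
if (checkUpgradeStatus() === "Fail") writeFailedAttemptMetadata();
// requestCount is now 100: one route visit produced 100 API calls.
```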
Key Code Locations
| File | Lines | Description |
|---|---|---|
| modules/helpers.ts | 302-336 | Payment status mapping (auth_failed -> "Fail") |
| modules/upgrade/model/index.ts | 20-22 | Constants: MAX_UPGRADE_FAIL_ATTEMPTS = 5 |
| modules/upgrade/model/index.ts | 470-481 | Metadata refresh on route events |
| modules/upgrade/model/index.ts | 553-615 | Failed attempt detection and registration |
| modules/upgrade/model/index.ts | 617-633 | Metadata writes on failure (triggers cascade) |
| modules/upgrade/ui/premium-tools.ts | 300-313 | Source-based reactive check (fires on store changes) |
| backend/billing/gateway_handlers.go | 2671-2695 | Backend handler (no rate limiting) |
Why MAX_UPGRADE_FAIL_ATTEMPTS = 5 Didn't Help
The limit only prevents new purchase attempts (upgradePurchased event). It does not stop the reactive cascade that continuously checks purchase status and writes metadata. The counter reached 5,392 — far beyond the limit of 5.
Impact
- CPU: Node at 78%, api-server at 511m CPU from a single user
- Latency: Bimodal response times — some requests at ~30-40ms, others at ~600-700ms — indicating a backend struggling under load
- Risk: api-server runs as a single replica. Sustained load could degrade service for all users
- Database: Each loop iteration writes to user_metadata table and reads from purchase history — sustained ~218 queries/second from one user
Recommended Remediation
- Immediate: Break the reactive cascade — add a guard in the failed attempt registration logic to stop re-checking once upgrade_failed_attempts_count >= MAX_UPGRADE_FAIL_ATTEMPTS
- Short-term: Add debounce/throttle to the metadata write samples to prevent tight loops
- Short-term: Add per-user rate limiting on /api/user/preferences and /api/billing/upgrade/check-status in the backend
- Medium-term: Review all Effector reactive chains for potential cascading loops
- Medium-term: Consider scaling api-server to multiple replicas for resilience
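The immediate fix can be sketched in a simplified, dependency-free form (plain TypeScript, hypothetical names; in Effector the equivalent is a `filter` on the `sample` that registers failed attempts). Once the counter reaches the limit, the subscriber stops re-checking and the loop terminates after a bounded number of requests.

```typescript
// Guarded version of the failure-registration subscriber. Plain
// TypeScript model with hypothetical names; in Effector this guard
// would be a `filter` on the failure-registration sample.
const MAX_UPGRADE_FAIL_ATTEMPTS = 5;

let requestCount = 0;
let failedAttempts = 0;
const subscribers: Array<() => void> = [];

function checkUpgradeStatus(): "Success" | "Fail" {
  requestCount++; // GET /api/billing/upgrade/check-status
  return "Fail";  // payment stays auth_failed
}

function writeFailedAttemptMetadata(): void {
  requestCount++; // POST /api/user/preferences
  failedAttempts++;
  subscribers.forEach((fn) => fn());
}

subscribers.push(() => {
  // Guard: stop re-checking once the attempt limit is reached.
  if (failedAttempts >= MAX_UPGRADE_FAIL_ATTEMPTS) return;
  if (checkUpgradeStatus() === "Fail") writeFailedAttemptMetadata();
});

if (checkUpgradeStatus() === "Fail") writeFailedAttemptMetadata();
// requestCount is now 10: bounded at 2 * MAX_UPGRADE_FAIL_ATTEMPTS,
// instead of growing without limit.
```

Debounce or throttle on the metadata-write samples would further smooth any remaining bursts, but the guard alone converts an unbounded loop into a bounded one.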
Want this level of investigation for your infrastructure?
Book a Demo →