Upgrade Payment Polling Loop Causing CPU Spike
Summary
A single user with a failed upgrade payment caused an Effector reactive cascade in the frontend, generating ~218 requests per second to two backend API endpoints. This drove the api-server pod to 511m CPU and the production node to 78% CPU utilization.
Timeline
- ~15:00 UTC — Increased CPU usage observed on production node
- 15:03 UTC — Investigation began. Node at 78% CPU (1492m/1920m)
- 15:03 UTC — api-server pod identified as top consumer at 511m CPU
- 15:03 UTC — Nginx ingress logs analyzed: ~218 RPS to dynamic endpoints
- 15:03 UTC — Two endpoints identified as 87% of all traffic
- 15:03 UTC — Single Opera GX user-agent responsible for 84% of requests
- 15:05 UTC — App-backend logs confirmed single user from Portugal
Investigation Details
Node Status
| Node | CPU | Memory |
|---|---|---|
| prod-node-k8s-07 | 78% (1492m) | 68% (4616Mi) |
| prod-node-k8s-08 | 12% (238m) | 66% (4422Mi) |
Pod CPU on Affected Node
| Pod | CPU |
|---|---|
| api-server | 511m |
| queue-worker | 364m |
| landing-service | 44m |
| web-app | 22m |
Note: api-server runs as a single replica with no redundancy.
Traffic Analysis
Total dynamic RPS (excluding frontend static): ~258 avg, 411 peak
| Endpoint | Requests/min | % of total |
|---|---|---|
| /api/user/preferences | 7,152 (~119/s) | 47.5% |
| /api/billing/upgrade/check-status | 5,951 (~99/s) | 39.5% |
| /api/telemetry/events/log | 631 (~11/s) | 4.2% |
| Everything else | 1,335 (~22/s) | 8.8% |
User-Agent Breakdown
| User-Agent | Requests | % |
|---|---|---|
| Chrome/142 Opera GX/126 (Windows 10) | 16,191 | 83.8% |
| All other UAs combined | 3,125 | 16.2% |
The Opera GX user-agent sent requests to only two endpoints (/api/user/preferences and /api/billing/upgrade/check-status), with zero requests to frontend/static resources — confirming this is not normal browser behavior.
Cloudflare Analysis
Only 4 Cloudflare proxy IPs observed for the Opera UA, indicating traffic from 1–2 Cloudflare edge locations (likely a single client):
| Cloudflare IP | Requests |
|---|---|
| 172.71.xx.77 | 4,879 |
| 172.71.xx.78 | 4,483 |
| 172.71.xx.40 | 3,556 |
| 172.71.xx.39 | 3,273 |
Identified User
| Field | Value |
|---|---|
| Country | Portugal (CF-Ipcountry: PT) |
| Cloudflare Edge | AMS (Amsterdam) |
| Browser | Opera GX 126 (Chromium 142), Windows 10 |
| Page | /workspace |
| Failed Attempts Counter | 5,392 (and rising) |
Payment Status
The user has two upgrade products with auth_failed payment status:
| Product ID | Order Status | Transaction Status |
|---|---|---|
| 2c34****-7f23-****-abc2-****e0266e8b | auth_failed | none |
| b481****-7a78-****-b211-****2b08f351 | auth_failed | none |
Root Cause Analysis
The Bug
The frontend uses Effector (reactive state management). A reactive cascade creates a tight loop with no delay:
- Route opened (/workspace) triggers checkUpgradePurchasedFx
- Response returns auth_failed status for both products
- checkUpgradeStatus() correctly maps auth_failed to "Fail"
- The 5-second polling correctly stops on "Fail"
- However, the failure detection logic fires upgradeFailedAttemptRegistered
- This triggers two metadata writes (upgrade_failed_attempts_count + upgrade_failed_payment_ids)
- Metadata updates propagate through Effector stores, triggering reactive samples that re-check purchase status
- This creates a synchronous reactive cascade — no interval, no delay — generating hundreds of requests per second
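The loop above can be sketched without Effector as a minimal hand-rolled store with synchronous subscribers (all names here are hypothetical; the real app wires this through Effector stores and samples). The key property is that each metadata write synchronously re-runs the status check, which writes again, with no interval in between:

```typescript
// Minimal simulation of the reactive cascade (hypothetical names; the real
// app uses Effector, not this hand-rolled pub/sub).

type Listener = () => void;

class Store<T> {
  private listeners: Listener[] = [];
  constructor(private value: T) {}
  get(): T { return this.value; }
  // Synchronous fan-out: subscribers run inside the same call stack as set().
  set(v: T) { this.value = v; this.listeners.forEach((l) => l()); }
  watch(l: Listener) { this.listeners.push(l); }
}

let apiCalls = 0;
const DEMO_CAP = 10; // stand-in for "until the tab closes"

// Store holding the failed-attempts metadata.
const failedAttempts = new Store(0);

// Simulated backend call: always returns auth_failed for this user.
function checkUpgradeStatus(): "Fail" { apiCalls++; return "Fail"; }

// Reactive sample: any metadata change re-checks purchase status...
failedAttempts.watch(() => {
  if (failedAttempts.get() >= DEMO_CAP) return; // cap so the demo halts
  if (checkUpgradeStatus() === "Fail") {
    // ...and a "Fail" result writes metadata again: a tight loop, no delay.
    failedAttempts.set(failedAttempts.get() + 1);
  }
});

failedAttempts.set(1); // route open triggers the first check
// apiCalls is now 9: every write triggered a check with zero delay between them.
```

Without the `DEMO_CAP` escape hatch, nothing in this chain ever terminates, which matches the observed counter of 5,392 and climbing.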
Key Code Locations
| File | Lines | Description |
|---|---|---|
| modules/helpers.ts | 302-336 | Payment status mapping (auth_failed -> "Fail") |
| modules/upgrade/model/index.ts | 20-22 | Constants: MAX_UPGRADE_FAIL_ATTEMPTS = 5 |
| modules/upgrade/model/index.ts | 470-481 | Metadata refresh on route events |
| modules/upgrade/model/index.ts | 553-615 | Failed attempt detection and registration |
| modules/upgrade/model/index.ts | 617-633 | Metadata writes on failure (triggers cascade) |
| modules/upgrade/ui/premium-tools.ts | 300-313 | Source-based reactive check (fires on store changes) |
| backend/billing/gateway_handlers.go | 2671-2695 | Backend handler (no rate limiting) |
Why MAX_UPGRADE_FAIL_ATTEMPTS = 5 Didn't Help
The limit only prevents new purchase attempts (upgradePurchased event). It does not stop the reactive cascade that continuously checks purchase status and writes metadata. The counter reached 5,392 — far beyond the limit of 5.
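A simplified reconstruction of the control flow (hypothetical names, inferred from the observed behavior, not the actual source) makes the gap visible: the limit is consulted only on the purchase path, while the metadata write runs unconditionally.

```typescript
// Hypothetical reconstruction: the attempt limit gates only new purchase
// attempts, not the status re-check / metadata-write loop.

const MAX_UPGRADE_FAIL_ATTEMPTS = 5;

function onStatusFail(attempts: number): {
  newPurchaseAllowed: boolean;
  metadataWritten: boolean;
} {
  // The only place the limit is consulted:
  const newPurchaseAllowed = attempts < MAX_UPGRADE_FAIL_ATTEMPTS;
  // The metadata write, which re-triggers the status check, has no guard:
  const metadataWritten = true;
  return { newPurchaseAllowed, metadataWritten };
}
```

At `attempts = 5392` the purchase gate has long been closed, but `metadataWritten` is still true on every iteration, so the cascade never terminates.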
Impact
- CPU: Node at 78%, api-server at 511m CPU from a single user
- Latency: Bimodal response times, with some requests at ~30-40ms and others at ~600-700ms, indicating the backend is struggling under load
- Risk: api-server runs as a single replica. Sustained load could degrade service for all users
- Database: Each loop iteration writes to user_metadata table and reads from purchase history — sustained ~218 queries/second from one user
Recommended Remediation
- Immediate: Break the reactive cascade — add a guard in the failed attempt registration logic to stop re-checking once failed_attempts_count >= MAX_FAIL_ATTEMPTS
- Short-term: Add debounce/throttle to the metadata write samples to prevent tight loops
- Short-term: Add per-user rate limiting on /api/user/preferences and /api/billing/upgrade/check-status in the backend
- Medium-term: Review all Effector reactive chains for potential cascading loops
- Medium-term: Consider scaling api-server to multiple replicas for resilience
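The immediate and first short-term fixes above could look roughly like this (plain TypeScript rather than real Effector code; all names are hypothetical sketches, and the real implementation could use Effector's `sample` filters and patronum's `throttle`):

```typescript
// Sketch of the two frontend fixes (hypothetical names).

const MAX_UPGRADE_FAIL_ATTEMPTS = 5;

// Fix 1: guard — stop registering failed attempts once the limit is hit,
// breaking the write -> re-check -> write cycle.
function shouldRegisterFailedAttempt(attempts: number): boolean {
  return attempts < MAX_UPGRADE_FAIL_ATTEMPTS;
}

// Fix 2: throttle — collapse a burst of metadata writes into at most one
// per time window, so even an unguarded loop cannot saturate the API.
function makeThrottled(fn: () => void, ms: number): () => void {
  let last = 0;
  return () => {
    const now = Date.now();
    if (now - last >= ms) {
      last = now;
      fn();
    }
  };
}

let writes = 0;
const writeMetadata = makeThrottled(() => { writes++; }, 1000);

// A synchronous burst of 1,000 triggers now produces a single write.
for (let i = 0; i < 1000; i++) writeMetadata();
```

The guard is the actual fix; the throttle is defense in depth so that a future cascade elsewhere in the Effector graph degrades to one request per second instead of hundreds.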
Want this level of investigation for your infrastructure?
Book a Demo →