INC-2026-01-29

Upgrade Payment Polling Loop Causing CPU Spike

A customer’s production node spiked to 78% CPU — no deployment, no traffic surge, nothing in the changelogs. We traced the load to a single user’s browser. A failed upgrade payment had silently triggered a reactive loop in the frontend, flooding two API endpoints at 218 requests per second. Left unchecked, this would have degraded the service for every user on that node. We identified the root cause down to the exact line of code and delivered a remediation plan within five minutes.

Date:             2026-01-29
Severity:         High
Status:           Resolved
Affected Service: api-server
Detected By:      Observability system (CPU alert)
Client:           ACME Corp (EdTech platform)

Timeline

  • ~15:00 UTC — Increased CPU usage observed on production node
  • 15:03 UTC — Investigation began. Node at 78% CPU (1492m/1920m)
  • 15:03 UTC — api-server pod identified as top consumer at 511m CPU
  • 15:03 UTC — Nginx ingress logs analyzed: ~218 RPS to dynamic endpoints
  • 15:03 UTC — Two endpoints identified as 87% of all traffic
  • 15:03 UTC — Single Opera GX user-agent responsible for 84% of requests
  • 15:05 UTC — App-backend logs confirmed single user from Portugal

Investigation Details

Node Status

Node              CPU          Memory
prod-node-k8s-07  78% (1492m)  68% (4616Mi)
prod-node-k8s-08  12% (238m)   66% (4422Mi)

Pod CPU on Affected Node

Pod              CPU
api-server       511m
queue-worker     364m
landing-service  44m
web-app          22m

Note: api-server runs as a single replica with no redundancy.

Traffic Analysis

Total dynamic RPS (excluding frontend static): ~258 avg, 411 peak

Endpoint                           Requests/min    % of total
/api/user/preferences              7,152 (~119/s)  47.5%
/api/billing/upgrade/check-status  5,951 (~99/s)   39.5%
/api/telemetry/events/log          631 (~11/s)     4.2%
Everything else                    1,335           8.8%

User-Agent Breakdown

User-Agent                            Requests  %
Chrome/142 Opera GX/126 (Windows 10)  16,191    83.8%
All other UAs combined                3,125     16.2%

The Opera GX user-agent sent requests to only two endpoints (/api/user/preferences and /api/billing/upgrade/check-status), with zero requests to frontend/static resources — confirming this is not normal browser behavior.

Cloudflare Analysis

Only 4 Cloudflare proxy IPs observed for the Opera UA, indicating traffic from 1–2 Cloudflare edge locations (likely a single client):

Cloudflare IP  Requests
172.71.xx.77   4,879
172.71.xx.78   4,483
172.71.xx.40   3,556
172.71.xx.39   3,273

Identified User

Field                    Value
Country                  Portugal (CF-IPCountry: PT)
Cloudflare Edge          AMS (Amsterdam)
Browser                  Opera GX 126 (Chromium 142), Windows 10
Page                     /workspace
Failed Attempts Counter  5,392 (and rising)

Payment Status

The user has two upgrade products with auth_failed payment status:

Product ID                            Order Status  Transaction Status
2c34****-7f23-****-abc2-****e0266e8b  auth_failed   none
b481****-7a78-****-b211-****2b08f351  auth_failed   none

Root Cause Analysis

The Bug

The frontend uses Effector (reactive state management). A reactive cascade creates a tight loop with no delay:

  • Route opened (/workspace) triggers checkUpgradePurchasedFx
  • Response returns auth_failed status for both products
  • checkUpgradeStatus() correctly maps auth_failed to "Fail"
  • The 5-second polling correctly stops on "Fail"
  • However, the failure detection logic fires upgradeFailedAttemptRegistered
  • This triggers two metadata writes (upgrade_failed_attempts_count + upgrade_failed_payment_ids)
  • Metadata updates propagate through Effector stores, triggering reactive samples that re-check purchase status
  • This creates a synchronous reactive cascade — no interval, no delay — generating hundreds of requests per second
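The loop described above can be reconstructed as a minimal sketch. This is plain TypeScript standing in for the actual Effector wiring — the function names and the 1,000-iteration cap are illustrative assumptions, not the production code:

```typescript
// Hypothetical reconstruction of the cascade (plain TypeScript, not the
// actual Effector code): a "Fail" status triggers a metadata write, and the
// resulting store update synchronously re-triggers the status check, with
// no interval or delay anywhere in the chain.

type PaymentStatus = "Success" | "Pending" | "Fail";

let requestCount = 0;
let failedAttempts = 0;

// Stands in for GET /api/billing/upgrade/check-status
function checkUpgradeStatus(): PaymentStatus {
  requestCount++;
  return "Fail"; // both products are stuck in auth_failed
}

// Stands in for the metadata write (upgrade_failed_attempts_count, etc.)
// via /api/user/preferences; the callback models the reactive store update.
function writeFailureMetadata(onMetadataChanged: () => void): void {
  requestCount++;
  failedAttempts++;
  onMetadataChanged(); // store change fires subscribers synchronously
}

// The cascade: status check -> metadata write -> store update -> status
// check again. Capped at 1000 iterations here purely so the demo ends.
function runCascade(maxIterations: number): void {
  let iterations = 0;
  const recheck = (): void => {
    if (iterations++ >= maxIterations) return;
    if (checkUpgradeStatus() === "Fail") {
      writeFailureMetadata(recheck); // synchronous re-entry
    }
  };
  recheck();
}

runCascade(1000);
console.log(requestCount); // two API requests per iteration: 2000
```

Every iteration costs two backend requests, which matches the near 50/50 split between the two dominant endpoints in the traffic table.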

Key Code Locations

File                                 Lines      Description
modules/helpers.ts                   302-336    Payment status mapping (auth_failed -> "Fail")
modules/upgrade/model/index.ts       20-22      Constants: MAX_UPGRADE_FAIL_ATTEMPTS = 5
modules/upgrade/model/index.ts       470-481    Metadata refresh on route events
modules/upgrade/model/index.ts       553-615    Failed attempt detection and registration
modules/upgrade/model/index.ts       617-633    Metadata writes on failure (triggers cascade)
modules/upgrade/ui/premium-tools.ts  300-313    Source-based reactive check (fires on store changes)
backend/billing/gateway_handlers.go  2671-2695  Backend handler (no rate limiting)

Why MAX_UPGRADE_FAIL_ATTEMPTS = 5 Didn't Help

The limit only prevents new purchase attempts (upgradePurchased event). It does not stop the reactive cascade that continuously checks purchase status and writes metadata. The counter reached 5,392 — far beyond the limit of 5.
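A guard at the registration step would have contained this. The sketch below is an assumption about how such a guard could look — the field names mirror the metadata keys in this report, but the function and its signature are hypothetical, and the real Effector wiring is omitted:

```typescript
// Hypothetical guard: once the counter passes the limit, registration
// becomes a no-op. No metadata write means no store update, so the
// reactive chain has nothing to re-fire on and the cascade dies.

const MAX_UPGRADE_FAIL_ATTEMPTS = 5;

interface UserMetadata {
  upgrade_failed_attempts_count: number;
  upgrade_failed_payment_ids: string[];
}

function registerFailedAttempt(
  meta: UserMetadata,
  paymentId: string
): { meta: UserMetadata; changed: boolean } {
  // The check the current code lacks: past the limit, write nothing.
  if (meta.upgrade_failed_attempts_count >= MAX_UPGRADE_FAIL_ATTEMPTS) {
    return { meta, changed: false };
  }
  return {
    meta: {
      upgrade_failed_attempts_count: meta.upgrade_failed_attempts_count + 1,
      upgrade_failed_payment_ids: meta.upgrade_failed_payment_ids.includes(paymentId)
        ? meta.upgrade_failed_payment_ids
        : [...meta.upgrade_failed_payment_ids, paymentId],
    },
    changed: true,
  };
}

// Repeated failures stop mutating state after five registrations.
let state: UserMetadata = {
  upgrade_failed_attempts_count: 0,
  upgrade_failed_payment_ids: [],
};
let writes = 0;
for (let i = 0; i < 100; i++) {
  const result = registerFailedAttempt(state, "2c34****");
  if (result.changed) writes++;
  state = result.meta;
}
console.log(writes, state.upgrade_failed_attempts_count); // 5 5
```

The key property is that the guarded function is idempotent beyond the limit: with the counter at 5,392 this would have silenced the loop on the very first re-check.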

Impact

  • CPU: Node at 78%, api-server at 511m CPU from a single user
  • Latency: Bimodal response times — some requests at ~30-40ms, others at ~600-700ms, indicating backend struggling under load
  • Risk: api-server runs as a single replica. Sustained load could degrade service for all users
  • Database: Each loop iteration writes to user_metadata table and reads from purchase history — sustained ~218 queries/second from one user

Recommended Remediation

  • Immediate: Break the reactive cascade — add a guard in the failed attempt registration logic to stop re-checking once failed_attempts_count >= MAX_FAIL_ATTEMPTS
  • Short-term: Add debounce/throttle to the metadata write samples to prevent tight loops
  • Short-term: Add per-user rate limiting on /api/user/preferences and /api/billing/upgrade/check-status in the backend
  • Medium-term: Review all Effector reactive chains for potential cascading loops
  • Medium-term: Consider scaling api-server to multiple replicas for resilience
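The backend handler is Go, but the shape of the recommended per-user rate limit is language-agnostic; here is a token-bucket sketch in TypeScript. The class name and the burst/refill figures are illustrative assumptions, not proposed production values:

```typescript
// Per-user token bucket: each user gets a burst allowance that refills at a
// fixed sustained rate. A runaway client is capped; well-behaved users are
// unaffected because their buckets stay near full.

interface Bucket {
  tokens: number;
  lastRefill: number; // ms timestamp
}

class PerUserRateLimiter {
  private buckets = new Map<string, Bucket>();

  constructor(
    private capacity: number,        // burst size
    private refillPerSecond: number  // sustained allowed rate
  ) {}

  allow(userId: string, nowMs: number): boolean {
    const bucket = this.buckets.get(userId) ?? {
      tokens: this.capacity,
      lastRefill: nowMs,
    };
    // Refill proportionally to elapsed time, capped at capacity.
    const elapsedSec = (nowMs - bucket.lastRefill) / 1000;
    bucket.tokens = Math.min(
      this.capacity,
      bucket.tokens + elapsedSec * this.refillPerSecond
    );
    bucket.lastRefill = nowMs;

    const ok = bucket.tokens >= 1;
    if (ok) bucket.tokens -= 1;
    this.buckets.set(userId, bucket);
    return ok;
  }
}

// A client firing 218 requests in the same instant (the observed rate)
// gets only the burst through; everything else is rejected.
const limiter = new PerUserRateLimiter(10, 2); // 10 burst, 2 req/s sustained
let allowed = 0;
for (let i = 0; i < 218; i++) {
  if (limiter.allow("opera-gx-user", 0)) allowed++;
}
console.log(allowed); // 10
```

In the Go handler this would typically sit as middleware in front of the two hot endpoints, keyed on the authenticated user ID rather than IP, since all of this traffic arrived through a handful of Cloudflare proxy IPs.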

Want this level of investigation for your infrastructure?

Book a Demo →