INC-2026-01-29

Upgrade Payment Polling Loop Causing CPU Spike

A customer’s production node spiked to 78% CPU — no deployment, no traffic surge, nothing in the changelogs. We traced the load to a single user’s browser. A failed upgrade payment had silently triggered a reactive loop in the frontend, flooding two API endpoints at 218 requests per second. Left unchecked, this would have degraded the service for every user on that node. We identified the root cause down to the exact line of code and delivered a remediation plan within five minutes.

Date:             2026-01-29
Severity:         High
Status:           Resolved
Affected Service: api-server
Detected By:      Observability system (CPU alert)
Client:           ACME Corp (EdTech platform)

Timeline

  • ~15:00 UTC — Increased CPU usage observed on production node
  • 15:03 UTC — Investigation began. Node at 78% CPU (1492m/1920m)
  • 15:03 UTC — api-server pod identified as top consumer at 511m CPU
  • 15:03 UTC — Nginx ingress logs analyzed: ~218 RPS to dynamic endpoints
  • 15:03 UTC — Two endpoints identified as 87% of all traffic
  • 15:03 UTC — Single Opera GX user-agent responsible for 84% of requests
  • 15:05 UTC — App-backend logs confirmed single user from Portugal

Investigation Details

Node Status

Node              CPU          Memory
prod-node-k8s-07  78% (1492m)  68% (4616Mi)
prod-node-k8s-08  12% (238m)   66% (4422Mi)

Pod CPU on Affected Node

Pod              CPU
api-server       511m
queue-worker     364m
landing-service  44m
web-app          22m

Note: api-server runs as a single replica with no redundancy.

Traffic Analysis

Total dynamic RPS (excluding frontend static): ~258 avg, 411 peak

Endpoint                           Requests/min    % of total
/api/user/preferences              7,152 (~119/s)  47.5%
/api/billing/upgrade/check-status  5,951 (~99/s)   39.5%
/api/telemetry/events/log          631 (~11/s)     4.2%
Everything else                    1,335           8.8%

User-Agent Breakdown

User-Agent                            Requests  %
Chrome/142 Opera GX/126 (Windows 10)  16,191    83.8%
All other UAs combined                3,125     16.2%

The Opera GX user-agent sent requests to only two endpoints (/api/user/preferences and /api/billing/upgrade/check-status), with zero requests to frontend/static resources — confirming this is not normal browser behavior.

Cloudflare Analysis

Only 4 Cloudflare proxy IPs observed for the Opera UA, indicating traffic from 1–2 Cloudflare edge locations (likely a single client):

Cloudflare IP  Requests
172.71.xx.77   4,879
172.71.xx.78   4,483
172.71.xx.40   3,556
172.71.xx.39   3,273

Identified User

Field                    Value
Country                  Portugal (CF-IPCountry: PT)
Cloudflare Edge          AMS (Amsterdam)
Browser                  Opera GX 126 (Chromium 142), Windows 10
Page                     /workspace
Failed Attempts Counter  5,392 (and rising)

Payment Status

The user has two upgrade products with auth_failed payment status:

Product ID                            Order Status  Transaction Status
2c34****-7f23-****-abc2-****e0266e8b  auth_failed   none
b481****-7a78-****-b211-****2b08f351  auth_failed   none

Root Cause Analysis

The Bug

The frontend uses Effector (reactive state management). A reactive cascade creates a tight loop with no delay:

  • Route opened (/workspace) triggers checkUpgradePurchasedFx
  • Response returns auth_failed status for both products
  • checkUpgradeStatus() correctly maps auth_failed to "Fail"
  • The 5-second polling correctly stops on "Fail"
  • However, the failure detection logic fires upgradeFailedAttemptRegistered
  • This triggers two metadata writes (upgrade_failed_attempts_count + upgrade_failed_payment_ids)
  • Metadata updates propagate through Effector stores, triggering reactive samples that re-check purchase status
  • This creates a synchronous reactive cascade — no interval, no delay — generating hundreds of requests per second
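The loop described above can be reconstructed as a minimal sketch. This is plain TypeScript standing in for the actual Effector wiring — the function names and the 1,000-iteration cap are illustrative assumptions, not the production code:

```typescript
// Hypothetical reconstruction of the cascade (plain TypeScript, not the
// actual Effector code): a "Fail" status triggers a metadata write, and the
// resulting store update synchronously re-triggers the status check, with
// no interval or delay anywhere in the chain.

type PaymentStatus = "Success" | "Pending" | "Fail";

let requestCount = 0;
let failedAttempts = 0;

// Stands in for GET /api/billing/upgrade/check-status
function checkUpgradeStatus(): PaymentStatus {
  requestCount++;
  return "Fail"; // both products are stuck in auth_failed
}

// Stands in for the metadata write (upgrade_failed_attempts_count, etc.)
// via /api/user/preferences; the callback models the reactive store update.
function writeFailureMetadata(onMetadataChanged: () => void): void {
  requestCount++;
  failedAttempts++;
  onMetadataChanged(); // store change fires subscribers synchronously
}

// The cascade: status check -> metadata write -> store update -> status
// check again. Capped at 1000 iterations here purely so the demo ends.
function runCascade(maxIterations: number): void {
  let iterations = 0;
  const recheck = (): void => {
    if (iterations++ >= maxIterations) return;
    if (checkUpgradeStatus() === "Fail") {
      writeFailureMetadata(recheck); // synchronous re-entry
    }
  };
  recheck();
}

runCascade(1000);
console.log(requestCount); // two API requests per iteration: 2000
```

Every iteration costs two backend requests, which matches the near 50/50 split between the two dominant endpoints in the traffic table.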

Key Code Locations

File                                 Lines      Description
modules/helpers.ts                   302-336    Payment status mapping (auth_failed -> "Fail")
modules/upgrade/model/index.ts       20-22      Constants: MAX_UPGRADE_FAIL_ATTEMPTS = 5
modules/upgrade/model/index.ts       470-481    Metadata refresh on route events
modules/upgrade/model/index.ts       553-615    Failed attempt detection and registration
modules/upgrade/model/index.ts       617-633    Metadata writes on failure (triggers cascade)
modules/upgrade/ui/premium-tools.ts  300-313    Source-based reactive check (fires on store changes)
backend/billing/gateway_handlers.go  2671-2695  Backend handler (no rate limiting)

Why MAX_UPGRADE_FAIL_ATTEMPTS = 5 Didn't Help

The limit only prevents new purchase attempts (upgradePurchased event). It does not stop the reactive cascade that continuously checks purchase status and writes metadata. The counter reached 5,392 — far beyond the limit of 5.
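A guard at the registration step would have contained this. The sketch below is an assumption about how such a guard could look — the field names mirror the metadata keys in this report, but the function and its signature are hypothetical, and the real Effector wiring is omitted:

```typescript
// Hypothetical guard: once the counter passes the limit, registration
// becomes a no-op. No metadata write means no store update, so the
// reactive chain has nothing to re-fire on and the cascade dies.

const MAX_UPGRADE_FAIL_ATTEMPTS = 5;

interface UserMetadata {
  upgrade_failed_attempts_count: number;
  upgrade_failed_payment_ids: string[];
}

function registerFailedAttempt(
  meta: UserMetadata,
  paymentId: string
): { meta: UserMetadata; changed: boolean } {
  // The check the current code lacks: past the limit, write nothing.
  if (meta.upgrade_failed_attempts_count >= MAX_UPGRADE_FAIL_ATTEMPTS) {
    return { meta, changed: false };
  }
  return {
    meta: {
      upgrade_failed_attempts_count: meta.upgrade_failed_attempts_count + 1,
      upgrade_failed_payment_ids: meta.upgrade_failed_payment_ids.includes(paymentId)
        ? meta.upgrade_failed_payment_ids
        : [...meta.upgrade_failed_payment_ids, paymentId],
    },
    changed: true,
  };
}

// Repeated failures stop mutating state after five registrations.
let state: UserMetadata = {
  upgrade_failed_attempts_count: 0,
  upgrade_failed_payment_ids: [],
};
let writes = 0;
for (let i = 0; i < 100; i++) {
  const result = registerFailedAttempt(state, "2c34****");
  if (result.changed) writes++;
  state = result.meta;
}
console.log(writes, state.upgrade_failed_attempts_count); // 5 5
```

The key property is that the guarded function is idempotent beyond the limit: with the counter at 5,392 this would have silenced the loop on the very first re-check.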

Impact

  • CPU: Node at 78%, api-server at 511m CPU from a single user
  • Latency: Bimodal response times — some requests at ~30-40ms, others at ~600-700ms, indicating backend struggling under load
  • Risk: api-server runs as a single replica. Sustained load could degrade service for all users
  • Database: Each loop iteration writes to user_metadata table and reads from purchase history — sustained ~218 queries/second from one user

Recommended Remediation

  • Immediate: Break the reactive cascade — add a guard in the failed attempt registration logic to stop re-checking once failed_attempts_count >= MAX_FAIL_ATTEMPTS
  • Short-term: Add debounce/throttle to the metadata write samples to prevent tight loops
  • Short-term: Add per-user rate limiting on /api/user/preferences and /api/billing/upgrade/check-status in the backend
  • Medium-term: Review all Effector reactive chains for potential cascading loops
  • Medium-term: Consider scaling api-server to multiple replicas for resilience
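The backend handler is Go, but the shape of the recommended per-user rate limit is language-agnostic; here is a token-bucket sketch in TypeScript. The class name and the burst/refill figures are illustrative assumptions, not proposed production values:

```typescript
// Per-user token bucket: each user gets a burst allowance that refills at a
// fixed sustained rate. A runaway client is capped; well-behaved users are
// unaffected because their buckets stay near full.

interface Bucket {
  tokens: number;
  lastRefill: number; // ms timestamp
}

class PerUserRateLimiter {
  private buckets = new Map<string, Bucket>();

  constructor(
    private capacity: number,        // burst size
    private refillPerSecond: number  // sustained allowed rate
  ) {}

  allow(userId: string, nowMs: number): boolean {
    const bucket = this.buckets.get(userId) ?? {
      tokens: this.capacity,
      lastRefill: nowMs,
    };
    // Refill proportionally to elapsed time, capped at capacity.
    const elapsedSec = (nowMs - bucket.lastRefill) / 1000;
    bucket.tokens = Math.min(
      this.capacity,
      bucket.tokens + elapsedSec * this.refillPerSecond
    );
    bucket.lastRefill = nowMs;

    const ok = bucket.tokens >= 1;
    if (ok) bucket.tokens -= 1;
    this.buckets.set(userId, bucket);
    return ok;
  }
}

// A client firing 218 requests in the same instant (the observed rate)
// gets only the burst through; everything else is rejected.
const limiter = new PerUserRateLimiter(10, 2); // 10 burst, 2 req/s sustained
let allowed = 0;
for (let i = 0; i < 218; i++) {
  if (limiter.allow("opera-gx-user", 0)) allowed++;
}
console.log(allowed); // 10
```

In the Go handler this would typically sit as middleware in front of the two hot endpoints, keyed on the authenticated user ID rather than IP, since all of this traffic arrived through a handful of Cloudflare proxy IPs.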

Want this level of investigation for your infrastructure?

Book a Demo →