INC-2026-01-29

Upgrade Payment Polling Loop Causing CPU Spike

Date: 2026-01-29
Severity: High
Status: Resolved
Affected Service: api-server
Detected By: Observability system (CPU alert)
Client: ACME Corp (EdTech platform)

Summary

A single user with a failed upgrade payment caused an Effector reactive cascade in the frontend, generating ~218 requests per second to two backend API endpoints. This drove the api-server pod to 511m CPU and the production node to 78% CPU utilization.

Timeline

  • ~15:00 UTC — Increased CPU usage observed on production node
  • 15:03 UTC — Investigation began. Node at 78% CPU (1492m/1920m)
  • 15:03 UTC — api-server pod identified as top consumer at 511m CPU
  • 15:03 UTC — Nginx ingress logs analyzed: ~218 RPS to dynamic endpoints
  • 15:03 UTC — Two endpoints identified as 87% of all traffic
  • 15:03 UTC — Single Opera GX user-agent responsible for 84% of requests
  • 15:05 UTC — App-backend logs confirmed single user from Portugal

Investigation Details

Node Status

Node              CPU           Memory
prod-node-k8s-07  78% (1492m)   68% (4616Mi)
prod-node-k8s-08  12% (238m)    66% (4422Mi)

Pod CPU on Affected Node

Pod              CPU
api-server       511m
queue-worker     364m
landing-service  44m
web-app          22m

Note: api-server runs as a single replica with no redundancy.

Traffic Analysis

Total dynamic RPS (excluding frontend static): ~258 avg, 411 peak

Endpoint                           Requests/min    % of total
/api/user/preferences              7,152 (~119/s)  47.5%
/api/billing/upgrade/check-status  5,951 (~99/s)   39.5%
/api/telemetry/events/log          631 (~11/s)     4.2%
Everything else                    1,335           8.8%

User-Agent Breakdown

User-Agent                            Requests  %
Chrome/142 Opera GX/126 (Windows 10)  16,191    83.8%
All other UAs combined                3,125     16.2%

The Opera GX user-agent sent requests to only two endpoints (/api/user/preferences and /api/billing/upgrade/check-status), with zero requests to frontend/static resources — confirming this is not normal browser behavior.

Cloudflare Analysis

Only 4 Cloudflare proxy IPs observed for the Opera UA, indicating traffic from 1–2 Cloudflare edge locations (likely a single client):

Cloudflare IP  Requests
172.71.xx.77   4,879
172.71.xx.78   4,483
172.71.xx.40   3,556
172.71.xx.39   3,273

Identified User

Country: Portugal (CF-IPCountry: PT)
Cloudflare Edge: AMS (Amsterdam)
Browser: Opera GX 126 (Chromium 142), Windows 10
Page: /workspace
Failed Attempts Counter: 5,392 (and rising)

Payment Status

The user has two upgrade products with auth_failed payment status:

Product ID                            Order Status  Transaction Status
2c34****-7f23-****-abc2-****e0266e8b  auth_failed   none
b481****-7a78-****-b211-****2b08f351  auth_failed   none

Root Cause Analysis

The Bug

The frontend uses Effector (reactive state management). A reactive cascade creates a tight loop with no delay:

  • Route opened (/workspace) triggers checkUpgradePurchasedFx
  • Response returns auth_failed status for both products
  • checkUpgradeStatus() correctly maps auth_failed to "Fail"
  • The 5-second polling correctly stops on "Fail"
  • However, the failure detection logic fires upgradeFailedAttemptRegistered
  • This triggers two metadata writes (upgrade_failed_attempts_count + upgrade_failed_payment_ids)
  • Metadata updates propagate through Effector stores, triggering reactive samples that re-check purchase status
  • This creates a synchronous reactive cascade — no interval, no delay — generating hundreds of requests per second
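The loop above can be modeled in plain TypeScript (this is an illustrative sketch, not the actual Effector code; all names here are assumed). A synchronous store update notifies its watchers immediately, and a watcher that re-triggers the status check closes the loop with no interval in between:

```typescript
// Minimal model of the cascade: store.set() synchronously fans out to
// watchers, and the watcher re-triggers the check that wrote the store.
type Listener = () => void;

class Store<T> {
  private listeners: Listener[] = [];
  constructor(private value: T) {}
  get(): T { return this.value; }
  set(v: T): void {
    this.value = v;
    this.listeners.forEach((l) => l()); // synchronous notification
  }
  watch(l: Listener): void { this.listeners.push(l); }
}

let requestCount = 0;
const failedAttempts = new Store(0);

function checkUpgradeStatus(): void {
  requestCount += 1;              // one backend request per invocation
  const status = "auth_failed";   // the payment keeps failing
  if (status === "auth_failed") {
    // metadata write -> store update -> reactive re-check
    failedAttempts.set(failedAttempts.get() + 1);
  }
}

// The reactive sample: any metadata change re-checks purchase status.
// The cap exists only so this demo terminates; production had none.
failedAttempts.watch(() => {
  if (requestCount < 1000) checkUpgradeStatus();
});

checkUpgradeStatus();
console.log(requestCount); // → 1000 from a single trigger
```

One route open is enough to start the loop; everything after that happens back-to-back in the same tick, which is why the request rate is bounded only by network round-trip time.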

Key Code Locations

FileLinesDescription
modules/helpers.ts302-336Payment status mapping (auth_failed -> "Fail")
modules/upgrade/model/index.ts20-22Constants: MAX_UPGRADE_FAIL_ATTEMPTS = 5
modules/upgrade/model/index.ts470-481Metadata refresh on route events
modules/upgrade/model/index.ts553-615Failed attempt detection and registration
modules/upgrade/model/index.ts617-633Metadata writes on failure (triggers cascade)
modules/upgrade/ui/premium-tools.ts300-313Source-based reactive check (fires on store changes)
backend/billing/gateway_handlers.go2671-2695Backend handler (no rate limiting)

Why MAX_UPGRADE_FAIL_ATTEMPTS = 5 Didn't Help

The limit only prevents new purchase attempts (upgradePurchased event). It does not stop the reactive cascade that continuously checks purchase status and writes metadata. The counter reached 5,392 — far beyond the limit of 5.
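The asymmetry can be sketched as follows (function names are hypothetical, chosen to mirror the two code paths; they are not from the actual codebase). The limit gates the purchase path, but the status-check path increments the counter unconditionally:

```typescript
const MAX_UPGRADE_FAIL_ATTEMPTS = 5;

// This check exists: a 6th purchase attempt is blocked.
function canStartNewPurchase(failedAttempts: number): boolean {
  return failedAttempts < MAX_UPGRADE_FAIL_ATTEMPTS;
}

// No equivalent check here: every failed status check still increments
// the counter and triggers metadata writes, which re-trigger the status
// check — hence a counter of 5,392 against a limit of 5.
function onStatusCheckFailed(failedAttempts: number): number {
  return failedAttempts + 1;
}
```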

Impact

  • CPU: Node at 78%, api-server at 511m CPU from a single user
  • Latency: Bimodal response times — some requests at ~30-40ms, others at ~600-700ms, indicating the backend was struggling under load
  • Risk: api-server runs as a single replica. Sustained load could degrade service for all users
  • Database: Each loop iteration writes to user_metadata table and reads from purchase history — sustained ~218 queries/second from one user

Recommended Remediation

  • Immediate: Break the reactive cascade — add a guard in the failed attempt registration logic to stop re-checking once failed_attempts_count >= MAX_FAIL_ATTEMPTS
  • Short-term: Add debounce/throttle to the metadata write samples to prevent tight loops
  • Short-term: Add per-user rate limiting on /api/user/preferences and /api/billing/upgrade/check-status in the backend
  • Medium-term: Review all Effector reactive chains for potential cascading loops
  • Medium-term: Consider scaling api-server to multiple replicas for resilience
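The immediate fix amounts to a guard predicate on the reactive re-check (in Effector terms, a filter on the sample that re-checks purchase status). The sketch below models only that predicate, with assumed names and the existing 5-second polling interval as the throttle floor:

```typescript
const MAX_FAIL_ATTEMPTS = 5;
const POLL_INTERVAL_MS = 5_000;

// Returns true only if a status re-check should fire: the failure
// limit has not been reached AND the polling interval has elapsed.
function shouldRecheck(
  failedAttemptsCount: number,
  lastCheckMs: number,
  nowMs: number
): boolean {
  if (failedAttemptsCount >= MAX_FAIL_ATTEMPTS) return false; // break the cascade
  return nowMs - lastCheckMs >= POLL_INTERVAL_MS; // throttle metadata-driven re-checks
}
```

The limit check breaks the cascade outright; the interval check is defense in depth, so that even a future reactive chain that bypasses the counter cannot exceed the intended polling rate.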
