Overview
A developer-focused SaaS platform built on FastAPI, featuring async background job processing, subscription billing via Stripe, and OAuth2-based authentication. Designed for high throughput and operational reliability.
Role: Application Developer — full ownership of backend architecture, billing integration, auth system, and performance optimization.
Problem
The platform needed to handle compute-intensive background tasks (code analysis, generation) reliably, manage subscription billing with usage-based components, and authenticate users via multiple OAuth2 providers — all while maintaining sub-second response times for the web application.
Constraints
- Background jobs could take 30s–5min; users needed real-time progress feedback
- Stripe webhook handling required idempotent processing
- Multiple OAuth2 providers (GitHub, Google) with unified user model
- PostgreSQL performance at scale with complex queries
Solution
Async Task Pipeline
Built a Celery + Redis task pipeline for background processing:
- Task routing: Different queues for fast (< 5s) and slow (> 30s) tasks
- Retry policies: Exponential backoff with jitter, max 3 retries, dead-letter queue for manual inspection
- Progress tracking: Tasks publish progress updates to Redis pub/sub; the frontend receives them over a Server-Sent Events (SSE) stream
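The routing, retry, and progress rules above can be sketched in plain Python. This is a minimal illustration rather than the project's Celery configuration: the queue names, the `base` and `cap` values, and all function names are assumptions, and tasks between 5s and 30s are sent to the slow queue because the source does not say where they go.

```python
import json
import random

# Hypothetical queue names; the 5-second threshold mirrors the fast/slow split above.
FAST_QUEUE = "fast"
SLOW_QUEUE = "slow"
FAST_THRESHOLD_SECONDS = 5.0
MAX_RETRIES = 3


def pick_queue(expected_seconds: float) -> str:
    """Route short tasks to the fast queue and everything else to the slow queue."""
    return FAST_QUEUE if expected_seconds < FAST_THRESHOLD_SECONDS else SLOW_QUEUE


def retry_delay(attempt: int, base: float = 2.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, ceiling)


def next_step(retries_so_far: int) -> str:
    """Retry until MAX_RETRIES retries have been used, then dead-letter the task."""
    return "retry" if retries_so_far < MAX_RETRIES else "dead_letter"


def format_sse(event: str, payload: dict) -> str:
    """Serialize one progress update as a Server-Sent Events frame."""
    return f"event: {event}\ndata: {json.dumps(payload)}\n\n"


if __name__ == "__main__":
    print(pick_queue(2.0), pick_queue(120.0))
    print([round(retry_delay(n), 2) for n in range(MAX_RETRIES)])
    print(format_sse("progress", {"task_id": "abc", "pct": 40}), end="")
```

In Celery itself, task options such as `autoretry_for`, `retry_backoff`, `retry_jitter`, and `max_retries`, together with the `task_routes` setting, provide comparable behavior without hand-written helpers.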
Stripe Billing Integration
Implemented full subscription lifecycle management:
- Checkout session creation with plan selection
- Webhook handler processing invoice.paid, customer.subscription.updated, and customer.subscription.deleted events
- Idempotent webhook processing using event ID deduplication
- Grace period handling for failed payments
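Grace period handling can be pictured as a small state function. The sketch below is illustrative only: the 7-day window, the state names, and the `access_state` helper are assumptions, since the write-up does not state the actual grace period or data model.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed grace window; the real value is not given in the write-up.
GRACE_PERIOD = timedelta(days=7)


def access_state(payment_failed_at: Optional[datetime], now: datetime) -> str:
    """Map the time of the last failed payment to an access level.

    None means the latest invoice was paid, so the account is fully active.
    """
    if payment_failed_at is None:
        return "active"
    if now - payment_failed_at <= GRACE_PERIOD:
        return "grace"      # keep access while the payment is retried
    return "suspended"      # grace window elapsed without a successful payment


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(access_state(None, now))
    print(access_state(now - timedelta(days=2), now))
    print(access_state(now - timedelta(days=10), now))
```

Keeping the decision in one pure function makes it easy to unit test and to call from both the webhook handler and request-time permission checks.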
OAuth2 Authentication
Unified auth system supporting multiple providers:
- OAuth2 authorization code flow with PKCE
- JWT access tokens (15-min TTL) + refresh tokens (7-day TTL, rotation on use)
- Account linking: users can connect multiple OAuth providers to one account
- Session invalidation on password change or security events
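Two of the mechanisms above lend themselves to a short sketch: the PKCE S256 challenge (defined by RFC 7636) and refresh token rotation. The `RefreshTokenStore` class is a hypothetical in-memory stand-in for a database table. It omits the 7-day expiry for brevity, and its reuse detection (revoking all of a user's tokens when an old token is replayed) is an assumption that goes beyond what the write-up states.

```python
import base64
import hashlib
import secrets


def make_code_verifier() -> str:
    """Random URL-safe verifier; 32 random bytes encode to 43 characters."""
    return secrets.token_urlsafe(32)


def code_challenge_s256(verifier: str) -> str:
    """S256 challenge: base64url(SHA-256(verifier)) without '=' padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


class RefreshTokenStore:
    """Hypothetical in-memory token table; only SHA-256 hashes are stored."""

    def __init__(self):
        self._active = {}  # token hash -> user id
        self._used = {}    # rotated-out token hash -> user id

    @staticmethod
    def _hash(token: str) -> str:
        return hashlib.sha256(token.encode("ascii")).hexdigest()

    def issue(self, user_id: str) -> str:
        token = secrets.token_urlsafe(32)
        self._active[self._hash(token)] = user_id
        return token

    def rotate(self, token: str) -> str:
        """Exchange a refresh token for a new one, invalidating the old one."""
        key = self._hash(token)
        if key in self._used:
            # Assumed policy: a replayed token revokes every session of that user.
            self.revoke_all(self._used[key])
            raise PermissionError("refresh token reuse detected")
        user_id = self._active.pop(key, None)
        if user_id is None:
            raise PermissionError("unknown refresh token")
        self._used[key] = user_id
        return self.issue(user_id)

    def revoke_all(self, user_id: str) -> None:
        """Drop all active tokens for a user, e.g. after a security event."""
        self._active = {k: u for k, u in self._active.items() if u != user_id}


if __name__ == "__main__":
    verifier = make_code_verifier()
    print(verifier, code_challenge_s256(verifier))
```

The client sends the challenge with the authorization request and the verifier with the token request, so an intercepted authorization code is useless without the verifier.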
Deep Dives
Database Performance
As data grew, several queries degraded. Addressed through:
- Composite indexes on frequently filtered columns (user_id + status + created_at)
- Query plan analysis to identify sequential scans on large tables
- Application-level caching with Redis for hot data (user profiles, plan limits)
- Connection pooling via PgBouncer to handle connection spikes
Result: p95 query latency dropped from 120ms to 28ms for critical paths.
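The effect of a composite index on a query plan can be demonstrated with a small script. To keep the example self-contained it uses SQLite from the Python standard library as a stand-in for PostgreSQL; the `jobs` table, its columns, and the index name are hypothetical. On PostgreSQL the equivalent check would be `EXPLAIN` or `EXPLAIN ANALYZE` on the real query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, status TEXT, created_at TEXT)"
)

# A typical filtered listing query: equality on user_id and status, newest first.
QUERY = (
    "SELECT id FROM jobs WHERE user_id = ? AND status = ? "
    "ORDER BY created_at DESC LIMIT 20"
)


def query_plan(connection: sqlite3.Connection) -> str:
    """Return the planner's description of how QUERY would be executed."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, (1, "done")).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the plan detail text


plan_before = query_plan(conn)  # no usable index yet: full table scan

# Equality columns first, then the sort column, matching the query's shape.
conn.execute(
    "CREATE INDEX idx_jobs_user_status_created "
    "ON jobs (user_id, status, created_at)"
)
plan_after = query_plan(conn)   # planner can now search the composite index

if __name__ == "__main__":
    print("before:", plan_before)
    print("after: ", plan_after)
```

Column order matters: placing the equality-filtered columns before the sort column lets the planner narrow to the matching rows and read them in index order.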
Webhook Reliability
Stripe webhooks can be delivered multiple times. Built a robust handler:
- Verify webhook signature (reject invalid payloads immediately)
- Check event ID against processed events table (idempotency)
- Process in a database transaction (atomicity)
- Return 200 before running any side effects (email, notifications), which are deferred until after the response to avoid timeout retries
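The four steps above can be sketched with the standard library alone. This is an assumption-laden illustration, not the project's handler: it uses SQLite in place of PostgreSQL, the table names and `handle_webhook` function are hypothetical, and the signature check is a hand-written version of Stripe's documented scheme (a `t=<timestamp>,v1=<hex>` header carrying an HMAC-SHA256 over `<timestamp>.<payload>`). In practice the official library's `stripe.Webhook.construct_event` would normally perform the verification.

```python
import hashlib
import hmac
import json
import sqlite3
import time


def verify_signature(payload, header, secret, tolerance=300, now=None):
    """Validate a 't=<ts>,v1=<hex>' header against HMAC-SHA256('<ts>.<payload>')."""
    parts = dict(item.split("=", 1) for item in header.split(",") if "=" in item)
    timestamp, signature = parts.get("t"), parts.get("v1")
    if not timestamp or not signature or not timestamp.isdigit():
        return False
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance:  # reject stale or replayed deliveries
        return False
    signed = timestamp.encode() + b"." + payload
    expected = hmac.new(secret.encode(), signed, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), signature.encode())


def make_db():
    """Create the hypothetical tables used by this sketch."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed_events (event_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE billing_log (event_id TEXT, event_type TEXT)")
    return conn


def handle_webhook(conn, payload, header, secret, now=None):
    """Return (http_status, side_effects_to_run_after_responding)."""
    if not verify_signature(payload, header, secret, now=now):
        return 400, []
    event = json.loads(payload)
    try:
        with conn:  # one transaction: the dedup row and the state change commit together
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event["id"],)
            )
            conn.execute(
                "INSERT INTO billing_log (event_id, event_type) VALUES (?, ?)",
                (event["id"], event["type"]),
            )
    except sqlite3.IntegrityError:
        return 200, []  # duplicate delivery: acknowledge and do nothing
    # Side effects are returned, not executed, so the caller can respond first.
    return 200, [("send_receipt_email", event["id"])]
```

Relying on the primary key for deduplication, rather than a separate "have we seen this?" query, means two concurrent deliveries of the same event cannot both pass the check.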
Results
- 40% reduction in p95 API latency through database optimization
- Zero double-charge incidents from idempotent webhook processing
- 99.5% background job success rate with retry and dead-letter handling
- OAuth2 login supporting 3 providers with unified user experience
What I Learned
The biggest lesson was designing for failure from the start. Every external integration (Stripe, OAuth providers, background jobs) can and will fail. Building idempotency, retry logic, and dead-letter queues from day one saved significant debugging and incident response time later.