Sarvesh Mishra

Lead Platform Engineer | DevOps / MLOps | Site Reliability Engineer

Lead Platform Engineer with 4+ years of experience architecting scalable infrastructure, MLOps pipelines, and DevOps automation at CarTrade Tech (CarWale). Currently leading platform engineering initiatives — driving quarterly infra planning, setting architectural standards, and mentoring engineers. Reduced infrastructure costs by 38%, cut LLM expenditure by 30%, and saved $24K/year by migrating org-wide on-call tooling in-house. Google Cloud Professional Cloud Architect certified.


Certifications

Professional Cloud Architect

Google Cloud

Technical Skills

Cloud & Infrastructure

AWS
GCP
Kubernetes
Terraform
S3
IAM
ARM Architecture
Karpenter

DevOps & CI/CD

Jenkins
GitHub Actions
Argo CD
Azure DevOps
Docker

Programming Languages

Go
Python
JavaScript
React
.NET

Databases

MySQL
MongoDB
PostgreSQL

Observability

Grafana Stack
Loki
Tempo
Grafana Faro
Grafana OnCall
Distributed Tracing

Messaging & Caching

Kafka
RabbitMQ
Redis
Memcached
Elasticsearch

AI / MLOps

MLflow AI Gateway
Agentic AI
n8n
Claude Code
Cursor
OpenAI
Hugging Face
Airflow

Platforms & Tools

Backstage
KRR
OpenCost
Jira
Confluence

Soft Skills

Technical Leadership
Mentorship
System Design
Communication
Problem Solving

Work Experience

Lead Platform Engineer

Mumbai, India
Apr 2026 – Present
Technical Leadership

Lead platform engineering initiatives across CarTrade Tech; drive quarterly and annual infra planning, define architectural standards in collaboration with engineering managers, and review all infra PRs and Terraform designs before production deployment.

Terraform, Kubernetes, AWS, Leadership
🚀 Setting org-wide architectural standards
Hiring & Interviews

Conduct system design interviews and DSA technical screenings for engineering candidates; contribute to hiring decisions for platform and backend engineering roles.

System Design, DSA, Hiring
🚀 Contributing to platform team growth
Mentorship

Mentor and guide an associate platform engineer on Kubernetes operations, IaC best practices, and incident response — accelerating their ramp-up on production systems.

Kubernetes, Terraform, IaC, Mentorship
🚀 Accelerating engineer ramp-up on production systems

Platform Engineer / Software Engineer

Mumbai, India
Jun 2022 – Mar 2026
Kubernetes Cluster Management

Own and operate three EKS clusters (dev/staging/prod) across an AWS multi-account architecture with IAM Identity Center and Control Tower; support 60+ microservices and 6 web apps spanning the Mumbai (3 AZ) and Hyderabad regions for 80+ engineers.

EKS, AWS, Kubernetes, Multi-Account
🚀 Supporting 60+ microservices and 6 web apps for 80+ engineers
Kubernetes Upgrades & Node Lifecycle

Execute bi-annual zero-downtime Kubernetes version upgrades; oversee node group lifecycle and Karpenter-based autoscaling to continuously balance reliability and cost efficiency across environments.

Kubernetes, Karpenter, EKS, AWS
🚀 Zero-downtime upgrades across all environments
Karpenter & Custom CRDs

Implemented AWS Karpenter with custom CRDs for intelligent node provisioning, replacing managed node groups; improved cluster bin-packing efficiency and reduced over-provisioning cost.

Karpenter, Kubernetes, AWS, CRDs
🚀 Improved cluster bin-packing and reduced over-provisioning
Multi-Region DR & High Availability

Architected and maintain an active-passive DR setup across the Mumbai (3 AZ) and Hyderabad regions; conduct regular DR drills and failover validation to meet RPO/RTO targets.

AWS, Multi-Region, DR, High Availability
🚀 Meeting RPO/RTO targets with regular DR drills
Kubernetes Cost Optimization

Integrated KRR and OpenCost for continuous right-sizing of CPU/memory requests across all workloads; achieved 38% reduction in infrastructure spend while maintaining 99.99% uptime.

Kubernetes, KRR, OpenCost
🚀 Reduced infrastructure costs by 38% while maintaining 99.99% uptime
Helm Chart Management

Author and maintain Helm charts for platform tooling including Grafana Stack, Airflow, and internal services; standardized chart structure and release process across dev/staging/prod environments.

Helm, Kubernetes, Grafana, Airflow
🚀 Standardized chart structure and release process
RBAC & Access Control

Enforce Kubernetes RBAC policies and maintain LDAP/Active Directory integration for RabbitMQ, OpenSearch, and Redis; apply least-privilege access across AWS multi-account environments via IAM Identity Center and Control Tower.

RBAC, Kubernetes, IAM, AWS
🚀 Least-privilege access enforcement across all environments
Incident Management & SRE On-Call

Participate in bi-weekly on-call rotation for platform reliability; triage and resolve OOMKills, crashloops, pod evictions, and infra incidents across 60+ microservices and 6 web apps.

SRE, Kubernetes, Incident Management, Grafana
🚀 Maintaining platform reliability across 60+ microservices
MLOps AI Gateway

Architected the migration from AWS AI Gateway to the open-source MLflow AI Gateway; implemented multi-provider traffic routing and token-usage governance.

AWS, MLflow, MLOps, AI
🚀 Reduced annual LLM costs by 30%
n8n Workflow Automation Platform

Deployed and productionized the n8n platform, enabling non-technical stakeholders (PMs) to construct ML workflows via drag-and-drop nodes, eliminating cross-team dependencies and ML team intervention.

n8n, Automation, CI/CD
🚀 Eliminated engineering dependencies for workflow creation
Agentic AI Automation

Engineered an agentic AI layer in Jenkins to autonomously execute multi-repo operations — PR creation, dependency updates, testing, and linting.

Jenkins, AI, Automation, CI/CD
🚀 Reduced developer overhead by 20%
Terraform Infrastructure Management

Administer IaC for two VPCs (dev/staging and production) spanning multiple subnets, EC2 fleets, EKS clusters, and AWS services; maintain S3 remote state backend with DynamoDB locking for safe concurrent operations.

Terraform, AWS, IaC, EKS
🚀 Managing full IaC lifecycle for two VPCs
GitOps Infra Workflow

Enforce a PR-based infrastructure change workflow via GitHub Actions, with Terraform plan output posted as PR comments; all production changes are peer-reviewed before apply, reducing config drift and deployment risk.

GitHub Actions, Terraform, GitOps, CI/CD
🚀 Reduced config drift and deployment risk
ARM Architecture Migration

Spearheaded migration of compute workloads to AWS Graviton ARM instances, cutting monthly infrastructure spend and improving backend service performance.

AWS, ARM, EC2, Graviton
🚀 Reduced monthly spend by 30% and improved performance by 20%
In-House OnCall App

Built a React + Go application from scratch, replicating OpsGenie-grade scheduling, escalations, and notifications; migrated org-wide on-call management in-house.

React, Go, OpsGenie
🚀 Saved $24K/year
Backstage Developer Portal

Established Spotify Backstage as a centralized developer portal, cutting onboarding time and improving developer productivity across 80+ engineers.

Backstage, TypeScript, Internal Tools
🚀 Reduced onboarding time by 40% and improved productivity by 25%
Monitoring & Observability

Established comprehensive monitoring, logging, and alerting systems using Grafana Stack with OnCall integration and distributed tracing, reducing incident response time by 30%.

Grafana, OnCall, Distributed Tracing, Loki
🚀 Reduced incident response time by 30%
Kafka Migration

Led migration from legacy RabbitMQ to Kafka and created an internal messaging library, improving data delivery reliability and reducing consumer lag.

Kafka, RabbitMQ, Distributed Systems
🚀 Reduced consumer lag by 22% and improved reliability by 15%
Rate Limiter Service

Launched a Redis-backed distributed rate limiter to throttle abusive traffic and monetize APIs, increasing revenue and improving API health score.

Redis, Distributed Systems, API
🚀 Increased revenue by 18% and improved health score by 25%
Frontend Telemetry (Grafana Faro)

Integrated Grafana Faro for real-time client-side telemetry across React frontends, enabling end-to-end tracing from UI to backend.

Grafana Faro, React, Web Vitals, Distributed Tracing
🚀 Improved frontend issue resolution by 35%
Internal Chatbot

Shipped a GPT-powered internal chatbot for calling agents, improving customer engagement and reducing support ticket resolution time.

GPT, Python, AI
🚀 Improved engagement by 12% and reduced resolution time by 5%
SSR Performance Optimisation

Re-architected the SSR service into a Dockerized Node.js renderer with Redis caching; reduced cold-start latency, improved TTFB, and cut backend CPU usage.

React, Docker, Node.js, Redis, C#
🚀 Reduced cold-start latency by 42%, improved TTFB by 35%, cut CPU usage by 20%

Education

Masai School

Certificate in Full-Stack Web Development (MERN Stack)

Oct 2021 – Jun 2022
Bangalore, India

Dr. A.P.J. Abdul Kalam Technical University

Bachelor of Technology — Mechanical Engineering

Jun 2014 – Jun 2018
Lucknow, India

Achievements

  • Rockstar Team of the Year — Annual Award 2025, CarTrade Tech.
  • Best Performer of the Year — Annual Award 2024, CarTrade Tech.
  • 1st Place, Internal Hackathon 2023 — Chaos Testing Integration Project.
  • Best Debutant of the Year — Annual Award 2023, CarTrade Tech.