Yash Kumar Lal Das

Cloud Engineer

I build and operate reliable cloud-native systems with automation, observability, and infrastructure as code. 3+ years in 24×7 production at TCS, plus hands-on systems and reporting work at UNT Libraries.

Software Engineer · Python Developer · Cloud Engineer (AWS) · DevOps Engineer · Site Reliability Engineer (SRE) · Infrastructure Automation · Terraform + AWS IaC · Linux (Production)

About

I'm a cloud and DevOps-focused engineer with 3+ years of hands-on experience supporting and improving production-grade systems in 24×7 environments. My background combines Linux operations, Python automation, cloud infrastructure, and reliability engineering practices.

I have worked extensively with monitoring, alert triage, incident response, and root-cause analysis in distributed systems. I focus on building observable, scalable, and resilient infrastructure using AWS services, Terraform, Kubernetes, and CI/CD pipelines.

In addition to operations, I build practical engineering projects involving serverless architectures, Kubernetes observability, and secure cloud networking to deepen my platform and automation expertise.

Open to Junior Cloud, DevOps, Linux, and Platform Engineering opportunities.

Master of Science in Information Systems & Technologies
University of North Texas
Data Analyst · Denton, Texas, USA

Systems Engineer
Tata Consultancy Services (TCS)
24×7 Production Systems · Python Automation · Linux · Monitoring & Incident Response

Projects

AWS Serverless Activity Ingestion

Problem: Needed scalable processing for high-volume activity data.
Solution: Built an event-driven AWS Lambda pipeline that writes to RDS, DynamoDB, and S3, with logging and monitoring.
Result: Delivered a fault-tolerant ingestion system handling burst traffic reliably.

AWS Lambda · Python · RDS (MySQL) · DynamoDB · S3 · IAM · VPC · CloudWatch · Serverless Architecture · Event-Driven Design

Kubernetes Observability & Auto-Scaling Platform

Problem: Services lacked visibility and required manual scaling.
Solution: Deployed Prometheus and Grafana for metrics and dashboards, configured the Horizontal Pod Autoscaler (HPA), and validated the setup with load and failure testing.
Result: Achieved automated scaling and improved system reliability.

Kubernetes · Docker · Prometheus · Grafana · HPA · Helm · Linux · Auto Scaling · Metrics Monitoring · Reliability Engineering
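
For readers unfamiliar with how the HPA decides on a replica count, here is a minimal sketch of the scaling rule described in the Kubernetes documentation: the desired count is the current count multiplied by the ratio of the observed metric to the target, rounded up. It is an illustration only, not code from this project; the function name, the replica bounds, and the sample numbers are hypothetical, and the 0.1 tolerance is the documented default.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA rule: ceil(current * observed / target).

    min_replicas and max_replicas are illustrative bounds, not values
    taken from the project.
    """
    ratio = current_metric / target_metric
    # Within the tolerance band the autoscaler leaves the count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp the result to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # 4 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```

With 4 pods at 90% average CPU and a 60% target, the ratio is 1.5, so the sketch asks for 6 pods.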

Secure AWS VPC 2-Tier Architecture

Problem: Required secure cloud network segmentation and controlled access.
Solution: Designed a VPC with public and private subnets, NAT for outbound access, routing controls, and layered security (security groups and NACLs).
Result: Established isolated, secure architecture validated through testing.

VPC · Subnets · Route Tables · IGW · NAT · Security Groups · NACLs · Network Isolation · Secure Architecture
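
The public/private split can be sketched with Python's standard `ipaddress` module: carve the VPC range into equal blocks and assign one to each tier. This only illustrates the address planning. The `10.0.0.0/16` range, the `/24` block size, and the `plan_subnets` helper are assumptions, not the project's actual values.

```python
import ipaddress


def plan_subnets(vpc_cidr, new_prefix=24):
    """Split a VPC CIDR and take the first two blocks as the two tiers.

    The CIDR and prefix length are hypothetical example values.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=new_prefix)
    return {"public": next(blocks), "private": next(blocks)}


if __name__ == "__main__":
    plan = plan_subnets("10.0.0.0/16")
    for tier, block in plan.items():
        print(tier, block)
    # Sanity check: the two tiers must not share addresses.
    assert not plan["public"].overlaps(plan["private"])
```

In a real deployment the public block would route to the internet gateway and the private block to the NAT. That routing is outside the scope of this sketch.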

AI-Powered Project Planning Platform (PartyRock/Bedrock)

Problem: Manual planning was slow and inconsistent.
Solution: Used Amazon Bedrock to generate structured project plans from prompts.
Result: Reduced planning time and improved documentation consistency.

Amazon Bedrock · PartyRock · Prompting · Python · GenAI Workflow

Experience

"Reduced manual processing effort by ~40% through automation, documentation, and standardized workflows."

UNT — University of North Texas

Data Analyst, Part-Time

Standardized reporting and data workflows for academic services with reliable, repeatable processes.

  • Automated reporting workflows using Excel and Ref Analytics, delivering 70+ reports over 18 months and cutting turnaround time by ~40%.
  • Produced 4 recurring reports per month, standardizing formats across 6 departments.
  • Monitored web applications via Siteimprove, ensuring compliance and accessibility.
  • Built dashboards tracking usage trends, improving visibility and reducing manual work.
Excel · Ref Analytics · Siteimprove · Data Workflows · Reporting Automation

"Improved incident response efficiency and system reliability by standardizing monitoring, automation, and operational procedures."

TCS — Tata Consultancy Services

Systems Engineer → Cloud & DevOps Engineer

Supported large-scale production systems focused on reliability and automation.

  • Supported 24×7 cloud production systems, maintaining ~98% SLA.
  • Triaged 150+ alerts per month, reducing MTTR by ~25%.
  • Automated operational checks using Python, cutting investigation effort by ~40%.
  • Documented root-cause analyses (RCA), contributing to ~15% fewer repeat issues.
AWS · Python · Linux · Incident Response · Observability · RCA · Production Support

Skills

Cloud & Infrastructure
AWS (EC2, S3, VPC, Route 53, IAM), CloudFront, Networking, DNS, Terraform (IaC)
VPC Architecture · IAM Policies · Network Security

DevOps & Automation
CI/CD (Jenkins, GitHub Actions), automation tooling, repeatable deployments, version control
CI/CD Pipelines · GitHub Actions · Infrastructure Automation

Containers & Reliability
Docker, Kubernetes (EKS), Helm, autoscaling, rolling updates, deployment strategies
High Availability · Failure Recovery

Observability & Operations
CloudWatch, Prometheus, Grafana, logging/alerting, incident response, runbooks, postmortems
MTTR Reduction · Alert Tuning · SLA Monitoring

Software Engineering & Automation
Python automation, Bash scripting, Git workflows, production-grade tooling for operations and infrastructure
Backend Automation · Operational Tooling

Contact

© 2026 Yash Kumar Lal Das – Software Engineer | Cloud-Native & DevOps (Python)