Yash Kumar Lal Das - Software Engineer | Cloud-Native & DevOps (Python)

Cloud Engineer

I build and operate reliable cloud-native systems with automation, observability, and infrastructure as code. 3+ years in 24×7 production at TCS, plus hands-on systems and reporting work at UNT Libraries.

Software Engineer · Python Developer · Cloud Engineer (AWS) · DevOps Engineer · Site Reliability Engineer (SRE) · Infrastructure Automation · Terraform + AWS IaC · Linux (Production)

About

I'm a cloud- and DevOps-focused engineer with 3+ years of hands-on experience supporting and improving production systems in 24×7 environments. My background combines Linux operations, Python automation, cloud infrastructure, and reliability engineering practices.

I have worked extensively with monitoring, alert triage, incident response, and root-cause analysis in distributed systems. I focus on building observable, scalable, and resilient infrastructure using AWS services, Terraform, Kubernetes, and CI/CD pipelines.

In addition to operations, I build practical engineering projects involving serverless architectures, Kubernetes observability, and secure cloud networking to deepen my platform and automation expertise.

Open to Junior Cloud, DevOps, Linux, and Platform Engineering opportunities.

Master of Science in Information Systems & Technologies

University of North Texas

Data Analyst · Denton, Texas, USA

Systems Engineer

Tata Consultancy Services (TCS)

24×7 Production Systems · Python Automation · Linux · Monitoring & Incident Response

Projects

AWS Serverless Activity Ingestion

Problem: Needed scalable processing for high-volume activity data.
Solution: Built an event-driven AWS Lambda pipeline that writes to RDS, DynamoDB, and S3, with logging and monitoring.
Result: Delivered a fault-tolerant ingestion system handling burst traffic reliably.

AWS Lambda · Python · RDS (MySQL) · DynamoDB · S3 · IAM · VPC · CloudWatch · Serverless Architecture · Event-Driven Design

Kubernetes Observability & Auto-Scaling Platform

Problem: Services lacked visibility and required manual scaling.
Solution: Implemented Prometheus, Grafana, and HPA with load and failure testing.
Result: Achieved automated scaling and improved system reliability.
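
As a rough illustration of the scaling behavior described above, the sketch below applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance of 1.0. The function name, the replica bounds, and the sample metric values are assumptions for illustration only, not settings taken from this project.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Approximate the HPA scaling decision for a single metric.

    min_replicas, max_replicas, and tolerance are illustrative defaults
    (0.1 matches the documented default HPA tolerance).
    """
    ratio = current_metric / target_metric
    # Within the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Hypothetical example: 4 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60))
```

Under load testing, a calculation like this is what drives the replica count up as the observed metric rises above the target, and back down once it falls.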

Kubernetes · Docker · Prometheus · Grafana · HPA · Helm · Linux · Auto Scaling · Metrics Monitoring · Reliability Engineering

Secure AWS VPC 2-Tier Architecture

Problem: Required secure cloud network segmentation and controlled access.
Solution: Designed a public/private VPC with NAT, routing controls, and layered security.
Result: Established isolated, secure architecture validated through testing.
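
A minimal sketch of the public/private split described above, using only Python's standard `ipaddress` module. The VPC CIDR (10.0.0.0/16), the /24 subnet size, and the `igw`/`nat` target labels are hypothetical placeholders, not values from the actual project; the point is only that the public tier routes its default traffic to an internet gateway while the private tier routes it through NAT.

```python
import ipaddress


def plan_two_tier(vpc_cidr="10.0.0.0/16", new_prefix=24):
    """Carve one public and one private subnet out of a VPC CIDR.

    The CIDR and prefix length are illustrative assumptions.
    """
    vpc = ipaddress.ip_network(vpc_cidr)
    subnets = vpc.subnets(new_prefix=new_prefix)
    public, private = next(subnets), next(subnets)
    return {
        "public": {
            "cidr": public,
            # Default route goes straight to the internet gateway.
            "routes": {str(vpc): "local", "0.0.0.0/0": "igw"},
        },
        "private": {
            "cidr": private,
            # Outbound-only internet access via the NAT device.
            "routes": {str(vpc): "local", "0.0.0.0/0": "nat"},
        },
    }


if __name__ == "__main__":
    for tier, cfg in plan_two_tier().items():
        print(tier, cfg["cidr"], cfg["routes"])
```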

VPC · Subnets · Route Tables · IGW · NAT · Security Groups · NACLs · Network Isolation · Secure Architecture

AI-Powered Project Planning Platform (PartyRock/Bedrock)

Problem: Manual planning was slow and inconsistent.
Solution: Used Amazon Bedrock to generate structured project plans from prompts.
Result: Reduced planning time and improved documentation consistency.

Amazon Bedrock · PartyRock · Prompting · Python · GenAI Workflow

Experience

"Reduced manual processing effort by ~40% through automation, documentation, and standardized workflows."

UNT — University of North Texas

Data Analyst, Part-Time

Standardized reporting and data workflows for academic services with reliable, repeatable processes.

• Automated reporting workflows using Excel and Ref Analytics, delivering 70+ reports over 18 months and cutting turnaround time by ~40%.
• Produced 4 recurring reports per month by standardizing formats for 6 departments.
• Monitored web applications via Siteimprove to track compliance and accessibility.
• Built dashboards tracking usage trends, improving visibility and reducing manual work.

Excel · Ref Analytics · Siteimprove · Data Workflows · Reporting Automation

"Improved incident response efficiency and system reliability by standardizing monitoring, automation, and operational procedures."

TCS — Tata Consultancy Services

Systems Engineer → Cloud & DevOps Engineer

Supported large-scale production systems focused on reliability and automation.

• Supported 24×7 cloud production systems, maintaining ~98% SLA.
• Triaged 150+ alerts per month, reducing MTTR by ~25%.
• Automated operational checks using Python, cutting investigation effort by ~40%.
• Documented root-cause analyses (RCA), contributing to ~15% fewer repeat issues.

AWS · Python · Linux · Incident Response · Observability · RCA · Production Support

Skills

Cloud & Infrastructure

AWS (EC2, S3, VPC, Route 53, IAM)

CloudFront

Networking

DNS

Terraform (IaC)

VPC Architecture

IAM Policies

Network Security

DevOps & Automation

CI/CD (Jenkins, GitHub Actions)

automation tooling

repeatable deployments

version control

CI/CD Pipelines

GitHub Actions

Infrastructure Automation

Containers & Reliability

Docker

Kubernetes (EKS)

Helm

autoscaling

rolling updates

deployment strategies

High Availability

Failure Recovery

Observability & Operations

CloudWatch

Prometheus

Grafana

logging/alerting

incident response

runbooks

postmortems

MTTR Reduction

Alert Tuning

SLA Monitoring

Software Engineering & Automation

Python automation

Bash scripting

Git workflows

production-grade tooling for operations and infrastructure

Backend Automation

Operational Tooling

Contact

Dallas, USA

laldasyash@gmail.com · +1 9405971297 · LinkedIn · GitHub
