MSE in Data Science @ University of Pennsylvania

Data Science (AWS) RA @ Wharton • Research in AI & ML • Turning Data into Impactful Decisions

Actively Seeking Summer 2026 Opportunities
Chaitanya Kakade
Philadelphia, USA

About Me

Hi, I'm Chaitanya Kakade, a Master of Science in Engineering (MSE) candidate in Data Science at the University of Pennsylvania (UPenn) with expertise in building scalable machine learning systems and big data pipelines. My research focuses on data-centric AI, production-grade ML systems, Agentic AI, and multimodal learning. I graduated with a B.E. in Computer Engineering from the University of Mumbai with a 4.0 GPA and have authored 10 AI research papers (7 for IEEE conferences; 6 published to date) in Deep Learning, Computer Vision, NLP, and Geoscience.

Seeking Summer 2026 Data Scientist / MLE / Applied Scientist / Data Engineering internships.

Currently, I'm a Graduate Research Assistant at the Wharton School under Dr. Anne Jamison, where I've engineered data mining and ETL pipelines processing large-scale ESG data using Python, SQL, and Apache Spark. I collaborate with PhD researchers globally to validate statistical and machine learning models and deliver reproducible insights for ESG research. Before Penn, I conducted research at Tata Consultancy Services under Dr. Shailesh Deshpande and with research scientists at IIT Bombay, where I implemented machine learning algorithms on geospatial and satellite data.

Along the way, I've become skilled in working with complex datasets, designing data pipelines, and building efficient models. Currently, my focus is on data-centric AI research and the design of production-grade machine learning systems. I'm also actively exploring emerging developments in Agentic AI, multimodal learning, and their integration into business applications.

Key Achievements

14+ Projects Completed
8+ Publications
5+ Conferences
20+ Technologies

Education

University of Pennsylvania

School of Engineering and Applied Science

Master of Science in Engineering in Data Science

GPA: 4.0/4.0
Philadelphia, PA
May 2027
Coursework: Statistics for Data Science, Big Data Analytics, Computer Vision, Machine Learning

University of Mumbai

Bachelor of Engineering, Computer Engineering

GPA: 4.0/4.0
Rank: Department Rank 2
Mumbai, India
May 2025
Coursework: Advanced Statistics, Deep Learning, Machine Learning, NLP, Database Management Systems

Technical Expertise

Programming Languages

Python, SQL, R, MATLAB, C/C++, Java

ML/AI Frameworks

PyTorch, TensorFlow, Keras, LangChain, LangGraph, Scikit-Learn, Pandas, NumPy, Matplotlib, Seaborn, Plotly, PySpark

Cloud & Big Data

AWS, GCP, S3, EC2, SageMaker, Redshift, Lambda, BigQuery, Databricks, Apache Spark, Hadoop, Hive

Tools & Platforms

Git, Docker, Kubernetes, MLflow, Airflow, FastAPI, Tableau, Power BI, QuickSight, MongoDB

Work Experience

Data Science (AWS) Graduate Research Assistant

The Wharton School

Oct 2025 – Present
Philadelphia, PA
Research

Key Achievements

  • Built a scalable automation tool for ETL on 20M+ rows of ESG data across 150+ companies, improving experiment throughput by 40% and collaborating with 4 PhD researchers and faculty to formulate statistical approaches
  • Conducted A/B testing across 15 experiments and participated in peer review (code reviews, feedback cycles, experiment reruns) to improve code correctness, raise reproducibility by 35%, and automate 85% of data preprocessing tasks

Research Intern

University of Mumbai

Aug 2023 – Jun 2025
Mumbai, India
Research

Key Achievements

  • Architected a disentangled content-style representation framework for unpaired mammography (MG) to breast ultrasound (BUS) translation, achieving 90.3% pathology consistency and 94.5% diagnostic accuracy and reducing clinical hallucinations by 79%
  • Optimized a multi-objective loss function combining adversarial, KL-divergence, content cycle, and LPIPS on A100 GPUs, validating lesion morphology preservation and authoring a paper under review at IEEE ISBI 2026

Research Collaboration

Tata Consultancy Services (TCS)

Dec 2023 – Mar 2024
Mumbai, India
Research

Key Achievements

  • Implemented a novel image segmentation workflow in QGIS using Python scripts on Landsat satellite data, cutting analysis time by 60%
  • Produced more accurate burn severity maps estimating 25.8M tonnes of carbon emissions
  • Presented the research at IEEE IGARSS 2024 in Athens to an audience of over 500 academics and industry experts
  • Generated 12 follow-up conversations regarding potential workflow adoption

Data Science Intern

Uniconverge Technologies Pvt. Ltd.

Jun 2023 – Aug 2023
Mumbai, India
Internship

Key Achievements

  • Designed experiments and developed a clustered data processing pipeline on Hadoop to test and validate anomaly detection models, improving prediction accuracy by 15% and estimating ROI of $65k/year
  • Led the end-to-end modeling lifecycle in a 6-member team, transforming raw telemetry data into 20+ statistical features and validating a high-precision SVM model (92% accuracy) to minimize downtime for 10+ monitored machines

Data Science Intern

PHN Technologies Pvt. Ltd.

Apr 2023 – May 2023
Mumbai, India
Internship

Key Achievements

  • Processed 1M+ records from AWS S3 using Python, Pandas, and SQL; enforced schema-type checks and data-quality rules, engineered time-window and interaction features, and reduced missing/incorrect fields by 25%
  • Built a regression stack in SageMaker (Elastic Net, XGBoost, Quantile Regression) with nested CV, Bayesian tuning, and SHAP; delivered 92% hold-out accuracy and production dashboards in QuickSight, reducing report prep time by 30%

Featured Projects

Showcasing innovative solutions in AI, Machine Learning, and Data Science

⭐ Featured

Motor Crash Severity Diagnosis with Macroeconomic Unemployment Indicators

Processed and filtered 5.24M+ crash records using PySpark, Hadoop, SQL, and MLflow, running distributed processing in batches of 500K+ rows and reducing analysis time by 60%. Trained an ensemble of three classifiers (XGBoost, Random Forest, LightGBM) with custom preprocessing pipelines for mixed-type features and memory-efficient sparse matrix operations, achieving 93.4% accuracy and a 0.93 weighted F1-score, a 5.2% improvement over a linear baseline after iterative optimization.

Accuracy: 93.4%
F1-Score: 0.93
Performance: 2% improvement
Python, PySpark, Hadoop, SQL, MLflow, XGBoost, Random Forest, LightGBM, Sparse Matrices
⭐ Featured

Virtual Try-On: AI-Driven Fashion Technology

Developed a 2D upper-body virtual try-on pipeline using diffusion-based inpainting with Stable Diffusion, SAM (Segment Anything Model) for precise garment masking, and advanced image processing techniques. Addressed the critical challenge of online fashion returns (30-40%) by enabling customers to virtually try on clothing. Implemented pose keypoint alignment, thin-plate-spline warping, and ControlNet for structural guidance to achieve realistic garment fitting while preserving identity, pose, and background.

Python, Stable Diffusion, SAM, ControlNet, IP-Adapter, OpenCV, PyTorch, Diffusion Models
⭐ Featured

Border Surveillance System with AI-Driven Thermal Vision

Developed an AI-driven border surveillance system using thermal and night vision with a modified Faster R-CNN architecture.

Python, PyTorch, Faster R-CNN, OpenCV, AWS S3, CVAT
🏆 Top 30/5000 teams
🏆 1st Prize at National Expo
⭐ Featured

DocBot - Disease Prediction System

Built a disease prediction system using machine learning on a medical symptoms dataset. Achieved 97.3% accuracy with multiple classification models deployed on AWS Cloud.

Accuracy: 97.3%
Python, AWS SageMaker, XGBoost, SVM, Random Forest, Neural Networks
⭐ Featured

MotionScript: Sign Language to Text Converter

Developed a multi-headed CNN system for real-time sign language recognition with 96% accuracy. Integrated Google FLAN-T5 for NLP, reducing translation lag by 95%, and placed 2nd Runner-Up at IIT Bombay.

Accuracy: 96%
Python, TensorFlow, CNN, Google FLAN-T5, OpenCV, MediaPipe
⭐ Featured

Small Language Model Implementation with Resource Optimization

Built a small language model using PyTorch, trained and tested on the Encyclopedia Britannica and TinyStories datasets. Used a BPE tokenizer, efficient binary storage, a 6-layer Transformer with multi-head attention, and mixed precision training on an M3 Pro GPU. Achieved stable training with AdamW and cosine annealing, generating coherent text outputs.

Python, PyTorch, Transformer, BPE Tokenizer, Mixed Precision, AdamW, Cosine Annealing
⭐ Featured

Advanced Image Reconstruction Autoencoders

Developed a convolutional autoencoder for image reconstruction, achieving 0.92 SSIM and 38.5 dB PSNR. Implemented multi-stage convolutional and upsampling layers, reducing dimensionality by 75% while preserving 90% of the visual information. Conducted comparative analysis across epochs, reducing reconstruction error from 0.0156 to 0.0021.

Python, TensorFlow, Autoencoder, Convolutional Networks, Image Processing, SSIM, PSNR
⭐ Featured

A Minimal MCP Client & Server Demo

A simple repository that shows the process of building an MCP server and using Claude Desktop as a client. Features a Travel Desk system to handle employee travel requests, approvals, and history tracking — all accessible directly from Claude. Demonstrates how to modify the contents to develop specific MCP use cases.

Python, MCP, Claude Desktop, API Development, Travel Management
⭐ Featured

Llama-2-7B-GGML-Powered Blog Generator

Built on Meta's Llama 2 7B Chat model, this project generates blog posts in a few clicks. Features AI-powered blog generation, customizable writing styles (Fun, General, Professional), word count specification, and a user-friendly Streamlit-based web interface.

Python, Llama-2-7B, Streamlit, GGML, Blog Generation, NLP
⭐ Featured

Online Retail Analysis using Fireducks

Comprehensive analysis of online retail data demonstrating how Fireducks significantly speeds up data processing and analysis compared to traditional methods. Showcases performance improvements in data manipulation, aggregation, and visualization for retail analytics.

Python, Fireducks, Data Analysis, Retail Analytics, Performance Optimization

Research Publications

Peer-reviewed research in AI, Machine Learning, and Data Science

IEEE

Knowledge Graph Guided SpatioTemporal Soil Nutrient Synthesis Using a Conditional Variational Autoencoder

C. Kakade, K. Patil, U. Bharambe, C. Mahajan

IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)
Washington, D.C., USA
2026
Under Review
IEEE

Pathology Preserving Cross Modal Translation via Dis-entangled Content Style Representation for Mammography and Breast Ultrasound

C. Kakade, U. Bharambe

IEEE International Symposium on Biomedical Imaging (ISBI 2026)
2026
Under Review
IEEE

A Multimodal Framework for Spatiotemporal Causal Analysis of Mumbai's Air Pollution Using Social Media Insights and Remote Sensing

C. Kakade, K. Patil, U. Bharambe, C. Mahajan

IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2025)
Brisbane, Australia
2025
Published
IEEE

Optimal Detection of Diabetic Retinopathy Severity Levels Using Attention-Based CNN and Vision Transformers (ViT)

C. Kakade, S. Gupta, L. Chen

IEEE International Conference on Modeling, Simulation & Intelligent Computing
Dubai, UAE
2025
Published

Effectiveness of Kolmogorov-Arnold Networks (KANs) and Analysis of Machine Learning Algorithms in Heart Disease Prediction

C. Kakade, A. Sharma, R. Patel

2025
Under Review

Evaluating Real-NVP For Spatio-Temporal Modelling In Synthetic Soil Nutrient Data Generation

C. Kakade, M. Singh, P. Kumar

2025
Under Review
IEEE

Enhancing Sign Language Interpretation with Multi-Headed CNN, Hand Landmarks and Large Language Model (LLM)

C. Kakade, N. Kadam, V. Kaira, R. Kewalya

IEEE Future Machine Learning and Data Science (FMLDS 2024)
Sydney, Australia
Nov 20, 2024
Published
IEEE

Carbon Emission Estimation in Sahyadri (Western Ghats) Resulting from Burning Grassland Biomass

C. Kakade, K. Patil, U. Bharambe, C. Mahajan

IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2024)
Athens, Greece
Jul 9, 2024
Published
IEEE

Predictive Analytics for Enhancing Crop Yield using Generative Adversarial Networks and its Challenges

C. Kakade, R. Verma, A. Kumar

IEEE India Geoscience and Remote Sensing Symposium (InGARSS 2024)
NIT Goa, India
2024
Published

MotionScript: Sign Language to Voice Converter

C. Kakade, N. Kadam, V. Kaira, R. Kewalya

International Journal of Computer Applications (IJCA)
Foundation of Computer Science (FCS), NY, USA
Jan 30, 2024
Published

10 Publications
7 IEEE Conferences
6 Published

Awards & Recognition

Celebrating achievements in academic excellence, research competitions, and professional speaking engagements

Academic Excellence

Principal's Excellence Award

Thadomal Shahani Engineering College
Mar 2025

Awarded to the top 5% of students for academic and research excellence

Competition

DIPEX 2025 Project Exhibition - Top 30

DIPEX 2025
Mar 2025

Ranked Top 30 of 5,000+ teams at DIPEX 2025 with a functional prototype, presented to investors and India's Defence Research and Development Organisation (DRDO)

Competition

U' LECTRO '25 National Level Project Expo - 1st Position

IETE-SF, MPSTME, NMIMS
Mar 2025

Secured 1st Position in the AI/ML Domain

Speaking

Guest Speaker – Deep Learning Workshop

CSI-TSEC, University of Mumbai
Aug 2025

Led a 3-hour session on Deep Learning, covering neural networks, backpropagation, optimizers, and advanced topics including LLMs and Transformers for real-world AI applications

Speaking

Guest Speaker – QGIS & Machine Learning for Research

CSI-TSEC, University of Mumbai
Aug 2025

Delivered a session on applying QGIS and ML concepts to solve real-world problems and support research initiatives

Speaking

Invited Speaker - Geospatial Computing and Applications

Value Added Course
2025

Delivered an in-depth demonstration of real-time visualization and analytics of data from IoT sensors suspended in a water body using QGIS, highlighting temperature patterns and insights

Speaking

Invited Speaker - Workshop on Analyzing Vegetation Health with ML

Computer Engineering Dept., Thadomal Shahani Engineering College
Sep 2024

Demonstrated practical applications of Machine Learning, Geospatial data, and Remote Sensing, showcasing live examples of how to use satellite imagery and tools like Google Earth Engine and QGIS to address real-world challenges

Competition

ResCon 2024, Research Presentation Competition - IIT Bombay

EnPoWER, IIT Bombay
Feb 2024

Secured 3rd place among 100+ teams

Competition

Techno Kagaz 2024, Research Conference

Shah and Anchor Kutchhi College of Engineering
Feb 2024

2nd Runner Up

Get In Touch

Let's discuss your data science needs or collaborate on exciting projects

Let's Connect

I'm always interested in hearing about new opportunities, interesting projects, or just having a chat about data science.

Email

kakadechaitanya77@gmail.com

Phone

(267) 258-6268

Location

Philadelphia, PA