
About Me
Hi, I'm Chaitanya Kakade, a candidate for the Master of Science in Engineering (M.S.E.) in Data Science at the University of Pennsylvania (UPenn), with expertise in building scalable machine learning systems and big data pipelines. My research focuses on data-centric AI, production-grade ML systems, Agentic AI, and multimodal learning. I graduated with a B.E. in Computer Engineering from the University of Mumbai with a 4.0 GPA and have published 10+ AI research papers (8+ at IEEE conferences) in Deep Learning, Computer Vision, NLP, and Geoscience.
Seeking Summer 2026 Data Scientist / MLE / Applied Scientist / Data Engineering internships.
Currently, I'm a Graduate Research Assistant at the Wharton School under Dr. Anne Jamison, where I've engineered data mining and ETL pipelines processing large-scale ESG data using Python, SQL, and Apache Spark. I collaborate with PhD researchers globally to validate statistical and machine learning models and deliver reproducible insights for ESG research. Before Penn, I conducted research at Tata Consultancy Services under Dr. Shailesh Deshpande and with research scientists at IIT Bombay, where I implemented machine learning algorithms on geospatial and satellite data.
Along the way, I've become skilled in working with complex datasets, designing data pipelines, and building efficient models. Currently, my focus is on data-centric AI research and the design of production-grade machine learning systems. I'm also actively exploring emerging developments in Agentic AI, multimodal learning, and their integration into business applications.
Education
University of Pennsylvania
School of Engineering and Applied Science
Master of Science in Engineering in Data Science
University of Mumbai
Bachelor of Engineering, Computer Engineering
Technical Expertise
Programming Languages
ML/AI Frameworks
Cloud & Big Data
Tools & Platforms
Work Experience
Data Science (AWS) Graduate Research Assistant
The Wharton School
Key Achievements
- Built a scalable automation tool for ETL on 20M+ rows of ESG data across 150+ companies, improving experiment throughput by 40% and collaborating with 4 PhD researchers and faculty to formulate statistical approaches
- Conducted A/B testing across 15 experiments and participated in peer review (code reviews, feedback cycles, experiment reruns), improving code correctness, raising reproducibility by 35%, and automating 85% of data preprocessing tasks
Research Intern
University of Mumbai
Key Achievements
- Architected a disentangled content-style representation framework for unpaired mammography (MG) to breast ultrasound (BUS) translation, achieving 90.3% pathology consistency and 94.5% diagnostic accuracy and reducing clinical hallucinations by 79%
- Optimized a multi-objective loss function combining adversarial, KL-divergence, content cycle, and LPIPS terms on A100 GPUs, validating lesion morphology preservation and authoring a paper under review at IEEE ISBI 2026
Research Collaboration
Tata Consultancy Services (TCS)
Key Achievements
- Implemented a novel image segmentation workflow in QGIS using Python scripts on Landsat Satellite data, cutting analysis time by 60%
- Produced more accurate burn severity maps estimating 25.8M tonnes of carbon emissions
- Presented research at IEEE IGARSS 2024 in Athens to an audience of over 500 academics and industry experts
- Generated 12 follow-up conversations regarding potential workflow adoption
Data Science Intern
Uniconverge Technologies Pvt. Ltd.
Key Achievements
- Designed experiments and developed a clustered data processing pipeline on Hadoop to test and validate anomaly detection models, improving prediction accuracy by 15% and estimating ROI of $65k/year
- Led the end-to-end modeling lifecycle in a 6-member team, transforming raw telemetry data into 20+ statistical features and validating a high-precision SVM model (92% accuracy) to minimize downtime for 10+ monitored machines
Data Science Intern
PHN Technologies Pvt. Ltd.
Key Achievements
- Processed 1M+ records from AWS S3 using Python, Pandas, and SQL; enforced schema-type checks and data-quality rules, engineered time-window and interaction features, and reduced missing/incorrect fields by 25%
- Built a regression stack in SageMaker (Elastic Net, XGBoost, Quantile Regression) with nested CV, Bayesian tuning, and SHAP; delivered 92% hold-out accuracy and production dashboards in QuickSight, reducing report prep time by 30%
Featured Projects
Showcasing innovative solutions in AI, Machine Learning, and Data Science
Motor Crash Severity Diagnosis with Macroeconomic Unemployment Indicators
Processed and filtered 5.24M+ crash records using PySpark, Hadoop, SQL, and MLflow; implemented clustered data processing on distributed clusters handling 500K+ rows per batch, reducing analysis time by 60%. Trained an ensemble of three classifiers (XGBoost, Random Forest, LightGBM) with custom preprocessing pipelines for mixed-type features, achieving 93.4% accuracy and 0.93 weighted F1-score via memory-efficient sparse matrix operations, delivering a 5.2% improvement over a linear baseline after iterative optimization.
Virtual Try-On: AI-Driven Fashion Technology
Developed a 2D upper-body virtual try-on pipeline using diffusion-based inpainting with Stable Diffusion, SAM (Segment Anything Model) for precise garment masking, and advanced image processing techniques. Addressed the critical challenge of online fashion returns (30-40%) by enabling customers to virtually try on clothing. Implemented pose keypoint alignment, thin-plate-spline warping, and ControlNet for structural guidance to achieve realistic garment fitting while preserving identity, pose, and background.

Border Surveillance System with AI-Driven Thermal Vision
Developed an AI-driven border surveillance system using thermal and night vision with modified Faster R-CNN architecture.
DocBot - Disease Prediction System
Built a comprehensive disease prediction system using machine learning on a medical symptoms dataset. Achieved 97.3% accuracy with multiple classification models deployed on AWS Cloud.
MotionScript: Sign Language to Text Converter
Developed a multi-headed CNN system for real-time sign language recognition with 96% accuracy. Integrated Google FLAN-T5 for NLP, reducing translation lag by 95% and earning 2nd Runner-Up at IIT Bombay.
Small Language Model Implementation with Resource Optimization
Built a small language model using PyTorch, trained and tested on the Encyclopedia Britannica and TinyStories datasets. Used a BPE tokenizer, efficient binary storage, a 6-layer Transformer with multi-head attention, and mixed precision training on an M3 Pro GPU. Achieved stable training with AdamW and cosine annealing, generating coherent text outputs.
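For readers unfamiliar with cosine annealing, the learning-rate schedule named above can be sketched in a few lines of plain Python. This is only an illustration of the general technique, not the project's training code: the function name `cosine_annealing_lr` and the default values of `lr_max` and `lr_min` are hypothetical, since the project's actual hyperparameters are not stated here.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max=3e-4, lr_min=0.0):
    """Return the learning rate at `step` under a cosine annealing schedule.

    The rate starts at lr_max, follows half a cosine wave, and reaches
    lr_min at total_steps. All default values are illustrative assumptions.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so the schedule stays at lr_min after training ends.
    step = min(max(step, 0), total_steps)
    cosine_term = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cosine_term


if __name__ == "__main__":
    # Print the rate at the start, midpoint, and end of a 100-step run.
    for s in (0, 50, 100):
        print(s, cosine_annealing_lr(s, total_steps=100))
```

The schedule decays slowly at first, fastest around the midpoint, and slowly again near the end, which is one reason it is often paired with AdamW for stable Transformer training.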
Advanced Image Reconstruction Autoencoders
Developed a convolutional autoencoder for image reconstruction, achieving 0.92 SSIM and 38.5 dB PSNR. Implemented multi-stage convolutional and upsampling layers, reducing dimensionality by 75% while preserving 90% visual information. Conducted comparative analysis across epochs, optimizing reconstruction error from 0.0156 to 0.0021.
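As a reference for the PSNR figure quoted above, here is a minimal sketch of how peak signal-to-noise ratio is conventionally computed from mean squared error. It assumes pixel values are flattened into plain lists and scaled to the range [0, `max_value`]; the function name `psnr` and the sample values are hypothetical and are not taken from the project.

```python
import math


def psnr(original, reconstructed, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists.

    Assumes pixel intensities lie in [0, max_value]; this scaling is an
    illustrative assumption, not a detail from the project.
    """
    if not original or len(original) != len(reconstructed):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # Identical images have unbounded PSNR.
    return 10.0 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    # Two toy "images" with a per-pixel error of 0.1 on a [0, 1] scale.
    print(psnr([0.5, 0.5], [0.4, 0.6]))
```

Higher values indicate a reconstruction closer to the original; because the scale is logarithmic, each 10 dB gain corresponds to a tenfold reduction in mean squared error.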
A Minimal MCP Client & Server Demo
A simple repository that shows the process of building an MCP server and using Claude Desktop as a client. Features a Travel Desk system to handle employee travel requests, approvals, and history tracking — all accessible directly from Claude. Demonstrates how to modify the contents to develop specific MCP use cases.
Llama-2-7B-GGML-Powered Blog Generator
Using the advanced Llama 2 7B Chat model by Meta, this project offers a seamless experience for generating high-quality blogs with just a few clicks. Features AI-powered blog generation, customizable writing styles (Fun, General, Professional), word count specification, and a user-friendly Streamlit-based web interface.
Online Retail Analysis using Fireducks
Comprehensive analysis of online retail data demonstrating how Fireducks significantly speeds up data processing and analysis compared to traditional methods. Showcases performance improvements in data manipulation, aggregation, and visualization for retail analytics.
Research Publications
Peer-reviewed research in AI, Machine Learning, and Data Science
Knowledge Graph Guided SpatioTemporal Soil Nutrient Synthesis Using a Conditional Variational Autoencoder
C. Kakade, K. Patil, U. Bharambe, C. Mahajan
Pathology Preserving Cross Modal Translation via Dis-entangled Content Style Representation for Mammography and Breast Ultrasound
C. Kakade, U. Bharambe
A Multimodal Framework for Spatiotemporal Causal Analysis of Mumbai's Air Pollution Using Social Media Insights and Remote Sensing
C. Kakade, K. Patil, U. Bharambe, C. Mahajan
Optimal Detection of Diabetic Retinopathy Severity Levels Using Attention-Based CNN and Vision Transformers (ViT)
C. Kakade, S. Gupta, L. Chen
Effectiveness of Kolmogorov-Arnold Networks (KANs) and Analysis of Machine Learning Algorithms in Heart Disease Prediction
C. Kakade, A. Sharma, R. Patel
Evaluating Real-NVP For Spatio-Temporal Modelling In Synthetic Soil Nutrient Data Generation
C. Kakade, M. Singh, P. Kumar
Enhancing Sign Language Interpretation with Multi-Headed CNN, Hand Landmarks and Large Language Model (LLM)
C. Kakade, N. Kadam, V. Kaira, R. Kewalya
Carbon Emission Estimation in Sahyadri (Western Ghats) Resulting from Burning Grassland Biomass
C. Kakade, K. Patil, U. Bharambe, C. Mahajan
Predictive Analytics for Enhancing Crop Yield using Generative Adversarial Networks and its Challenges
C. Kakade, R. Verma, A. Kumar
MotionScript: Sign Language to Voice Converter
C. Kakade, N. Kadam, V. Kaira, R. Kewalya
Awards & Recognition
Celebrating achievements in academic excellence, research competitions, and professional speaking engagements
Principal's Excellence Award
Awarded to the top 5% of students for academic and research excellence
DIPEX 2025 Project Exhibition - Top 30
Ranked Top 30 of 5,000+ teams at DIPEX 2025 with a functional prototype, presented to investors and India's Defence Research and Development Organisation (DRDO)
U' LECTRO '25 National Level Project Expo - 1st Position
Secured 1st Position in the AI/ML Domain
Guest Speaker – Deep Learning Workshop
Led a 3-hour session on Deep Learning, covering neural networks, backpropagation, optimizers, and advanced topics including LLMs and Transformers for real-world AI applications
Guest Speaker – QGIS & Machine Learning for Research
Delivered a session on applying QGIS and ML concepts to solve real-world problems and support research initiatives
Invited Speaker - Geospatial Computing and Applications
Delivered an in-depth demonstration of real-time visualization and analytics in QGIS for data from IoT sensors suspended in a water body, highlighting temperature patterns and insights
Invited Speaker - Workshop on Analyzing Vegetation Health with ML
Demonstrated practical applications of Machine Learning, Geospatial data, and Remote Sensing, showcasing live examples of how to use satellite imagery and tools like Google Earth Engine and QGIS to address real-world challenges
ResCon 2024, Research Presentation Competition - IIT Bombay
Secured 3rd place among 100+ teams

Techno Kagaz 2024, Research Conference
2nd Runner Up
Get In Touch
Let's discuss your data science needs or collaborate on exciting projects
Let's Connect
I'm always interested in hearing about new opportunities, interesting projects, or just having a chat about data science.
kakadechaitanya77@gmail.com
Phone
(267) 258-6268
Location
Philadelphia, PA
