Data Scientist · ML Researcher

Chaitanya Kakade

M.S.E. Data Science · University of Pennsylvania

Graduate Research Assistant at the Wharton School · AI & ML research with a focus on computer vision and applied ML systems.

Seeking Fall 2026 Data Scientist / MLE / Applied Scientist / Data Engineering internships and Full-Time 2027 opportunities.

Philadelphia, USA

About

Building research-grade ML systems with a focus on real-world impact.

I'm an M.S.E. in Data Science candidate at the University of Pennsylvania, focused on data-centric AI, production-grade ML systems, and multimodal learning. I earned a B.E. in Computer Engineering from the University of Mumbai with a 4.0 GPA and have authored 10+ AI research papers across deep learning, computer vision, NLP, and geoscience, 8+ of them at IEEE conferences.

This summer I'll be joining Cotiviti as a Generative AI Developer Intern, researching agentic AI pipelines and reinforcement learning systems for healthcare informatics. The work focuses on applying emerging AGI capabilities to treatment, payment, and operations workflows, and on developing Generative AI prototypes and analytical tooling to support clinical decision-making.

I'm currently a Graduate Research Assistant at the Wharton School under Dr. Anne Jamison, building ETL and data-mining pipelines over large-scale ESG data with Python, SQL, and Apache Spark, and collaborating with PhD researchers on reproducible statistical and ML analyses.

Before Penn, I conducted research at Tata Consultancy Services with Dr. Shailesh Deshpande, and with research scientists at IIT Bombay, applying ML to geospatial and satellite data. My current interests sit at the intersection of data-centric evaluation, agentic AI, and multimodal systems.

10+
Publications
8+
IEEE conferences
14+
Projects shipped

Academic Background

Education

University of Pennsylvania

School of Engineering and Applied Science

M.S.E. in Data Science

May 2027 · Philadelphia, PA, USA · GPA 4.0 / 4.0
Coursework
Statistics for Data Science · Big Data Analytics · Computer Vision · Machine Learning

University of Mumbai

Thadomal Shahani Engineering College

B.E. in Computer Engineering

May 2025 · Mumbai, India · GPA 4.0 / 4.0 · Department Rank 2
Coursework
Advanced Statistics · Deep Learning · Machine Learning · NLP · Database Management Systems

Technical Stack

Tools & Technologies

The stack I reach for across research, production ML, and data infrastructure work.

Languages
Python · SQL · R · MATLAB · Bash · C/C++ · Java
Databases
PostgreSQL · pgvector · MongoDB · Neo4j · Redshift · BigQuery
ML / DL
PyTorch · TensorFlow · Keras · Hugging Face · Scikit-Learn · OpenCV · XGBoost · LightGBM · PySpark · Pandas · NumPy · SciPy
LLMs / Agents
LLMs · VLMs · RAG · LangChain · LangGraph · MCP · Multimodal · Stable Diffusion · ControlNet
AWS
S3 · EC2 · RDS · SageMaker · Lambda · Redshift · QuickSight
Big data
Apache Spark · Hadoop · Databricks · Hive
Tooling
Git · GitHub Actions · Docker · Kubernetes · Linux · MLflow · Airflow · FastAPI
Visualization
Matplotlib · Seaborn · Plotly · Tableau · Power BI

Experience

Work

Research and applied data science roles spanning agentic AI and healthcare informatics, AWS-based ESG analytics, medical imaging, geospatial ML, and large-scale ETL.

Previously at The Wharton School · University of Mumbai · TCS Research · IIT Bombay

Summer 2026 · South Jordan, UT

Generative AI Developer Intern

Cotiviti · Internship

  • Will design and ship end-to-end agentic AI workflow pipelines for healthcare informatics, with tool-use, planning, and verifier loops aimed at reliable clinical and operational decision support.
  • Will research RL-based post-training (RLHF, DPO, and reward-model variants) to align LLM behavior on healthcare tasks under safety, traceability, and bias constraints.
  • Will explore agentic RAG architectures (retrieval-augmented multi-step reasoning over clinical, payment, and operations data) to push beyond single-shot prompting toward auditable AGI-grade workflows.

Oct 2025 – Present · Philadelphia, PA, USA

Data Science (AWS) Graduate Research Assistant

The Wharton School · Research

  • Built a scalable automation tool for ETL on 20M+ rows of ESG data (500+ features) across 150+ companies on AWS, improving experiment throughput by 40% and collaborating globally with 8 PhD researchers and faculty to formulate statistical approaches
  • Designed experimental frameworks across 15 A/B variants with statistical modeling and ML approaches; built reproducible data pipelines with automated validation, improving reproducibility by 35% and automating 85% of data preprocessing tasks

Aug 2023 – Jun 2025 · Mumbai, India

Research Intern · Dr. Ujwala Bharambe

University of Mumbai · Research

  • Architected a disentangled content-style representation framework for unpaired MG-to-BUS translation, achieving 90.3% pathology consistency and 94.5% diagnostic accuracy and reducing clinical hallucinations by 79%
  • Optimized a multi-objective loss function combining adversarial, KL-divergence, and content cycle terms, reducing LPIPS from 0.118 to 0.103 on A100 GPUs and validating lesion morphology preservation; authored a paper under review at IEEE ISBI 2026
  • Built a Knowledge Graph-guided Conditional VAE for spatiotemporal soil nutrient synthesis (accepted at IEEE IGARSS 2026); developed a multimodal framework integrating NLP (social media) with remote sensing imagery for causal air pollution analysis (IEEE IGARSS 2025)
  • Led 3 concurrent research tracks publishing 5+ IEEE papers across medical imaging, computer vision, NLP, and generative models (GANs, VAEs, Diffusion)

Dec 2023 – Mar 2024 · Mumbai, India

Research Intern · Dr. Shailesh Deshpande

Tata Consultancy Services (TCS) · Research

  • Applied data mining and ML algorithms to process and filter 100M+ multi-spectral satellite image pixels, reducing burn severity analysis time by 65% using clustered data processing on Apache Spark
  • Implemented a novel image segmentation workflow in QGIS using Python scripts on Landsat data; produced accurate burn severity maps estimating 25.8M tonnes of carbon emissions
  • Authored and presented a technical research paper at IEEE IGARSS 2024 in Athens; built dashboards and reports delivering insights to 50+ global stakeholders and 500+ academics, generating 12 follow-up conversations on workflow adoption

Jun 2023 – Aug 2023 · Mumbai, India

Data Science Intern

Uniconverge Technologies Pvt. Ltd. · Internship

  • Developed a clustered data processing pipeline on Hadoop using SQL and MapReduce to process 2M+ sensor data points for anomaly detection, improving prediction accuracy by 15% and projecting $65k in annual savings
  • Communicated with 50+ industrial customers and stakeholders to understand operations and business requirements, translating needs into technical approaches aligned with KPIs for predictive maintenance solutions
  • Led the end-to-end modeling lifecycle in a 6-member team, transforming raw telemetry data into 20+ statistical features and validating a high-precision SVM model (92% accuracy) to minimize downtime for 10+ monitored machines

Apr 2023 – May 2023 · Mumbai, India

Data Science Intern

PHN Technologies Pvt. Ltd. · Internship

  • Processed 1M+ records from AWS S3 using Python, Pandas, and SQL; enforced schema-type checks and data-quality rules, engineered time-window and interaction features, and reduced missing/incorrect fields by 25%
  • Built a regression stack in SageMaker (Elastic Net, XGBoost, Quantile Regression) with nested CV, Bayesian tuning, and SHAP; delivered 92% hold-out accuracy and production dashboards in QuickSight, reducing report prep time by 30%

Selected Work

Projects

Applied research and engineering across machine learning, computer vision, and large-scale data systems.

Data Engineering

University of Pennsylvania · CIS 5500

diningDB — Yelp × U.S. Census Data Application

Built a data analytics platform integrating 7M+ Yelp reviews, 150K businesses, and U.S. Census economic indicators across 33K ZIP Code Tabulation Areas to quantify how neighborhood income, density, and reviewer behavior shape restaurant outcomes. Engineered a chunked ETL pipeline in Python and Pandas to load a 5.34GB review corpus into a 9-table BCNF-normalized PostgreSQL schema on AWS RDS. Authored 10 production analytical SQL queries using recursive CTEs, window functions (RANK, NTILE(4), PARTITION BY), and aggregate-before-join optimizations. Reduced critical query latency by up to ~99% (54s → 487ms; 7.8s → 431ms; 11s → 1.3s) via query restructuring, partial/composite B-tree indexes, and materialized views, validated through EXPLAIN ANALYZE. Implemented a vector search pipeline with all-MiniLM-L6-v2 sentence embeddings stored in pgvector for cosine-similarity retrieval. Modeled reviewer bias and surfaced systematic generous-vs-harsh behavior as a foundation for credibility weighting.

PostgreSQL · AWS RDS · pgvector · Python · Pandas · SQL · NLP · sentence-transformers · Node.js · Express

Machine Learning

University of Pennsylvania · CIS 5450

Motor Crash Severity Diagnosis with Macroeconomic Unemployment Indicators

Processed and filtered 5.24M+ crash records using PySpark, Hadoop, SQL, and MLflow; ran distributed batch processing at 500K+ rows per batch, reducing analysis time by 60%. Trained an ensemble of three classifiers (XGBoost, Random Forest, LightGBM) with custom preprocessing pipelines for mixed-type features and memory-efficient sparse matrix operations, achieving 93.4% accuracy and a 0.93 weighted F1-score, a 5.2% improvement over a linear baseline after iterative optimization.

Accuracy 93.4%
F1-Score 0.93
Performance 5.2% improvement
Python · PySpark · Hadoop · SQL · MLflow · XGBoost · Random Forest · LightGBM · Sparse Matrices

Computer Vision

University of Pennsylvania · CIS 5810

Virtual Try-On: AI-Driven Fashion Technology

Developed a 2D upper-body virtual try-on pipeline using diffusion-based inpainting with Stable Diffusion and SAM (Segment Anything Model) for precise garment masking. Targeted the high return rate in online fashion (30-40%) by letting customers virtually try on clothing. Implemented pose keypoint alignment, thin-plate-spline warping, and ControlNet structural guidance to achieve realistic garment fitting while preserving identity, pose, and background.

Python · Stable Diffusion · SAM · ControlNet · IP-Adapter · OpenCV · PyTorch · Diffusion Models

Computer Vision

University of Mumbai · Final Year B.E. Thesis

Border Surveillance System with AI-Driven Thermal Vision

Developed an AI-driven border surveillance system using thermal and night vision with a modified Faster R-CNN architecture.

Python · PyTorch · Faster R-CNN · OpenCV · AWS S3 · CVAT · Top 30/5000 teams · 1st Prize at National Expo

Machine Learning

DocBot - Disease Prediction System

Built a disease prediction system using machine learning on a medical symptoms dataset. Achieved 97.3% accuracy with multiple classification models deployed on AWS.

Accuracy 97.3%
Python · AWS SageMaker · XGBoost · SVM · Random Forest · Neural Networks

Deep Learning

MotionScript: Sign Language to Text Converter

Developed a multi-headed CNN system for real-time sign language recognition with 96% accuracy. Integrated Google FLAN-T5 for NLP, reducing translation lag by 95%, and placed 2nd Runner-Up at IIT Bombay.

Accuracy 96%
Python · TensorFlow · CNN · Google FLAN-T5 · OpenCV · MediaPipe

LLM Systems

Retrieval-Augmented Generation Pipeline for Page-Aware PDF QA

Built a page-aware RAG ingestion pipeline using PyMuPDF that extracts native PDF text with page-level metadata (character/word counts and token budget estimates) to support traceable retrieval. Implemented a fast, deterministic fixed-size chunking baseline with page and chunk indexing to enable consistent retrieval, easier auditing, and reproducible evaluation across runs. Designed the system to fit downstream into vector-store-backed QA workflows with predictable token economics.

Python · PyMuPDF · RAG · Embeddings · Vector Search · Chunking · Evaluation
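
The fixed-size chunking baseline with page and chunk indexing can be sketched roughly as follows. This is a minimal illustration, not the project's code: it assumes pages have already been extracted as (page number, text) pairs (for example with PyMuPDF), and the names `Chunk` and `chunk_pages`, the 500-character window, and the rule of thumb of about 4 characters per token are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Chunk:
    page: int         # page number the text came from
    chunk_index: int  # position of the chunk within its page (0-based)
    text: str
    est_tokens: int   # rough token estimate (assumption: ~4 characters per token)


def chunk_pages(pages: Iterable[Tuple[int, str]], chunk_chars: int = 500) -> List[Chunk]:
    """Split each page into fixed-size character windows, keeping page and chunk indices."""
    if chunk_chars <= 0:
        raise ValueError("chunk_chars must be positive")
    chunks: List[Chunk] = []
    for page_number, text in pages:
        # Deterministic windows: the same input always yields the same chunks.
        for index, start in enumerate(range(0, len(text), chunk_chars)):
            piece = text[start:start + chunk_chars]
            chunks.append(Chunk(page_number, index, piece, max(1, len(piece) // 4)))
    return chunks
```

Because every chunk carries its page number and in-page index, a retrieved passage can be traced back to its source page, and re-running the function on the same input reproduces the same chunk boundaries.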

Deep Learning

Small Language Model Implementation with Resource Optimization

Built a 6-layer, 12M-parameter GPT Transformer in PyTorch, trained and tested on Encyclopedia Britannica and TinyStories. Applied BPE tokenization to 50K+ text samples, with efficient binary storage and mixed-precision training on an M3 Pro GPU. Achieved stable training with AdamW and cosine annealing, generating coherent text outputs and validating performance against larger benchmarks. (4K+ views on the accompanying Medium article.)

Python · PyTorch · Transformer · BPE Tokenizer · Mixed Precision · AdamW · Cosine Annealing
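
For readers unfamiliar with cosine annealing, the schedule can be written in a few lines. This is a standalone sketch of the standard formula (without warm restarts), not the training code from the project; the function name `cosine_annealing_lr` and the learning-rate bounds are illustrative assumptions.

```python
import math


def cosine_annealing_lr(step: int, total_steps: int,
                        lr_max: float = 3e-4, lr_min: float = 3e-5) -> float:
    """Learning rate at `step`, decaying from lr_max to lr_min along a half cosine.

    lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * step / total_steps))
    The default bounds are placeholder values chosen for illustration.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    step = min(max(step, 0), total_steps)  # clamp so the rate never leaves [lr_min, lr_max]
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr_max - lr_min) * cosine
```

The rate starts at `lr_max`, passes through the midpoint of the two bounds halfway through training, and ends at `lr_min`, which avoids the abrupt drops of a step schedule.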

Deep Learning

Advanced Image Reconstruction Autoencoders

Developed a convolutional autoencoder for image reconstruction, achieving 0.92 SSIM and 38.5 dB PSNR. Implemented multi-stage convolutional and upsampling layers, reducing dimensionality by 75% while preserving 90% visual information. Conducted comparative analysis across epochs, optimizing reconstruction error from 0.0156 to 0.0021.

Python · TensorFlow · Autoencoder · Convolutional Networks · Image Processing · SSIM · PSNR
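
As a reference for the PSNR metric reported above, here is a minimal sketch of how it is derived from mean squared error. It assumes pixel values are flattened into equal-length sequences scaled to [0, 1]; the function names `mse` and `psnr` are illustrative and not taken from the project.

```python
import math
from typing import Sequence


def mse(original: Sequence[float], reconstructed: Sequence[float]) -> float:
    """Mean squared error between two equal-length pixel sequences."""
    if len(original) != len(reconstructed) or len(original) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((x - y) ** 2 for x, y in zip(original, reconstructed)) / len(original)


def psnr(original: Sequence[float], reconstructed: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    error = mse(original, reconstructed)
    if error == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / error)
```

Higher PSNR means a closer reconstruction; with 8-bit images, `max_value` would be 255 instead of 1.0.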

Machine Learning

A Minimal MCP Client & Server Demo

A simple repository that shows the process of building an MCP server and using Claude Desktop as a client. Features a Travel Desk system to handle employee travel requests, approvals, and history tracking — all accessible directly from Claude. Demonstrates how to modify the contents to develop specific MCP use cases.

Python · MCP · Claude Desktop · API Development · Travel Management

Deep Learning

Llama-2-7B-GGML-Powered Blog Generator

Generates blog posts with Meta's Llama 2 7B Chat model in GGML format through a Streamlit web interface. Supports selectable writing styles (Fun, General, Professional) and a target word count.

Python · Llama-2-7B · Streamlit · GGML · Blog Generation · NLP

Machine Learning

Online Retail Analysis using Fireducks

Comprehensive analysis of online retail data demonstrating how Fireducks significantly speeds up data processing and analysis compared to traditional methods. Showcases performance improvements in data manipulation, aggregation, and visualization for retail analytics.

Python · Fireducks · Data Analysis · Retail Analytics · Performance Optimization

Publications

Research

Peer-reviewed work in AI, machine learning, computer vision, and Earth observation, with multiple IEEE conference appearances.

11
Publications
9
IEEE Conferences
6
Published
8
Citations

Recognition

Awards & Honors

Academic, competition, and invited speaking recognitions across research and engineering work.

  1. Mar 2026
    Academic Excellence

    Silver Medal — Department Rank 2

    Thadomal Shahani Engineering College, University of Mumbai

    Awarded the Silver Medal for graduating second in the department in the B.E. Computer Engineering program.

  2. Mar 2025
    Academic Excellence

    Principal's Excellence Award

    Thadomal Shahani Engineering College, University of Mumbai

    Awarded to the top 0.1% of students for academic and research excellence

  3. Mar 2025
    Competition

    DIPEX 2025 Project Exhibition - Top 30

    DIPEX 2025

    Ranked Top 30 of 5,000+ teams at DIPEX 2025 with a functional prototype, presented to investors and India's Defence Research and Development Organisation (DRDO)

  4. Mar 2025
    Competition

    U' LECTRO '25 National Level Project Expo - 1st Position

    IETE-SF, MPSTME, NMIMS

    Secured 1st Position in the AI/ML Domain

  5. Aug 2025
    Speaking

    Guest Speaker – Deep Learning Workshop

    CSI-TSEC, University of Mumbai

    Led a 3-hour session on Deep Learning, covering neural networks, backpropagation, optimizers, and advanced topics including LLMs and Transformers for real-world AI applications

  6. Aug 2025
    Speaking

    Guest Speaker – QGIS & Machine Learning for Research

    CSI-TSEC, University of Mumbai

    Delivered a session on applying QGIS and ML concepts to solve real-world problems and support research initiatives

  7. 2025
    Speaking

    Invited Speaker - Geospatial Computing and Applications

    Value Added Course

    Delivered an in-depth demonstration of real-time data visualization and analytics from IoT sensors suspended in a water body using QGIS, demonstrating temperature patterns and insights

  8. Sep 2024
    Speaking

    Invited Speaker - Workshop on Analyzing Vegetation Health with ML

    Computer Engineering Dept., Thadomal Shahani Engineering College

    Demonstrated practical applications of Machine Learning, Geospatial data, and Remote Sensing, showcasing live examples of how to use satellite imagery and tools like Google Earth Engine and QGIS to address real-world challenges

  9. Feb 2024
    Competition

    ResCon 2024, Research Presentation Competition - IIT Bombay

    EnPoWER, IIT Bombay

    Secured 3rd place among 100+ teams

  10. Feb 2024
    Competition

    Techno Kagaz 2024, Research Conference

    Shah and Anchor Kutchhi College of Engineering

    2nd Runner Up

Contact

Let's connect

Open to research collaborations, internship opportunities for Fall 2026, and full-time roles for 2027.

I'm also happy to chat about data science and ML.

Email

kakadechaitanya77@gmail.com

Location

Philadelphia, PA, USA