University of Pennsylvania
School of Engineering and Applied Science
M.S.E. in Data Science

Data Scientist · ML Researcher
M.S.E. Data Science · University of Pennsylvania
Graduate Research Assistant at the Wharton School · AI & ML research with a focus on computer vision and applied ML systems.
Seeking Fall 2026 Data Scientist / MLE / Applied Scientist / Data Engineering internships and Full-Time 2027 opportunities.
About
I'm an M.S.E. in Data Science candidate at the University of Pennsylvania, focused on data-centric AI, production-grade ML systems, and multimodal learning. I earned a B.E. in Computer Engineering from the University of Mumbai with a 4.0 GPA and have authored 10+ AI research papers across deep learning, computer vision, NLP, and geoscience, 8+ of them at IEEE conferences.
This summer I'll be joining Cotiviti as a Generative AI Developer Intern, researching agentic AI pipelines and reinforcement learning systems for healthcare informatics, with a focus on applying emerging AGI capabilities to treatment, payment, and operations workflows. I'll also be developing Generative AI prototypes and analytical tooling to support clinical decision-making.
I'm currently a Graduate Research Assistant at the Wharton School under Dr. Anne Jamison, building ETL and data-mining pipelines over large-scale ESG data with Python, SQL, and Apache Spark, and collaborating with PhD researchers on reproducible statistical and ML analyses.
Before Penn, I conducted research at Tata Consultancy Services with Dr. Shailesh Deshpande, and with research scientists at IIT Bombay, applying ML to geospatial and satellite data. My current interests sit at the intersection of data-centric evaluation, agentic AI, and multimodal systems.
Academic Background
School of Engineering and Applied Science
M.S.E. in Data Science

Thadomal Shahani Engineering College
B.E. in Computer Engineering

Technical Stack
The stack I reach for across research, production ML, and data infrastructure work.
Experience
Research and applied data science roles spanning agentic AI and healthcare informatics, AWS-based ESG analytics, medical imaging, geospatial ML, and large-scale ETL.
Summer 2026 · South Jordan, UT
Cotiviti · Internship

Oct 2025 – Present · Philadelphia, PA, USA
The Wharton School · Research
Aug 2023 – Jun 2025 · Mumbai, India
University of Mumbai · Research
Dec 2023 – Mar 2024 · Mumbai, India
Tata Consultancy Services (TCS) · Research
Jun 2023 – Aug 2023 · Mumbai, India
Uniconverge Technologies Pvt. Ltd. · Internship
Apr 2023 – May 2023 · Mumbai, India
PHN Technologies Pvt. Ltd. · Internship
Selected Work
Applied research and engineering across machine learning, computer vision, and large-scale data systems.
Data Engineering
CIS 5500
Built a data analytics platform integrating 7M+ Yelp reviews, 150K businesses, and U.S. Census economic indicators across 33K ZIP Code Tabulation Areas to quantify how neighborhood income, density, and reviewer behavior shape restaurant outcomes. Engineered a chunked ETL pipeline in Python and Pandas that loads a 5.34 GB review corpus into a 9-table BCNF-normalized PostgreSQL schema on AWS RDS. Authored 10 production analytical SQL queries using recursive CTEs, window functions (RANK and NTILE(4) over PARTITION BY windows), and aggregate-before-join optimizations. Reduced critical query latency by up to ~99% (54s → 487ms; 7.8s → 431ms; 11s → 1.3s) via query restructuring, partial/composite B-tree indexes, and materialized views, validated through EXPLAIN ANALYZE. Implemented a vector search pipeline with all-MiniLM-L6-v2 sentence embeddings stored in pgvector for cosine-similarity retrieval. Modeled reviewer bias and surfaced systematic generous-vs-harsh behavior as a foundation for credibility weighting.
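A minimal sketch of what a chunked load can look like, using only the Python standard library. It is an illustration under assumptions, not the project's actual pipeline: `sqlite3` stands in for PostgreSQL, the `review` table and the `review_id` / `business_id` / `stars` fields are assumed for the example, and `iter_chunks` and `load_reviews` are hypothetical helper names.

```python
import json
import sqlite3
from itertools import islice
from typing import Iterable, Iterator


def iter_chunks(lines: Iterable[str], chunk_size: int) -> Iterator[list[dict]]:
    """Yield lists of parsed JSON records, at most chunk_size records each."""
    # Drop blank lines up front so an all-blank slice cannot end iteration early.
    non_blank = (line for line in lines if line.strip())
    while True:
        chunk = [json.loads(line) for line in islice(non_blank, chunk_size)]
        if not chunk:
            return
        yield chunk


def load_reviews(lines: Iterable[str], conn: sqlite3.Connection,
                 chunk_size: int = 50_000) -> int:
    """Load JSON-lines reviews chunk by chunk; return the number of rows written."""
    # Assumed target table; the real schema is not described in detail here.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS review ("
        "review_id TEXT PRIMARY KEY, business_id TEXT, stars REAL)"
    )
    total = 0
    for chunk in iter_chunks(lines, chunk_size):
        rows = [(r["review_id"], r["business_id"], float(r["stars"])) for r in chunk]
        with conn:  # one transaction per chunk keeps memory and lock time bounded
            conn.executemany("INSERT OR REPLACE INTO review VALUES (?, ?, ?)", rows)
        total += len(rows)
    return total
```

Because only one chunk is held in memory at a time, the same loop works whether the input is a short list of strings or an open file handle over a multi-gigabyte corpus.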
Machine Learning
CIS 5450
Processed and filtered 5.24M+ crash records using PySpark, Hadoop, SQL, and MLflow; ran the processing on distributed clusters in batches of 500K+ rows, reducing analysis time by 60%. Trained an ensemble of three classifiers (XGBoost, Random Forest, LightGBM) with custom preprocessing pipelines for mixed-type features, achieving 93.4% accuracy and a 0.93 weighted F1-score using memory-efficient sparse matrix operations, a 5.2% improvement over a linear baseline after iterative optimization.
Computer Vision
CIS 5810
Developed a 2D upper-body virtual try-on pipeline using diffusion-based inpainting with Stable Diffusion, SAM (Segment Anything Model) for precise garment masking, and advanced image processing techniques. Addressed the high return rates in online fashion (30-40%) by enabling customers to virtually try on clothing. Implemented pose keypoint alignment, thin-plate-spline warping, and ControlNet for structural guidance to achieve realistic garment fitting while preserving identity, pose, and background.

Computer Vision
Final Year B.E. Thesis
Developed an AI-driven border surveillance system using thermal and night-vision imagery with a modified Faster R-CNN architecture.
Machine Learning
Built a comprehensive disease prediction system using machine learning on a medical symptoms dataset. Achieved 97.3% accuracy with multiple classification models deployed on AWS Cloud.
Deep Learning
Developed a multi-headed CNN system for real-time sign language recognition with 96% accuracy. Integrated Google's FLAN-T5 for NLP, reducing translation lag by 95% and earning 2nd Runner-Up at IIT Bombay.
LLM Systems
Built a page-aware RAG ingestion pipeline using PyMuPDF that extracts native PDF text with page-level metadata (character/word counts and token budget estimates) to support traceable retrieval. Implemented a fast, deterministic fixed-size chunking baseline with page and chunk indexing to enable consistent retrieval, easier auditing, and reproducible evaluation across runs. Designed the system to fit downstream into vector-store-backed QA workflows with predictable token economics.
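The fixed-size chunking baseline with page and chunk indexing can be sketched roughly as follows. This is an illustration under assumptions rather than the project's code: the input is a plain list of per-page strings (standing in for text already extracted from the PDF), `Chunk` and `chunk_pages` are hypothetical names, and the four-characters-per-token estimate is a common rule of thumb, not a figure from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    page: int        # 1-based page number the text came from
    chunk: int       # 0-based index of the chunk within its page
    text: str
    est_tokens: int  # rough token budget estimate for this chunk


def chunk_pages(pages: list[str], size: int = 500,
                chars_per_token: int = 4) -> list[Chunk]:
    """Split each page into consecutive fixed-size character windows."""
    if size <= 0:
        raise ValueError("size must be positive")
    chunks: list[Chunk] = []
    for page_no, text in enumerate(pages, start=1):
        # range() with a fixed step makes the split fully deterministic.
        for idx, start in enumerate(range(0, len(text), size)):
            piece = text[start:start + size]
            est = max(1, len(piece) // chars_per_token)  # assumed heuristic
            chunks.append(Chunk(page_no, idx, piece, est))
    return chunks
```

Since every chunk carries its `(page, chunk)` pair, a retrieved passage can be traced back to an exact location, and two runs over the same input produce identical chunk lists.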
Deep Learning
Built a 6-layer, 12M-parameter GPT-style Transformer in PyTorch, trained and tested on Encyclopedia Britannica and TinyStories. Developed BPE-based tokenization over 50K+ text samples, with efficient binary storage and mixed-precision training on an M3 Pro GPU. Achieved stable training with AdamW and cosine annealing, generating coherent text outputs and validating performance against larger benchmarks. (4K+ views on the accompanying Medium article.)
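For readers unfamiliar with cosine annealing, the schedule decays the learning rate along half a cosine wave. The sketch below shows the standard single-cycle formula in plain Python; the function name and the default `lr_max` / `lr_min` values are illustrative assumptions, not the project's settings. In PyTorch the same shape is available through `torch.optim.lr_scheduler.CosineAnnealingLR`.

```python
import math


def cosine_annealed_lr(step: int, total_steps: int,
                       lr_max: float = 3e-4, lr_min: float = 3e-5) -> float:
    """Learning rate at `step` on a single cosine decay from lr_max to lr_min."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so steps outside [0, total_steps] stay at the endpoints.
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))
```

The rate starts at `lr_max`, passes through the midpoint of the two bounds halfway through training, and ends at `lr_min`.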
Deep Learning
Developed a convolutional autoencoder for image reconstruction, achieving 0.92 SSIM and 38.5 dB PSNR. Implemented multi-stage convolutional and upsampling layers, reducing dimensionality by 75% while preserving 90% visual information. Conducted comparative analysis across epochs, optimizing reconstruction error from 0.0156 to 0.0021.
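For reference, PSNR is derived from the mean squared error between the original and reconstructed images as 10 · log10(MAX² / MSE). A minimal sketch in plain Python, assuming pixel values flattened into sequences and scaled to [0, 1]; the function name and that value range are assumptions for illustration, not details of the project.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float], reconstruction: Sequence[float],
         max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(reconstruction) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values mean a closer reconstruction; because the scale is logarithmic, each 10 dB gain corresponds to a tenfold drop in mean squared error.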
Machine Learning
A simple repository that shows the process of building an MCP server and using Claude Desktop as a client. Features a Travel Desk system to handle employee travel requests, approvals, and history tracking, all accessible directly from Claude. Demonstrates how to modify the contents to develop specific MCP use cases.
Deep Learning
Generates blog posts with Meta's Llama 2 7B Chat model in just a few clicks. Features AI-powered blog generation, customizable writing styles (Fun, General, Professional), word count specification, and a user-friendly Streamlit-based web interface.
Machine Learning
Comprehensive analysis of online retail data demonstrating how FireDucks speeds up data processing and analysis compared to traditional methods. Showcases performance improvements in data manipulation, aggregation, and visualization for retail analytics.
Publications
Peer-reviewed work in AI, machine learning, computer vision, and Earth observation, with multiple IEEE conference appearances.
Recognition
Academic, competition, and invited speaking recognitions across research and engineering work.
Thadomal Shahani Engineering College, University of Mumbai
Awarded the Silver Medal for graduating second in the department in the B.E. Computer Engineering program.
Thadomal Shahani Engineering College, University of Mumbai
Awarded to the top 0.1% of students for academic and research excellence
DIPEX 2025
Ranked Top 30 of 5,000+ teams at DIPEX 2025 with a functional prototype, presented to investors and India's Defence Research and Development Organisation (DRDO)
IETE-SF, MPSTME, NMIMS
Secured 1st Position in the AI/ML Domain
CSI-TSEC, University of Mumbai
Led a 3-hour session on Deep Learning, covering neural networks, backpropagation, optimizers, and advanced topics including LLMs and Transformers for real-world AI applications
CSI-TSEC, University of Mumbai
Delivered a session on applying QGIS and ML concepts to solve real-world problems and support research initiatives
Value Added Course
Delivered an in-depth demonstration of real-time visualization and analytics of data from IoT sensors suspended in a water body using QGIS, highlighting temperature patterns and insights
Computer Engineering Dept., Thadomal Shahani Engineering College
Demonstrated practical applications of Machine Learning, Geospatial data, and Remote Sensing, showcasing live examples of how to use satellite imagery and tools like Google Earth Engine and QGIS to address real-world challenges
EnPoWER, IIT Bombay
Secured 3rd place among 100+ teams

Shah and Anchor Kutchhi College of Engineering
2nd Runner Up
Contact
Open to research collaborations, internship opportunities for Fall 2026, and full-time roles for 2027.
Feel free to reach out about any of these, or just for a chat about data science and ML.
kakadechaitanya77@gmail.com
Location
Philadelphia, PA, USA