Machine Learning & Earth Foundation Models

Applying AI, Deep Learning, and Geospatial Foundation Models to Ocean & Earth Science

The convergence of big geospatial data, satellite remote sensing, and modern AI is transforming how we understand and predict the Earth system. I apply classical machine learning, deep neural networks, and the latest Earth/geospatial foundation models to oceanographic and environmental research — from water quality and phytoplankton classification to land use change and coastal hazard mapping.

Machine Learning Deep Learning Foundation Models Remote Sensing AI Python / PyTorch Oceanography

Classical Machine Learning in Oceanography

Traditional ML algorithms remain powerful workhorses for oceanographic data — especially when interpretability and efficiency matter. I routinely apply these methods to satellite-derived and in situ datasets.

Ensemble Methods

Random Forest (RF) — water quality parameter estimation, land cover classification, species distribution modeling
XGBoost / Gradient Boosting — hypoxia prediction in the Gulf of Mexico, Chl-a estimation
Extra Trees, AdaBoost — feature importance ranking in multi-sensor datasets

Random Forest XGBoost Gradient Boosting
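
As a rough illustration of the ensemble workflow above, here is a minimal sketch of Random Forest regression with feature-importance ranking in scikit-learn. The band names, the synthetic reflectance values, and the target formula are all assumptions made up for the example; they are not real data or a published model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for matched satellite / in situ samples (NOT real data).
rng = np.random.default_rng(0)
n_samples = 500
band_names = ["blue", "green", "red", "nir"]  # hypothetical feature names
X = rng.uniform(0.0, 0.2, size=(n_samples, len(band_names)))

# Hypothetical target driven by a green/blue ratio plus noise.
y = 5.0 * X[:, 1] / (X[:, 0] + 0.05) + rng.normal(0.0, 0.1, n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Held-out R^2 and per-band importance (importances sum to 1).
r2 = model.score(X_test, y_test)
importances = dict(zip(band_names, model.feature_importances_))
print(f"held-out R^2: {r2:.2f}")
print(importances)
```

In this toy setup the red and NIR bands carry no signal, so their importances should come out small; with real imagery the ranking depends on the sensor and the water type.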

Regression & Classification

Support Vector Machine (SVM) — remote sensing image classification, benthic habitat mapping
Gaussian Process Regression (GPR) — spatiotemporal interpolation of ocean variables
Ridge / Lasso / Elastic Net — multivariate water quality analysis
K-Nearest Neighbors, Naive Bayes — baseline comparisons for classification tasks

SVM GPR Regression

Clustering & Dimensionality Reduction

K-Means, DBSCAN — oceanographic regime identification, eddy detection
PCA / t-SNE / UMAP — high-dimensional satellite data exploration
Hierarchical Clustering — ecological community analysis (PRIMER methods)

PCA K-Means UMAP
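
A minimal sketch of the reduce-then-cluster pattern listed above, assuming scikit-learn and two synthetic "regimes" in a 10-dimensional feature space. The data and the choice of two components and two clusters are illustrative assumptions only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Two synthetic regimes with different means (NOT real ocean data).
regime_a = rng.normal(loc=0.0, scale=1.0, size=(100, 10))
regime_b = rng.normal(loc=6.0, scale=1.0, size=(100, 10))
X = np.vstack([regime_a, regime_b])

# Standardize, project to 2 principal components, then cluster.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_2d)

print("explained variance ratio:", pca.explained_variance_ratio_)
print("cluster sizes:", np.bincount(labels))
```

Real regimes are rarely this well separated, so the number of clusters is usually checked with a silhouette score or domain knowledge rather than fixed in advance.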

Key Applications

Chlorophyll-a & water quality estimation from Sentinel-2 / Landsat imagery
Hypoxia prediction in the Gulf of Mexico using multi-variable ocean data
Land use / land cover change modeling (CA-ANN, RF, SVM)
Species distribution modeling — MaxEnt, BRT, RF
Discharge–Chl-a relationship analysis in coastal systems

Water Quality Land Cover SDM

Deep Learning

Deep neural networks can learn spatial and temporal patterns that classical methods often miss, which makes them well suited to the scale and complexity of satellite imagery, time series, and autonomous sensor data in oceanographic research.

CNN

Convolutional Neural Networks (CNN)

Image classification, object detection, semantic segmentation

I have designed and trained custom CNNs and fine-tuned pre-trained architectures for oceanographic and ecological tasks. Key work includes:

Imaging FlowCytobot (IFCB) plankton classification — CNN pipeline for automated phytoplankton species identification in the Mississippi Sound
Marine mammal & shark detection — VGG-19 + custom CNN on static imagery
Cattle behavior classification — VGG-19, custom CNN on HPC/supercomputer (MSU GRI × USDA internship, 2023)
Satellite image segmentation — land cover, surface water dynamics, delta mapping

VGG-19 ResNet Custom CNN PyTorch

RNN / LSTM

Recurrent Networks & Temporal Modeling

Time series forecasting, sequential ocean data

LSTMs and GRUs are well suited to oceanographic time series because they can capture long-range temporal dependencies in water quality records, tide gauge data, and climate indices.

Long-term water quality trend modeling in the Mississippi Sound
Sea surface temperature and chlorophyll-a forecasting
Discharge–productivity lag analysis in estuarine systems

LSTM GRU Time Series
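
Before a recurrent model can be trained, a time series has to be framed as supervised (input window, target) pairs. The helper below is a minimal NumPy sketch of that step; the function name `make_windows` and its parameters are hypothetical, and the toy series stands in for a real record.

```python
import numpy as np

def make_windows(series, lookback, horizon=1):
    """Return (X, y) where X[i] holds `lookback` consecutive values and
    y[i] is the value `horizon` steps after the end of that window."""
    series = np.asarray(series, dtype=float)
    n_windows = len(series) - lookback - horizon + 1
    if n_windows <= 0:
        raise ValueError("series is too short for this lookback and horizon")
    X = np.stack([series[i : i + lookback] for i in range(n_windows)])
    start = lookback + horizon - 1
    y = series[start : start + n_windows]
    return X, y

# Toy series 0..9 standing in for, e.g., a daily temperature record.
series = np.arange(10.0)
X, y = make_windows(series, lookback=3, horizon=1)

# Recurrent layers usually expect (samples, time steps, features).
X_seq = X[..., np.newaxis]
print(X_seq.shape, y.shape)
```

For real records, the train/validation split should be chronological so that no window in the validation set overlaps the training period.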

Transformer

Transformers & Attention Mechanisms

Vision Transformers, multi-modal learning, sequence modeling

Transformers have redefined the state of the art in both NLP and computer vision. For Earth observation, Vision Transformers (ViT) and their derivatives enable powerful multi-scale, multi-temporal analysis of satellite imagery.

Vision Transformer (ViT) for satellite scene classification
Swin Transformer for high-resolution land cover mapping
Multi-temporal attention for change detection in coastal zones
Cross-modal fusion — combining optical + SAR + LiDAR data

ViT Swin Transformer Hugging Face
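
The core operation behind these architectures is scaled dot-product attention. Below is a minimal single-head NumPy sketch, without learned projections, masking, or multiple heads; the token count and embedding size are arbitrary assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    # Subtract the row max before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 32))  # 16 patch tokens, 32-dim (hypothetical)

# Self-attention: queries, keys, and values all come from the same tokens.
out, attn = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape, attn.shape)
```

In a ViT each image patch becomes one token, so each row of `attn` describes how strongly one patch attends to every other patch in the scene.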

GAN / VAE

Generative Models

Data augmentation, super-resolution, synthetic data

GANs — synthetic satellite image generation for data-scarce regions, super-resolution downscaling of ocean model outputs
Variational Autoencoders (VAE) — anomaly detection in ocean time series (HABs, hypoxia events)
Diffusion Models — exploratory use for climate downscaling applications

GAN VAE Super-Resolution

Earth & Geospatial Foundation Models

Foundation models — large models pre-trained on massive datasets and fine-tuned for downstream tasks — are revolutionizing Earth observation and geoscience. These models enable few-shot learning, zero-shot generalization, and powerful feature extraction across diverse remote sensing data.

Prithvi (NASA × IBM)

A geospatial foundation model pre-trained on 6 years of global Harmonized Landsat Sentinel-2 (HLS) data. Prithvi uses a masked autoencoder (MAE) architecture and supports multi-temporal, multi-spectral inputs.

Flood mapping, burn scar detection, crop segmentation
Fine-tuning for coastal land cover change in Bangladesh and Gulf Coast
Multi-temporal change detection in estuarine systems

NASA / IBM HLS Data MAE Architecture HuggingFace
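
The masked autoencoder idea can be sketched in a few lines: split a tile into patches, hide most of them, and train the model to reconstruct what was hidden. The NumPy example below shows only the patching and masking step. The tile size, band count, patch size, and the 0.75 mask ratio (the default in the original MAE work) are assumptions for illustration, not Prithvi's actual configuration.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)

def random_mask(n_patches, mask_ratio, rng):
    """Return indices of visible patches and a boolean mask (True = hidden)."""
    n_keep = int(round(n_patches * (1.0 - mask_ratio)))
    keep_idx = np.sort(rng.permutation(n_patches)[:n_keep])
    mask = np.ones(n_patches, dtype=bool)
    mask[keep_idx] = False
    return keep_idx, mask

rng = np.random.default_rng(0)
image = rng.random((224, 224, 6))  # synthetic 6-band tile (hypothetical)
patches = patchify(image, patch=16)
keep_idx, mask = random_mask(len(patches), mask_ratio=0.75, rng=rng)

# Only the visible patches are passed to the encoder.
visible = patches[keep_idx]
print(patches.shape, visible.shape, int(mask.sum()))
```

Multi-temporal inputs add a time axis to the patches, but the keep-and-hide logic stays the same.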

SatMAE & Scale-MAE

Masked Autoencoders adapted for satellite imagery. SatMAE handles temporal and multi-spectral sequences; Scale-MAE leverages ground sampling distance (GSD) to achieve scale-aware feature learning.

Pre-training on Sentinel-1/2, Landsat, and Planet imagery
Zero-shot transfer to ocean color and water quality tasks
Scale-aware feature extraction across different sensor resolutions

SatMAE Scale-MAE Multi-Spectral
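
One way to picture scale-aware encoding is a sinusoidal positional encoding whose positions are scaled by the ratio of the image GSD to a reference GSD, so that patches covering the same ground distance receive the same encoding. The 1-D NumPy sketch below is a simplified illustration of that idea under those assumptions; the function name, the reference GSD of 1.0, and the dimensions are hypothetical, and this is not the reference Scale-MAE implementation.

```python
import numpy as np

def gsd_positional_encoding(n_positions, dim, gsd, reference_gsd=1.0):
    """Sinusoidal encoding with positions scaled by gsd / reference_gsd."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    # Scale patch positions by the ground distance they represent.
    pos = np.arange(n_positions)[:, None] * (gsd / reference_gsd)
    i = np.arange(dim // 2)[None, :]
    angles = pos / (10000.0 ** (2 * i / dim))
    enc = np.empty((n_positions, dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

enc_10m = gsd_positional_encoding(14, 64, gsd=10.0)  # e.g. a 10 m sensor
enc_30m = gsd_positional_encoding(14, 64, gsd=30.0)  # e.g. a 30 m sensor

# Position 3 at 10 m and position 1 at 30 m are both 30 ground units out.
print(np.allclose(enc_10m[3], enc_30m[1]))
```

The final check shows the intended property: equal ground offsets map to equal encodings regardless of sensor resolution.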

Segment Anything Model (SAM)

Meta's SAM provides promptable, zero-shot image segmentation. Adapted for remote sensing (GeoSAM, SAM-Geo), it dramatically accelerates annotation and segmentation of satellite imagery features.

Automated delineation of water bodies, wetlands, and coastal features
Rapid mangrove and seagrass bed extraction from high-res imagery
Semi-automated training data generation for supervised models
SAM 2 — video/temporal segmentation for surface water dynamics time series

SAM / SAM 2 GeoSAM Zero-Shot

Clay Foundation Model

An open-source Earth observation foundation model trained on multi-sensor satellite data (Sentinel, Landsat, NAIP, LINZ). Clay uses a Vision Transformer backbone with metadata embeddings for time, location, and sensor.

Embeddings for downstream water quality and ocean color tasks
Coastal change detection with minimal labeled data
Semantic similarity search across large satellite archives

Clay ViT Backbone Open Source Clay Docs
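
Semantic similarity search over embeddings usually comes down to a nearest-neighbor lookup under cosine similarity. The sketch below uses random vectors in place of real model embeddings; the archive size, embedding dimension, and the helper name `top_k_similar` are assumptions for the example.

```python
import numpy as np

def top_k_similar(query, archive, k=3):
    """Return indices and cosine similarities of the k closest archive rows."""
    q = query / np.linalg.norm(query)
    a = archive / np.linalg.norm(archive, axis=1, keepdims=True)
    sims = a @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]

rng = np.random.default_rng(0)

# Random stand-ins for per-tile embeddings (NOT real model output).
archive = rng.normal(size=(1000, 64))

# Query: a slightly perturbed copy of tile 123.
query = archive[123] + 0.01 * rng.normal(size=64)

idx, scores = top_k_similar(query, archive, k=3)
print(idx, scores)
```

For archives with millions of tiles, the brute-force matrix product is typically replaced by an approximate nearest-neighbor index.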

Aurora & Pangu-Weather (Atmospheric FMs)

Large-scale atmospheric and climate foundation models that challenge traditional NWP models in medium-range forecasting. Relevant for oceanography through ocean–atmosphere coupling.

Aurora (Microsoft) — 1.3B parameter atmospheric model trained on diverse reanalysis and forecast data; achieves state-of-the-art 5-day forecasts
Pangu-Weather (Huawei) — 3D Earth-specific transformer (3DEST) for global weather forecasting
FourCastNet (NVIDIA) — Adaptive Fourier Neural Operator (AFNO) model for high-resolution forecasting
Application: coupled ocean–atmosphere boundary layer analysis

Aurora Pangu-Weather FourCastNet

LLMs for Geoscience & Oceanography

Large Language Models and vision-language models (VLMs) are being adapted for scientific applications — from automated literature synthesis to multimodal satellite image Q&A.

GeoGPT / OceanGPT — domain-adapted LLMs for geoscience question answering
GPT-4o / Claude / Gemini — code generation, data analysis assistance, report drafting
CLIP / RemoteCLIP — zero-shot satellite image–text retrieval
LLaVA / InternVL — multimodal VLM for satellite image captioning and analysis
Retrieval-Augmented Generation (RAG) over scientific literature

OceanGPT RemoteCLIP RAG VLM

Tools & Frameworks

Core

Scientific Python Stack

NumPy, Pandas, SciPy, Xarray, Dask — for large n-dimensional ocean datasets; Matplotlib, Seaborn, Plotly, Cartopy — for publication-quality visualization.

NumPy Xarray Dask Cartopy

ML/DL

Machine & Deep Learning Libraries

Scikit-learn, XGBoost, LightGBM for classical ML. PyTorch and TensorFlow/Keras for deep learning. Hugging Face transformers and timm for pre-trained model access and fine-tuning.

PyTorch TensorFlow Scikit-learn Hugging Face

Geo AI

Geospatial AI & Remote Sensing Tools

TorchGeo, Segment Geospatial (samgeo), GDAL/Rasterio, Google Earth Engine Python API, PySTAC for satellite data discovery and processing pipelines.

TorchGeo samgeo GEE Python PySTAC

HPC

High-Performance Computing

Experience running ML workloads on MSU's HPC cluster (Orion / Atlas supercomputers), including SLURM job scheduling, multi-GPU training with PyTorch DDP, and parallelization with Dask and Joblib.

SLURM Multi-GPU Dask PyTorch DDP

Selected Projects & Publications

2025–2026

Ensemble ML for Water Quality in Mississippi Sound

GRI, Mississippi State University

Developed ensemble ML models (RF, XGBoost, SVR) fused with Landsat and Sentinel-2 imagery to map seasonal water quality dynamics in the Western Mississippi Sound. Achieved state-of-the-art accuracy for Chl-a, turbidity, and CDOM estimation.

Ensemble ML Landsat Water Quality Publication

2025

CNN Pipeline for Imaging FlowCytobot Plankton Classification

GRI, Mississippi State University

Designed a full CNN-based workflow for automated classification of phytoplankton and harmful algal bloom species from the Imaging FlowCytobot (IFCB) deployed in the Mississippi Sound. Integrated with real-time data pipelines.

CNN IFCB HABs PyTorch

2025

Machine Learning for Hypoxia Prediction — Gulf of Mexico

Published in Regional Studies in Marine Science

Applied and compared multiple ML algorithms (RF, XGBoost, ANN, SVM) for spatial and temporal prediction of hypoxic zones in the northern Gulf of Mexico using satellite, buoy, and cruise data.

Hypoxia Gulf of Mexico XGBoost DOI

2024

CA-ANN Modeling — Sundarbans Delta Change

Published in IEEE JSTARS

Used Cellular Automata coupled with Artificial Neural Networks to model and project land use / land cover changes and delta dynamics in the Sundarbans, Bangladesh — one of the world's largest mangrove systems.

CA-ANN Sundarbans IEEE DOI

2023

Cattle Behavior Classification — HPC Internship

GRI × USDA, MSU Supercomputer

Classified cattle behavior from video and sensor data using VGG-19 and a custom CNN, trained on MSU's High-Performance Computing (HPC) cluster. Selected as one of eight interns from universities across the United States.

VGG-19 Custom CNN HPC Project

Useful Resources & Links

Foundation Model Repositories

Learning & Courses

NASA HLS Foundation Model Workshop
NASA ARSET Remote Sensing Training
Google Earth Engine Docs
Deep Learning Specialization — deeplearning.ai
Fast.ai Practical Deep Learning for Coders
Hugging Face NLP / Vision Courses (free)