- [AGU] Curating flood extent data and leveraging citizen science for benchmarking machine learning solutions. Shubhankar Gahlot, Muthukumaran Ramasubramanian, Iksha Gurung, and 3 more authors. ESS Open Archive eprints, Apr 2022.
We present a labeled machine learning (ML) training dataset derived from Sentinel-1 C-band synthetic aperture radar (SAR) data for flood events. In this paper, we detail the steps taken to collect, pre-process, label, curate, and catalog the training dataset. We also present the development of benchmark ML models and the use of the training dataset in a data science competition.
- [AGU] Verb Sense Disambiguation for Densifying Knowledge Graphs in Earth Science. Ashish Acharya, Carson Davis, Derek Koehl, and 4 more authors. In AGU Fall Meeting Abstracts, Dec 2021.
Knowledge graphs are graphical representations of knowledge using entities and their relationships. Properly structured, they can be a powerful way to surface latent relationships among well-defined entities. By breaking down sentences into their semantic components, an Earth science corpus can be represented as a graph, with verbs acting as the relationship edges between entity nodes. This allows domain information to be captured with high precision. However, since multiple English verbs can denote the same meaning, this high precision comes at the cost of sparse connections and query results. In this presentation, we show a technique that disambiguates the meaning of a verb in a given sentence using word2vec, with the aim of consolidating it into one of a limited number of synonym sets. This leads to a denser graph and more matches for a given query.
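As a rough illustration of this consolidation step, the sketch below assigns a verb, given its sentence context, to the closest of a handful of synonym sets by comparing word2vec centroids. The pretrained model path, the toy synset inventory, and the centroid scoring rule are illustrative assumptions, not the exact pipeline behind this abstract.

```python
# Hypothetical sketch: disambiguate a verb by comparing its sentence context
# to word2vec centroids of candidate synonym sets. The synsets and model path
# below are illustrative placeholders, not the authors' actual inventory.
import numpy as np
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

SYNSETS = {  # toy synonym sets representing relationship edges
    "measure": ["measure", "quantify", "gauge"],
    "cause": ["cause", "trigger", "induce"],
}

def centroid(words):
    """Mean embedding of the in-vocabulary words."""
    vecs = [kv[w] for w in words if w in kv]
    return np.mean(vecs, axis=0)

def disambiguate(verb, context_tokens):
    """Map `verb` (in its sentence context) to the closest synonym set."""
    ctx = centroid([verb] + context_tokens)
    best, best_sim = None, -1.0
    for name, members in SYNSETS.items():
        syn = centroid(members)
        sim = float(np.dot(ctx, syn) /
                    (np.linalg.norm(ctx) * np.linalg.norm(syn)))
        if sim > best_sim:
            best, best_sim = name, sim
    return best

print(disambiguate("gauge", ["sensor", "rainfall", "intensity"]))  # -> "measure"
```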
- [AGU] Machine Learning Pipeline for Earth Science Using SageMaker. Iksha Gurung, Muthukumaran Ramasubramanian, Shubhankar Gahlot, and 3 more authors. In AGU Fall Meeting Abstracts, Dec 2021.
Machine learning has risen to the forefront of solving various problems in scientific research. It differs from traditional problem solving by modeling data with stochastic rather than deterministic processes. This presents unique challenges for successful implementation, namely data parallelization and scalable computation. While machine learning algorithms are being widely adopted across the scientific community, setting up scalable data and computation environments is increasingly becoming a barrier. Modern cloud providers offer services that address these challenges by automatically provisioning the environment, enabling scientists to focus solely on the algorithm details. In this presentation, we show how SageMaker, an AWS service that aims to accelerate ML research, can be used for an Earth science use case. We also showcase ImageLabeler, a cloud-native tool for labeling Earth science events, which simplifies importing labeled datasets into cloud environments such as SageMaker.
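For context, a SageMaker training job can be launched in a few lines with the SageMaker Python SDK. The sketch below is a minimal example assuming a PyTorch training script; the entry point, IAM role, S3 path, instance type, and hyperparameters are placeholders, not the configuration used in this work.

```python
# Minimal sketch of launching a training job via the SageMaker Python SDK.
# All names below (script, role ARN, bucket, instance type) are placeholders.
import sagemaker
from sagemaker.pytorch import PyTorch

session = sagemaker.Session()

estimator = PyTorch(
    entry_point="train.py",  # assumed training script
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role
    framework_version="1.8.1",
    py_version="py3",
    instance_count=2,               # data-parallel across two instances
    instance_type="ml.p3.2xlarge",  # one V100 GPU per instance
    hyperparameters={"epochs": 10, "batch-size": 64},
    sagemaker_session=session,
)

# Training data, e.g. labels exported from ImageLabeler, staged on S3.
estimator.fit({"train": "s3://my-bucket/flood-labels/train"})
```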
- [AGU] A Novel Machine Learning Method for Surface PM2.5 Estimations from Geostationary Satellites. George Priftis, Aaron Kaulfus, Muthukumaran Ramasubramanian, and 6 more authors. In AGU Fall Meeting Abstracts, Dec 2021.
Particulate matter (PM) with a diameter of 2.5 μm or less, known as PM2.5, affects human health as it penetrates the respiratory system. The Environmental Protection Agency (EPA) measures the atmospheric concentration of PM2.5 using air quality monitors stationed throughout the Continental United States (CONUS). Such measurements are points on a spatial domain and therefore might not be representative of the air quality in nearby areas, considering that the composition of the atmosphere is highly variable from place to place. Satellite-based aerosol optical depth (AOD) permits a spatially uniform means of estimating PM2.5, and new geostationary satellites provide AOD estimates at high temporal and spatial resolution. However, the concentration of PM2.5 depends non-linearly on other atmospheric parameters, including relative humidity, temperature, and the height of the planetary boundary layer. This information may be estimated at spatial and temporal resolutions similar to AOD from numerical models such as the National Oceanic and Atmospheric Administration's (NOAA) High Resolution Rapid Refresh (HRRR) model, which resolves near real-time atmospheric conditions over the CONUS. The estimation of PM2.5 concentration is a multi-parametric problem that must consider the temporal dependencies among the different parameters. Deep learning approaches are appropriate for such complex estimation problems as they intrinsically capture relations among multiple non-linear parameters. This study compares deep learning methods to traditional regression analysis to demonstrate their capabilities in predicting PM2.5 concentrations. Additionally, a novel ensemble learning approach is employed to identify scientific processes that could further improve the estimation of PM2.5 concentration. Utilizing Long Short-Term Memory (LSTM) neural networks, which suit multivariate time series estimation problems because they can learn long-term dependencies, individual models are created for each EPA station and trained on the aforementioned dataset collocated over that station. Individual station models are merged if doing so improves performance by reducing the root mean squared error (RMSE); this ensemble training method ultimately lowers the overall RMSE. Evaluation of these results provides insights into physical processes and related observable parameters that may contribute to PM2.5 concentrations. Parameters identified as statistically different between the merged and unmerged models are expected to improve overall performance. These new parameters are then used to re-evaluate the deep learning methods, with an extreme gradient boosting model achieving the best results (RMSE of 5.5).
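A minimal sketch of the per-station setup follows, assuming fixed-length input windows and a handful of collocated features; the window length, layer sizes, and RMSE-based merge test are illustrative choices, not the study's exact configuration.

```python
# Sketch of a per-station LSTM regressor for PM2.5 plus an RMSE-based merge
# test. Window length, feature count, and layer sizes are assumptions.
import numpy as np
import tensorflow as tf

TIMESTEPS, FEATURES = 24, 6  # e.g. AOD, RH, temperature, PBL height, ...

def build_station_model():
    return tf.keras.Sequential([
        tf.keras.layers.LSTM(64, input_shape=(TIMESTEPS, FEATURES)),
        tf.keras.layers.Dense(1),  # PM2.5 concentration
    ])

def rmse(model, x, y):
    pred = model.predict(x, verbose=0).ravel()
    return float(np.sqrt(np.mean((pred - y) ** 2)))

def fit(x, y, epochs=20):
    m = build_station_model()
    m.compile(optimizer="adam", loss="mse")
    m.fit(x, y, epochs=epochs, verbose=0)
    return m

def try_merge(x_a, y_a, x_b, y_b):
    """Merge two stations' data only if a joint model lowers total RMSE."""
    solo_a, solo_b = fit(x_a, y_a), fit(x_b, y_b)
    joint = fit(np.concatenate([x_a, x_b]), np.concatenate([y_a, y_b]))
    merged_better = (rmse(joint, x_a, y_a) + rmse(joint, x_b, y_b)
                     < rmse(solo_a, x_a, y_a) + rmse(solo_b, x_b, y_b))
    return joint if merged_better else (solo_a, solo_b)
```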
- [AGU] Leveraging citizen science and artificial intelligence for monitoring and estimating hazardous events. Shubhankar Gahlot, Muthukumaran Ramasubramanian, Iksha Gurung, and 4 more authors. In AGU Fall Meeting Abstracts, Dec 2021.
Floods and hurricanes are among the major natural disasters that cause immense damage to property and lives every year. Therefore, knowing the true extent of these disasters is crucial for emergency management and resource allocation, not only by federal agencies like the Federal Emergency Management Agency (FEMA) but also by local authorities and nonprofits. Monitoring these events in situ is difficult because operating in a disaster zone is hazardous. Remote sensing, in conjunction with machine learning, has been used extensively in the community to monitor these events, but finding a suitable machine learning solution is an exhaustive search process. Citizen science has been used widely to find the best solutions for problems in both scientific and commercial sectors. As part of incorporating citizen science into detecting and estimating the extents of natural disasters, we hosted competitions that involve the broader science community in estimating hurricane wind speeds and flood extents from satellite images. In this presentation, we discuss the methods used to generate the datasets, results from the competitions, and the lessons learned.
- [CSCI] Data optimization for large batch distributed training of deep neural networks. Shubhankar Gahlot, Junqi Yin, and Mallikarjun Arjun Shankar. In 2020 International Conference on Computational Science and Computational Intelligence (CSCI), Dec 2020.
Distributed training in deep learning (DL) is common practice as data and models grow. The current practice for distributed training of deep neural networks faces the challenges of communication bottlenecks when operating at scale and of model accuracy deterioration as the global batch size increases. Present solutions focus on improving message exchange efficiency as well as on techniques to tweak batch sizes and models during training. The loss of training accuracy typically happens because the loss function gets trapped in a local minimum. We observe that the loss landscape is shaped by both the model and the training data, and we propose a data optimization approach that uses machine learning to implicitly smooth the loss landscape, resulting in fewer local minima. Our approach filters out data points that are less important to feature learning, enabling us to train models at larger batch sizes faster and with improved accuracy.
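The abstract does not spell out the filtering criterion, but the general shape of such a data optimization step might look like the sketch below, which uses per-sample loss from a warm-up model as a stand-in importance score; the keep fraction and the scoring rule are assumptions, not necessarily the paper's method.

```python
# Illustrative sketch of importance-based data filtering before large-batch
# training. Scoring samples by per-sample loss from a warm-up model is one
# plausible heuristic, not necessarily the paper's exact criterion.
import torch
from torch.utils.data import DataLoader, Subset

def filter_dataset(model, dataset, keep_fraction=0.9, device="cuda"):
    """Drop the lowest-loss (least informative) samples from the dataset."""
    model.eval()
    loss_fn = torch.nn.CrossEntropyLoss(reduction="none")
    scores = []
    loader = DataLoader(dataset, batch_size=512, shuffle=False)
    with torch.no_grad():
        for x, y in loader:
            logits = model(x.to(device))
            scores.append(loss_fn(logits, y.to(device)).cpu())
    scores = torch.cat(scores)
    k = int(keep_fraction * len(dataset))
    keep_idx = torch.topk(scores, k).indices.tolist()  # keep high-loss samples
    return Subset(dataset, keep_idx)
```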
- [LINC] Changing the state of literacy in the Digital Age in India. Aanandita Gahlot and Shubhankar Gahlot. In Proceedings of the MIT LINC 2019 Conference, Dec 2020.
India, as an emerging economy, faces literacy challenges due to factors such as a shortage of quality academic institutions and unsuitable curricula. Digital technology is credited with the ability to bridge the gap between quality institutions and individuals and to make learning more engaging. The Indian government has put technology to use and launched the Pradhan Mantri Gramin Digital Saksharta Abhiyan (PMGDISHA) under its Digital India initiative. The scheme was initiated to make at least one individual from each household digitally literate so that they develop the skills needed to connect with the rapidly growing digital world. It targets the rural population, including disadvantaged sections of society such as minorities, people below the poverty line (BPL), women, and differently-abled people. The use of technology in education has transformed the whole system of education. This paper explores the changing state of literacy in India after the introduction of PMGDISHA.
- [IEEE/ACM DLS] Strategies to Deploy and Scale Deep Learning on the Summit Supercomputer. Junqi Yin, Shubhankar Gahlot, Nouamane Laanait, and 4 more authors. In 2019 IEEE/ACM Third Workshop on Deep Learning on Supercomputers (DLS), Nov 2019.
The rapid growth and wide applicability of Deep Learning (DL) frameworks pose challenges to computing centers, which need to deploy and support the software, and also to domain scientists, who have to keep up with the system environment and scale up scientific exploration through DL. We offer recommendations for deploying and scaling DL frameworks on the Summit supercomputer, currently atop the Top500 list, at the Oak Ridge Leadership Computing Facility (OLCF). We discuss DL software deployment in the form of containers and compare the performance of natively built frameworks against containerized deployments. Software containers show no noticeable performance penalty, exhibit faster Python load times, and promise easier maintenance. To explore strategies for scaling up DL model training campaigns, we assess DL compute kernel performance, discuss and recommend I/O data formats and staging, and identify the communication needs for scalable message exchange in DL runs at scale. We recommend that users take a step-wise tuning approach as best practice, beginning with algorithmic kernel choice, then node I/O configuration, then communications tuning. As a baseline example, we present 87% scaling efficiency for a ResNet50 training run on 1024 nodes (6144 V100 GPUs).
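For a sense of what scalable message exchange looks like in practice, the sketch below wires ResNet50 into Horovod-style data-parallel training, one common pattern on Summit-class systems; the per-GPU batch size, base learning rate, and linear scaling rule are illustrative assumptions, not the settings reported in the paper.

```python
# Sketch of data-parallel ResNet50 training with Horovod, a common pattern
# for scalable gradient allreduce on Summit-class systems. Hyperparameters
# and the linear LR scaling rule are illustrative, not the paper's settings.
import torch
import torchvision
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # one training process per GPU

model = torchvision.models.resnet50().cuda()

per_gpu_batch, base_lr = 32, 0.1  # assumed values
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=base_lr * (per_gpu_batch * hvd.size()) / 256,  # linear LR scaling
    momentum=0.9)

# Wrap the optimizer so gradients are allreduced across all ranks each step.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Start every rank from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)
```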