A Machine Learning Approach to Modeling Streamflow with Sparse Data in Ungaged Watersheds on the Wyoming Range, Wyoming, 2012–17

Scientific Investigations Report 2021-5093
By: R.R. McShane and C.A. Eddy-Miller

Abstract

Scant availability of streamflow data can impede the utility of streamflow as a variable in ecological models of aquatic and terrestrial species, especially when studying small streams in watersheds that lack streamgages. Streamflow data at fine resolution and broad extent were needed by collaborators for ecological research on small streams in several ungaged watersheds of southwestern Wyoming, where streamflow data are sparse.

To improve the utility of sparse streamflow data to ecological research in ungaged watersheds, we developed a machine learning approach in R for modeling spatially and temporally continuous monthly streamflow from 2012 through 2017 in three semiarid montane-steppe watersheds (with drainage areas of 26–55 square miles and mean elevations of 8,031–8,455 feet) on the Wyoming Range in the upper Green River Basin. A machine learning streamflow (MLFLOW) model was calibrated and validated with 971 discrete streamflow observations and 24 static and dynamic predictor variables derived from geospatial and time series data on climatic, physiographic, and anthropogenic characteristics affecting streamflow. The predictor variables were temporally and spatially conditioned to amplify the relation of predictor variables to monthly streamflow.

The MLFLOW model had satisfactory agreement between observed and predicted streamflow (coefficient of determination [R²]=0.80, Nash-Sutcliffe efficiency [NSE]=0.79, NSE with log-transformed data [logNSE]=0.82, and percent bias [PBIAS]=0.7 percent). NSE and logNSE indicated the MLFLOW model performed equally well for high and low flows, and PBIAS indicated the MLFLOW model did not overpredict or underpredict monthly streamflow. Streamflow predictions seemed to represent the annual hydrograph well within the study area during the study period.
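The goodness-of-fit statistics named above can be illustrated with a minimal sketch. The report's workflow was developed in R; the Python functions below are an independent illustration, not the authors' code. The function names and the toy flow values are hypothetical. Two conventions are assumed here because the abstract does not state them: R² is taken as the squared Pearson correlation, and PBIAS is signed so that positive values indicate overprediction (some references use the opposite sign).

```python
import math

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus squared error over variance of obs."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den

def log_nse(obs, sim):
    """NSE on natural-log-transformed flows (flows must be strictly positive)."""
    return nse([math.log(o) for o in obs], [math.log(s) for s in sim])

def pbias(obs, sim):
    """Percent bias; ASSUMED sign convention: positive means overprediction."""
    return 100.0 * sum(s - o for o, s in zip(obs, sim)) / sum(obs)

def r_squared(obs, sim):
    """ASSUMED definition: squared Pearson correlation of obs and sim."""
    n = len(obs)
    mean_o, mean_s = sum(obs) / n, sum(sim) / n
    cov = sum((o - mean_o) * (s - mean_s) for o, s in zip(obs, sim))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_s = sum((s - mean_s) ** 2 for s in sim)
    return cov * cov / (var_o * var_s)

# Hypothetical monthly flows, for illustration only (not data from the report).
observed = [2.0, 4.0, 8.0, 16.0]
predicted = [2.0, 5.0, 7.0, 16.0]

print(nse(observed, predicted), log_nse(observed, predicted))
print(pbias(observed, predicted), r_squared(observed, predicted))
```

Because NSE weights squared errors in the original units, it is dominated by high flows, whereas the log transform gives low flows more weight. Reporting both is what allows a comparison of performance across the flow range.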

The most important variables (statistically important in the MLFLOW model) for explaining monthly streamflow were temporally and spatially conditioned dynamic climatic variables, mostly precipitation and snow water equivalent. Importance of the static and dynamic variables did not differ substantially among the three watersheds but differed considerably among the 6 years. Monthly streamflow increased with increasing precipitation, snow water equivalent, and drainage area but decreased with increasing forest cover, elevation, evapotranspiration, and temperature.
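One common, model-agnostic way to rank predictor variables is permutation importance: shuffle one variable, rerun the predictions, and record how much the fit degrades. The abstract does not say which importance measure the MLFLOW model used, so the sketch below is an assumption offered only to illustrate the idea. The `predict` function, the column names, and the values are all hypothetical.

```python
import random

def nse(obs, sim):
    """Nash-Sutcliffe efficiency of simulated versus observed values."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den

def permutation_importance(predict, rows, obs, column, n_repeats=20, seed=0):
    """Mean drop in NSE when one predictor column is shuffled across rows."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    baseline = nse(obs, [predict(r) for r in rows])
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Copy each row, replacing only the column under test.
        permuted = [dict(r, **{column: v}) for r, v in zip(rows, shuffled)]
        drops.append(baseline - nse(obs, [predict(r) for r in permuted]))
    return sum(drops) / n_repeats

# Hypothetical stand-in for a fitted model: flow depends only on precipitation.
def predict(row):
    return 2.0 * row["precip"]

# Hypothetical predictor rows and matching "observed" flows.
rows = [{"precip": p, "forest": f}
        for p, f in [(1.0, 0.2), (2.0, 0.5), (3.0, 0.1), (4.0, 0.9), (5.0, 0.4)]]
obs = [predict(r) for r in rows]

importance_precip = permutation_importance(predict, rows, obs, "precip")
importance_forest = permutation_importance(predict, rows, obs, "forest")
print(importance_precip, importance_forest)
```

In this toy setup, shuffling the variable the stand-in model ignores leaves the fit unchanged, so its importance is zero, while shuffling the variable that drives the predictions lowers NSE.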

The MLFLOW model was most sensitive to selection of dynamic climatic variables. Unconditioned dynamic climatic variables alone explained 54 percent of the variance (R²=0.54) in monthly streamflow, whereas adding static physiographic and anthropogenic variables explained only 12 percent more of the variance (R²=0.66). Also, spatial conditioning of all variables together with temporal conditioning of dynamic variables increased the variance explained in the MLFLOW model by another 14 percent (R²=0.80). The MLFLOW model also had greater sensitivity to temporal than to spatial differences in the data. Whether the model was trained with observations from all watersheds and years or with one watershed or 1 year left out sequentially, performance was better when testing on observations from each watershed than from each year separately. Also, performance was better for models fitted to fewer sites than to fewer months of observations.
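The leave-one-watershed-out and leave-one-year-out comparison can be sketched as follows. This is a minimal Python illustration of the splitting scheme only, not the report's R implementation. The records are hypothetical, and a simple training-mean predictor stands in for the machine learning model so that the example stays self-contained.

```python
def leave_one_group_out(groups):
    """Yield (held_out, train_idx, test_idx) for each distinct group label."""
    for held_out in sorted(set(groups)):
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train_idx, test_idx

# Hypothetical records: (watershed, year, flow). Not data from the report.
records = [
    ("A", 2012, 3.0), ("A", 2013, 5.0),
    ("B", 2012, 2.0), ("B", 2013, 6.0),
    ("C", 2012, 4.0), ("C", 2013, 7.0),
]

def held_out_error(records, key_index):
    """Mean absolute error per held-out group, using a training-mean predictor
    as a placeholder for the real model."""
    groups = [r[key_index] for r in records]
    errors = {}
    for held_out, train_idx, test_idx in leave_one_group_out(groups):
        train_mean = sum(records[i][2] for i in train_idx) / len(train_idx)
        abs_errors = [abs(records[i][2] - train_mean) for i in test_idx]
        errors[held_out] = sum(abs_errors) / len(abs_errors)
    return errors

by_watershed = held_out_error(records, 0)  # leave one watershed out
by_year = held_out_error(records, 1)       # leave one year out
print(by_watershed)
print(by_year)
```

Grouping the splits by watershed tests how well the model transfers in space, and grouping by year tests transfer in time. Comparing the two sets of held-out errors is what supports a statement about sensitivity to temporal versus spatial differences.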

The greatest utility of the modeling approach is the ease of use and the speed of processing input data, running the model, and interpreting the model output, whereas the greatest limitation is the need for spatially and temporally representative streamflow observations to drive the model. Although familiarity with R is necessary, only a working knowledge of hydrology (for selecting appropriate predictor variables and evaluating the quality of streamflow observations) and a rudimentary understanding of machine learning models are needed. Therefore, this modeling approach is practicable for other scientists who work with water but who are not hydrologists.

Suggested Citation

McShane, R.R., and Eddy-Miller, C.A., 2021, A machine learning approach to modeling streamflow with sparse data in ungaged watersheds on the Wyoming Range, Wyoming, 2012–17: U.S. Geological Survey Scientific Investigations Report 2021–5093, 29 p., https://doi.org/10.3133/sir20215093.

ISSN: 2328-0328 (online)

Table of Contents

  • Acknowledgments
  • Abstract
  • Introduction
  • Methods for Machine Learning Approach to Modeling Streamflow
  • Results of Machine Learning Approach to Modeling Streamflow
  • Summary
  • References Cited

Publication type: Report
Publication subtype: USGS Numbered Series
Title: A machine learning approach to modeling streamflow with sparse data in ungaged watersheds on the Wyoming Range, Wyoming, 2012–17
Series title: Scientific Investigations Report
Series number: 2021-5093
DOI: 10.3133/sir20215093
Year published: 2021
Language: English
Publisher: U.S. Geological Survey
Publisher location: Reston, VA
Contributing office(s): WY-MT Water Science Center
Description: Report: viii, 29 p.; Data Release; Dataset
Country: United States
State: Wyoming
Other geospatial: Wyoming Range
Online only (Y/N): Y