A machine learning approach to modeling streamflow with sparse data in ungaged watersheds on the Wyoming Range, Wyoming, 2012–17

Ryan R. McShane; Cheryl A. Eddy-Miller

doi:10.3133/sir20215093

Scientific Investigations Report 2021-5093

Links

Document: Report (2.75 MB pdf), XML
Dataset: U.S. Geological Survey National Water Information System database — USGS water data for the Nation
Data Release: USGS data release — Input data, model output, and R scripts for a machine learning streamflow model on the Wyoming Range, Wyoming, 2012–17

Abstract

Scant availability of streamflow data can impede the utility of streamflow as a variable in ecological models of aquatic and terrestrial species, especially when studying small streams in watersheds that lack streamgages. Streamflow data at fine resolution and broad extent were needed by collaborators for ecological research on small streams in several ungaged watersheds of southwestern Wyoming, where streamflow data are sparse.

To improve the utility of sparse streamflow data to ecological research in ungaged watersheds, we developed a machine learning approach in R for modeling spatially and temporally continuous monthly streamflow from 2012 through 2017 in three semiarid montane-steppe watersheds (with drainage areas of 26–55 square miles and mean elevations of 8,031–8,455 feet) on the Wyoming Range in the upper Green River Basin. A machine learning streamflow (MLFLOW) model was calibrated and validated with 971 discrete streamflow observations and 24 static and dynamic predictor variables derived from geospatial and time series data on climatic, physiographic, and anthropogenic characteristics affecting streamflow. The predictor variables were temporally and spatially conditioned to amplify the relation of predictor variables to monthly streamflow.

The MLFLOW model had satisfactory agreement between observed and predicted streamflow (coefficient of determination [R²]=0.80, Nash-Sutcliffe efficiency [NSE]=0.79, NSE with log-transformed data [logNSE]=0.82, and percent bias [PBIAS]=0.7 percent). The similar NSE and logNSE values indicated the MLFLOW model performed equally well for high and low flows, and the small PBIAS indicated the MLFLOW model did not substantially overpredict or underpredict monthly streamflow. Streamflow predictions appeared to represent the annual hydrograph well within the study area during the study period.
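The four agreement statistics named above can be computed from paired observed and predicted flows. The following is a minimal Python sketch for illustration only; it is not the report's R code. The function names are hypothetical, R² is assumed to be the squared Pearson correlation, and PBIAS is assumed to follow the convention in which positive values mean overprediction (some references use the opposite sign). The log transform assumes strictly positive flows.

```python
import math


def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - SSE / (sum of squared deviations of obs from its mean)."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    ss_obs = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / ss_obs


def log_nse(obs, sim):
    """NSE on log-transformed flows, which gives low flows more weight (flows must be > 0)."""
    return nse([math.log(o) for o in obs], [math.log(s) for s in sim])


def pbias(obs, sim):
    """Percent bias; the sign convention here (positive = overprediction) is an assumption."""
    return 100.0 * sum(s - o for o, s in zip(obs, sim)) / sum(obs)


def r_squared(obs, sim):
    """Coefficient of determination, taken here as the squared Pearson correlation."""
    n = len(obs)
    mean_obs, mean_sim = sum(obs) / n, sum(sim) / n
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim))
    var_obs = sum((o - mean_obs) ** 2 for o in obs)
    var_sim = sum((s - mean_sim) ** 2 for s in sim)
    return cov * cov / (var_obs * var_sim)


# Made-up flows for illustration; these are not data from the report.
observed = [1.0, 2.0, 3.0, 4.0]
doubled = [2.0, 4.0, 6.0, 8.0]
print(nse(observed, doubled), pbias(observed, doubled), r_squared(observed, doubled))
```

In the made-up example, predictions that are exactly double the observations are perfectly correlated with them (R² of 1) yet have a strongly negative NSE and a PBIAS of 100 percent, which is why the statistics are reported together rather than singly.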

The most important variables (statistically important in the MLFLOW model) for explaining monthly streamflow were temporally and spatially conditioned dynamic climatic variables, mostly precipitation and snow water equivalent. Importance of the static and dynamic variables did not differ substantially among the three watersheds but differed considerably among the 6 years. Monthly streamflow increased with increasing precipitation, snow water equivalent, and drainage area but decreased with increasing forest cover, elevation, evapotranspiration, and temperature.
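The abstract does not say how variable importance was measured. One common, model-agnostic option is permutation importance: shuffle one predictor at a time and record how much the prediction error grows. The Python sketch below illustrates that general technique under stated assumptions; it is not the report's method, and the function names, the toy model, and the data are all hypothetical.

```python
import random


def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, rows, y, n_repeats=20, seed=0):
    """Mean increase in MSE when each predictor column is shuffled in turn.

    predict: function mapping one row (a list of predictor values) to a prediction.
    rows:    list of rows, each a list of predictor values.
    """
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in rows])
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)  # break the link between predictor j and the response
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            error = mean_squared_error(y, [predict(row) for row in shuffled])
            increases.append(error - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


# Toy stand-in for a fitted model: the response depends only on the first predictor.
def toy_model(row):
    return 2.0 * row[0]


predictors = [[1.0, 5.0], [2.0, 3.0], [3.0, 8.0], [4.0, 1.0]]
response = [2.0, 4.0, 6.0, 8.0]
importance = permutation_importance(toy_model, predictors, response)
print(importance)
```

A predictor the model ignores gets an importance of zero, because shuffling it leaves every prediction unchanged.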

The MLFLOW model was most sensitive to selection of dynamic climatic variables. Unconditioned dynamic climatic variables alone explained 54 percent of the variance (R²=0.54) in monthly streamflow, whereas adding static physiographic and anthropogenic variables explained only 12 percent more of the variance (R²=0.66). Also, spatial conditioning of all variables together with temporal conditioning of dynamic variables increased the variance explained in the MLFLOW model by another 14 percent (R²=0.80). The MLFLOW model also had greater sensitivity to temporal than to spatial differences in the data. Whether the model was trained with observations from all watersheds and years, or with one watershed or 1 year left out sequentially, performance was better when tested on observations from each watershed than on observations from each year. Also, performance was better for models fitted to fewer sites than to fewer months of observations.
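Leaving out one watershed or 1 year at a time, as described above, amounts to a leave-one-group-out split of the observations. The Python sketch below shows only the splitting logic; it is not the report's R code, and the function name, watershed labels, and records are hypothetical.

```python
def leave_one_group_out(groups):
    """Yield (held_out_label, train_indices, test_indices) for each distinct group label."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Hypothetical labels for six observations: two years in each of three watersheds.
watershed = ["A", "A", "B", "B", "C", "C"]
year = [2012, 2013, 2012, 2013, 2012, 2013]

by_watershed = list(leave_one_group_out(watershed))  # tests spatial transfer
by_year = list(leave_one_group_out(year))            # tests temporal transfer

for label, train, test in by_watershed:
    print("hold out watershed", label, "train on", train, "test on", test)
```

Fitting a model on each training set and scoring it on the matching held-out set gives one performance value per watershed and one per year, which can then be compared to see whether spatial or temporal differences matter more.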

The greatest utility of the modeling approach is the ease of use and the speed of processing input data, running the model, and interpreting the model output, whereas the greatest limitation is the need for spatially and temporally representative streamflow observations to drive the model. Although familiarity with R is necessary, only a working knowledge of hydrology (for selecting appropriate predictor variables and evaluating the quality of streamflow observations) and a rudimentary understanding of machine learning models are needed. Therefore, this modeling approach is practicable for other scientists who work with water but who are not hydrologists.

Suggested Citation

McShane, R.R., and Eddy-Miller, C.A., 2021, A machine learning approach to modeling streamflow with sparse data in ungaged watersheds on the Wyoming Range, Wyoming, 2012–17: U.S. Geological Survey Scientific Investigations Report 2021–5093, 29 p., https://doi.org/10.3133/sir20215093.

ISSN: 2328-0328 (online)

Additional publication details
Publication type	Report
Publication Subtype	USGS Numbered Series
Title	A machine learning approach to modeling streamflow with sparse data in ungaged watersheds on the Wyoming Range, Wyoming, 2012–17
Series title	Scientific Investigations Report
Series number	2021-5093
DOI	10.3133/sir20215093
Year Published	2021
Language	English
Publisher	U.S. Geological Survey
Publisher location	Reston, VA
Contributing office(s)	WY-MT Water Science Center
Description	Report: viii, 29 p.; Data Release; Dataset
Country	United States
State	Wyoming
Other Geospatial	Wyoming Range
Online Only (Y/N)	Y
