An exploratory Bayesian network for estimating the magnitudes and uncertainties of selected water-quality parameters at streamgage 03374100 White River at Hazleton, Indiana, from partially observed data
Scientific Investigations Report 2018-5053
National Water-Quality Program
- David J. Holtschlag
An exploratory discrete Bayesian network (BN) was developed to assess the potential of this type of model for estimating the magnitudes and uncertainties of an arbitrary subset of unmeasured water-quality parameters given the measured complement of parameters historically measured at a U.S. Geological Survey streamgage. Water-quality data for 27 water-quality parameters from 596 discrete measurements at U.S. Geological Survey streamgage 03374100 White River at Hazleton, Indiana, were used to develop this BN. Data for each of the water-quality parameters were discretized into five intervals based on the quintiles of the measured values. The 596 discrete measurements were randomly partitioned into a training set with 80 percent of the data and a testing set with 20 percent of the data to identify, estimate, and assess the training and testing accuracy of the Bayesian network.
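The discretization and partitioning steps described above can be illustrated with a minimal sketch. It assumes synthetic lognormal values in place of the report's measurements, and the function names, the random seed, the "inclusive" quantile method, and the rounding of the 80-percent share are illustrative choices, not details taken from the report.

```python
import bisect
import random
import statistics


def quintile_edges(values):
    """Return the four cut points that split values into five quintile intervals."""
    return statistics.quantiles(values, n=5, method="inclusive")


def discretize(value, edges):
    """Map a value to an interval index 0..4 by counting the cut points at or below it."""
    return bisect.bisect_right(edges, value)


def train_test_split(records, train_fraction=0.8, seed=0):
    """Randomly partition records into a training list and a testing list."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Synthetic stand-in for one water-quality parameter (assumption, not report data).
rng = random.Random(42)
values = [rng.lognormvariate(0.0, 1.0) for _ in range(596)]

edges = quintile_edges(values)
levels = [discretize(v, edges) for v in values]
train, test = train_test_split(levels)
print(len(edges), len(train), len(test))
```

Because the cut points are quintiles of the measured values, each of the five intervals receives close to one-fifth of the measurements, which is what makes a 20-percent rate the natural reference for random assignment.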
A BN with 28 nodes was formed from the 27 water-quality parameters and the month of sample collection. Based on data in the training set, a network with 53 directed edges and month as the target node was identified by minimizing the negative log-likelihood function for all nodes treated, in turn, as the target variable. The edge structure determines the number and magnitude of elements in conditional probability tables associated with all nodes.
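As a rough illustration of how edges determine conditional probability tables, and of the negative log-likelihood used to compare structures, the sketch below fits a table for a single child node with a single parent. The helper names, the add-one smoothing, and the tiny set of (month, interval) pairs are assumptions made for illustration; the report's structure-learning procedure is not reproduced here.

```python
import math
from collections import Counter


def cpt_size(n_levels, parent_level_counts):
    """Number of entries in a node's conditional probability table."""
    return n_levels * math.prod(parent_level_counts)


def fit_cpt(pairs, n_levels=5, alpha=1.0):
    """Estimate P(child | parent) from (parent, child) pairs with add-alpha smoothing."""
    joint = Counter(pairs)
    parent_totals = Counter(parent for parent, _ in pairs)
    return {
        p: [(joint[(p, c)] + alpha) / (parent_totals[p] + alpha * n_levels)
            for c in range(n_levels)]
        for p in sorted(parent_totals)
    }


def negative_log_likelihood(pairs, cpt):
    """Sum of -log P(child | parent) over the observed pairs."""
    return -sum(math.log(cpt[p][c]) for p, c in pairs)


# Hypothetical training pairs: (month, interval index of one parameter).
train_pairs = [(1, 0), (1, 0), (1, 1), (7, 4), (7, 4), (7, 3), (7, 4)]
cpt = fit_cpt(train_pairs)
nll = negative_log_likelihood(train_pairs, cpt)

# A five-interval node with no parents, with month as a parent,
# and with month plus another five-interval parameter as parents.
print(cpt_size(5, []), cpt_size(5, [12]), cpt_size(5, [12, 5]))
print(round(nll, 3))
```

Each added parent multiplies the table size by that parent's number of levels, so a sparse edge structure keeps the number of probabilities to be estimated from a few hundred measurements manageable.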
The effectiveness of the BN was assessed on the basis of correct classification rates to one of the five discrete intervals, which were computed separately for the training and testing datasets and for two conditioning variable sets. The selected sets of conditioning variables represent two of many possible sets of measured parameters on which to base estimates of unmeasured parameters. The first set includes only the month of sample collection (month), and an expanded set includes month and six other continuously measurable parameters, referred to as the ContMeasSet, all of which were obtained from the discrete data.
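The correct classification rate can be sketched for the simplest case, in which month is the only conditioning variable. This example predicts the most frequent interval for each month, which is a single-edge simplification rather than the report's 28-node network; the function names and the small training and testing lists are hypothetical.

```python
from collections import Counter, defaultdict


def fit_mode_by_month(train):
    """For each month, find the most frequent interval in the training records."""
    counts = defaultdict(Counter)
    for month, interval in train:
        counts[month][interval] += 1
    return {m: c.most_common(1)[0][0] for m, c in counts.items()}


def correct_classification_rate(records, mode_by_month):
    """Fraction of records whose observed interval equals the predicted interval."""
    hits = sum(1 for month, interval in records
               if mode_by_month.get(month) == interval)
    return hits / len(records)


# Hypothetical (month, interval index) records, not report data.
train = [(1, 0), (1, 0), (1, 1), (7, 4), (7, 4), (7, 3)]
test = [(1, 0), (1, 2), (7, 4), (7, 4)]

model = fit_mode_by_month(train)
train_rate = correct_classification_rate(train, model)
test_rate = correct_classification_rate(test, model)

# With five equally likely intervals, random assignment is correct 1 time in 5.
RANDOM_BASELINE = 1 / 5
print(train_rate > RANDOM_BASELINE, test_rate > RANDOM_BASELINE)
```

Computing the rate separately on records that were not used for fitting is what allows a gap between training and testing accuracy to be read as a sign of overfitting.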
Results indicated that the training dataset had average correct classification rates of 41.7 and 61.2 percent when conditioned on the month and ContMeasSet sets, respectively. The testing dataset had somewhat lower average correct classification rates of 40.8 and 56.5 percent for the two conditioning variable sets. When conditioned on month only, the average correct classification rate for the testing dataset was only slightly lower than that for the training dataset, indicating little model overfitting. When using the ContMeasSet, however, the average decrease in accuracy between training and testing sets was 4.7 percentage points. Nevertheless, for both datasets and both sets of conditioning variables, the results indicate that the BN would substantially outperform a random assignment model, which would be expected to have a 20-percent correct classification rate. In addition, the edge structure of the BN depicts how information can flow through the network, which may help prioritize parameters for measurement to facilitate estimation of unmeasured parameters. Finally, extension of a static BN, like the one developed in this report, to a dynamic BN may provide a basis for using high-frequency or continuous water-quality data to extend information in time between discrete water-quality samples, and this integration could mitigate some of the limitations of high-frequency and discrete water-quality sampling methods.
Holtschlag, D.J., 2018, An exploratory Bayesian network for estimating the magnitudes and uncertainties of selected water-quality parameters at streamgage 03374100 White River at Hazleton, Indiana, from partially observed data: U.S. Geological Survey Scientific Investigations Report 2018–5053, 30 p., https://doi.org/10.3133/sir20185053.
ISSN: 2328-0328 (online)
Table of Contents
- Methods of Bayesian Network Analysis
- Implementing a Bayesian Network for Water-Quality Data
- Computing Magnitudes and Uncertainties of Selected Parameters
- Classification Rates for the Bayesian Network
- Application Potential
- Summary and Conclusions
- References Cited
Additional publication details
- Publication Subtype: USGS Numbered Series
- Title: An exploratory Bayesian network for estimating the magnitudes and uncertainties of selected water-quality parameters at streamgage 03374100 White River at Hazleton, Indiana, from partially observed data
- Series title: Scientific Investigations Report
- Series number: 2018-5053
- Year Published: 2018
- Publisher: U.S. Geological Survey
- Publisher location: Reston, VA
- Contributing office(s): National Water Quality Program
- Description: Report: vii, 30 p.; Data release
- Country: United States
- Other Geospatial: White River