Online Program

Thursday, May 17
Machine Learning Applications
Thu, May 17, 6:15 PM - 7:15 PM
Regency Ballroom B
 

Predicting Human Alteration of River and Stream Salinity Using Random Forest Models (304722)

Presentation

Steven Kim, California State University, Monterey Bay, Department of Mathematics and Statistics 
John R. Olson, California State University, Monterey Bay, School of Natural Science 
*Franco Alexis Sanchez, California State University, Monterey Bay, Department of Mathematics and Statistics  

Keywords: predictive modeling, random forest, principal component analysis, variable selection, water chemistry, salinity, specific conductivity

Salt concentrations in streams (measured as specific conductivity [SC] in µS/cm) are essential for assessing the aquatic conditions of our national river systems. SC values > 3000 µS/cm indicate salt pollution, which degrades environmental conditions and infrastructure. Our two objectives were to (1) evaluate how changes in watershed SC are related to natural landscape features and human disturbance across the contiguous U.S. using an empirical model and (2) assess the effects of drought on SC. We calculated human alteration of SC by subtracting previously modeled estimates of naturally occurring SC from monthly SC observations (n = 1,082,181) from January 2000 to December 2015. Each observed alteration was then matched to 131 predictors characterizing upstream natural and human environmental factors. We modeled the association between alteration and environment using a random forest model (ntrees = 500) in R. Because of limited computing power, we used only 10% of the processed data (n = 68,513, p = 131, chosen by a spatially stratified random sample) in our initial model. We used a principal component analysis (PCA) to reduce the number of variables retained in the final model. The PCA axes were rotated using varimax rotation, and the variables with the highest loading and the strongest univariate relationship on each axis were then selected. We then examined partial dependence plots and selected the strongest predictors for the model (p = 43). The chosen predictive model had reasonable performance [R^2 = 0.664, RMSE = 0.666]. We validated this test model using external validation data. We can now map predicted alteration and have identified the human activities that had the most impact on salinity across the nation. Current limitations lie in computing power, which may be addressed with high-end computers or R packages optimized for big data. These are the authors' views and do not necessarily represent views or policies of U.S. EPA.
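
As an illustration of the quantities named in the abstract, the following is a minimal sketch of how alteration (observed SC minus modeled natural SC) and the two reported performance measures (R^2 and RMSE) could be computed. It is written in Python rather than the authors' R, and it is not their code. The function names and all numeric values are hypothetical, and the abstract does not state which R^2 definition or measurement scale was used, so the common 1 - SS_res/SS_tot form is assumed here.

```python
import math


def alteration(observed_sc, natural_sc):
    """Human alteration of SC: observed SC minus modeled natural SC (same units)."""
    return [obs - nat for obs, nat in zip(observed_sc, natural_sc)]


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical values in µS/cm, made up purely for illustration.
observed = [450.0, 1200.0, 3100.0, 800.0]
natural = [400.0, 900.0, 1500.0, 750.0]

alt = alteration(observed, natural)

# Hypothetical model predictions of alteration for the same four sites.
predicted_alt = [100.0, 250.0, 1500.0, 100.0]

model_rmse = rmse(alt, predicted_alt)
model_r2 = r_squared(alt, predicted_alt)

print("alteration:", alt)
print("RMSE:", round(model_rmse, 3))
print("R^2:", round(model_r2, 4))
```

In practice the predictions would come from the fitted random forest evaluated on held-out or external validation data, as the abstract describes; the hand-written list above only stands in for that output.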