The aim is to predict which parameter values will cause JULES-ES-1.0 to fail to produce a credible land surface simulation.
JULES-ES-1.0 has a number of parameters, with uncertain values, that can be changed to alter the behaviour of the model and the resulting land surface simulation.
Choosing random but initially plausible values of these parameters can lead the model to crash (denoted CRASHED), or to produce land surface simulations that contain very little or no carbon (denoted LOWCARBON). Knowing which regions of parameter space lead to plausible simulations would help us avoid unnecessary computational expense, and would help us understand which regions to include when making plausible projections of the future land surface.
Data are from McNeall et al. (2024), Constraining the carbon cycle of JULES-ES-1.0.
A useful tutorial for random forests in R:
https://www.r-bloggers.com/2021/04/random-forest-in-r/
set.seed(42)
library(RColorBrewer)
library(e1071)
library(MASS)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
# Summary data frame
load('data/wave0_summary.RData')
# Number of input parameters
d <- ncol(X)
# Indices of ensemble members that produced low-carbon or crashed simulations
lowcarbon_ix <- which(Y_char == 'LOWCARBON')
crashed_ix <- which(Y_char == 'CRASHED')
x1 <- X[lowcarbon_ix, ]
x2 <- X[crashed_ix, ]
XY <- rbind(x1, x2)
# Pairs plot of the failed ensemble members: LOWCARBON (blue) in the lower
# panels and CRASHED (red) in the upper panels.
pairs(XY,
      lower.panel = function(x, y, ...) {
        # The first nrow(x1) rows of XY are the LOWCARBON members
        Xx <- x[seq_len(nrow(x1))]
        Xy <- y[seq_len(nrow(x1))]
        points(Xx, Xy, col = 'blue', pch = 19, cex = 0.8)
      },
      upper.panel = function(x, y, ...) {
        # The remaining rows are the CRASHED members
        Yx <- x[(nrow(x1) + 1):length(x)]
        Yy <- y[(nrow(x1) + 1):length(y)]
        points(Yx, Yy, col = 'red', pch = 19, cex = 0.8)
      },
      gap = 0,
      xlim = c(0, 1), ylim = c(0, 1),
      labels = 1:d,
      oma = c(2, 18, 2, 2)) # wide left outer margin leaves room for the legend
reset <- function() {
  # Allows annotation of graphs; resets the device to a single full-window set of axes
  par(mfrow = c(1, 1), oma = rep(0, 4), mar = rep(0, 4), new = TRUE)
  plot(0:1, 0:1, type = "n", xlab = "", ylab = "", axes = FALSE)
}
reset()
legend('left', legend = paste(1:d, colnames(X)), cex = 1.1, bty = 'n')
legend('topleft', pch = 19, col = c('red', 'blue'), legend = c('CRASHED', 'LOWCARBON'), bty = 'n', inset = 0.02, cex = 1.1 )
## Split the data into train and test sets
data_df <- data.frame(X, y = as.factor(Y_char))
# First 399 ensemble members for training, final 100 for testing
ix_train <- 1:399
ix_test <- 400:499
train <- data_df[ix_train, ]
test <- data_df[ix_test, ]
There are a number of ways to alter this random forest to try to improve it. A simple one is to tune mtry, the number of variables tried at each split, away from the default of sqrt(ncol(X)).
In this version, we’ve stratified and oversampled the rarer classes in order to better highlight CRASHED cases, which tend not to be predicted at all by a standard random forest. A quick check of the default mtry and of the class balance that motivates the stratification is sketched below.
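A minimal sketch (nothing here is tuned; the exact counts depend on the loaded ensemble):
# Default mtry for classification is floor(sqrt(number of inputs))
floor(sqrt(ncol(X)))
# Class balance of the training set: CRASHED and LOWCARBON are much rarer than RAN
table(train$y)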
# Stratified sampling: each tree draws 10 members from each class
# (CRASHED, LOWCARBON, RAN), oversampling the rare failure classes.
rf <- randomForest(y ~ ., data = train, mtry = 13, proximity = TRUE,
                   strata = train$y, sampsize = c(10, 10, 10))
print(rf)
##
## Call:
## randomForest(formula = y ~ ., data = train, mtry = 13, proximity = TRUE, strata = train$y, sampsize = c(10, 10, 10))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 13
##
## OOB estimate of error rate: 18.05%
## Confusion matrix:
## CRASHED LOWCARBON RAN class.error
## CRASHED 6 2 10 0.6666667
## LOWCARBON 8 46 5 0.2203390
## RAN 31 16 275 0.1459627
p1 <- predict(rf, train)
confusionMatrix(p1, train$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction CRASHED LOWCARBON RAN
## CRASHED 18 1 20
## LOWCARBON 0 58 12
## RAN 0 0 290
##
## Overall Statistics
##
## Accuracy : 0.9173
## 95% CI : (0.8858, 0.9424)
## No Information Rate : 0.807
## P-Value [Acc > NIR] : 6.977e-10
##
## Kappa : 0.7841
##
## Mcnemar's Test P-Value : 3.221e-07
##
## Statistics by Class:
##
## Class: CRASHED Class: LOWCARBON Class: RAN
## Sensitivity 1.00000 0.9831 0.9006
## Specificity 0.94488 0.9647 1.0000
## Pos Pred Value 0.46154 0.8286 1.0000
## Neg Pred Value 1.00000 0.9970 0.7064
## Prevalence 0.04511 0.1479 0.8070
## Detection Rate 0.04511 0.1454 0.7268
## Detection Prevalence 0.09774 0.1754 0.7268
## Balanced Accuracy 0.97244 0.9739 0.9503
The random forest has trouble correctly predicting the CRASHED ensemble members.
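One option, not pursued further here, is to trade some accuracy on RAN for better recall of CRASHED by lowering the voting cutoff for that class. A hedged sketch, with illustrative (untuned) cutoff values:
# Lower the share of votes needed to predict CRASHED (values are illustrative);
# cutoffs are given in the order of the factor levels: CRASHED, LOWCARBON, RAN
p1_cut <- predict(rf, train, cutoff = c(0.2, 0.4, 0.4))
confusionMatrix(p1_cut, train$y)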
p2 <- predict(rf, test)
confusionMatrix(p2, test$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction CRASHED LOWCARBON RAN
## CRASHED 2 4 12
## LOWCARBON 2 16 4
## RAN 2 2 56
##
## Overall Statistics
##
## Accuracy : 0.74
## 95% CI : (0.6427, 0.8226)
## No Information Rate : 0.72
## P-Value [Acc > NIR] : 0.37481
##
## Kappa : 0.489
##
## Mcnemar's Test P-Value : 0.03713
##
## Statistics by Class:
##
## Class: CRASHED Class: LOWCARBON Class: RAN
## Sensitivity 0.3333 0.7273 0.7778
## Specificity 0.8298 0.9231 0.8571
## Pos Pred Value 0.1111 0.7273 0.9333
## Neg Pred Value 0.9512 0.9231 0.6000
## Prevalence 0.0600 0.2200 0.7200
## Detection Rate 0.0200 0.1600 0.5600
## Detection Prevalence 0.1800 0.2200 0.6000
## Balanced Accuracy 0.5816 0.8252 0.8175
We can see that, even where the predicted class is wrong, the classifier assigns non-zero probability to CRASHED for the crashed ensemble members.
p3 <- predict(rf, test, type = 'prob')
head(data.frame(p3, test$y))
## CRASHED LOWCARBON RAN test.y
## 400 0.148 0.636 0.216 CRASHED
## 401 0.304 0.220 0.476 RAN
## 402 0.254 0.178 0.568 RAN
## 403 0.314 0.218 0.468 RAN
## 404 0.350 0.190 0.460 RAN
## 405 0.182 0.614 0.204 LOWCARBON
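These probabilities can be used directly to screen candidate parameter sets; for example, flag any member whose predicted probability of running successfully falls below a threshold. A minimal sketch (the 0.5 threshold is an assumption, not a calibrated value):
# Flag test members given less than a 50% chance of running successfully
risky <- p3[, 'RAN'] < 0.5
summary(risky)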
mat_unif <- function(nr, nc, cn = NULL){
  # Sample a matrix uniformly from the unit hypercube, with optional column names
  nsamp <- nc * nr
  samp <- runif(nsamp)
  out <- matrix(data = samp, nrow = nr, ncol = nc)
  colnames(out) <- cn
  out
}
xpred <- mat_unif(nr = 500, nc = ncol(X), cn = colnames(X))
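If a more space-filling design is preferred to purely random uniform sampling, a Latin hypercube could be used instead. An optional sketch, assuming the lhs package is installed:
# library(lhs)
# xpred_lhs <- maximinLHS(500, ncol(X))
# colnames(xpred_lhs) <- colnames(X)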
Predict the class probabilities at the sampled inputs.
p4 <- predict(rf, xpred, type = 'prob')
library(viridis)
## Loading required package: viridisLite
library(fields)
## Loading required package: spam
## Loading required package: dotCall64
## Loading required package: grid
## Spam version 2.7-0 (2021-06-25) is loaded.
## Type 'help( Spam)' or 'demo( spam)' for a short introduction
## and overview of this package.
## Help for individual functions is also obtained by adding the
## suffix '.spam' to the function name, e.g. 'help( chol.spam)'.
##
## Attaching package: 'spam'
## The following objects are masked from 'package:base':
##
## backsolve, forwardsolve
## See https://github.com/NCAR/Fields for
## an extensive vignette, other supplements and source code
# Map each predicted class probability to one of 10 viridis colours
cols_crashed <- viridis(10)[as.numeric(cut(p4[, 1], breaks = 10))]
cols_lowcarbon <- viridis(10)[as.numeric(cut(p4[, 2], breaks = 10))]
cols_ran <- viridis(10)[as.numeric(cut(p4[, 3], breaks = 10))]
par(mfrow = c(2,2), oma = c(6, 0.1, 0.1, 0.1))
plot(xpred[,4], xpred[,8],
xlab = colnames(X)[4], ylab = colnames(X)[8],
col = cols_crashed,
pch = 19,
main = 'CRASHED')
plot(xpred[,4], xpred[,8],
col = cols_lowcarbon,
xlab = colnames(X)[4], ylab = colnames(X)[8],
pch = 19,
main = 'LOWCARBON')
plot(xpred[,4], xpred[,8],
xlab = colnames(X)[4], ylab = colnames(X)[8],
col = cols_ran,
pch = 19,
main = 'RAN')
reset()
image.plot(legend.only = TRUE,
zlim = c(0,1),
col = viridis(10),
legend.args = list(text = 'Probability', side = 3, line = 1),
legend.shrink = 0.5,
horizontal = TRUE)
pairs(xpred,
gap = 0.2,
col = cols_ran,
pch = 19,
cex = 0.5,
lower.panel = NULL,
labels = 1:d,
xaxt = 'n', yaxt = 'n'
)
reset()
legend('left', legend = paste(1:d, colnames(X)), cex = 1.1, bty = 'n')
library(fields)
image.plot(legend.only = TRUE,
zlim = c(0,1),
col = viridis(10),
legend.args = list(text = 'Probability that model runs', side = 3, line = 1),
legend.shrink = 0.7,
horizontal = TRUE)
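Another way to see which inputs drive failures is to look at the marginal relationship between a single parameter and the predicted probability of running. A brief sketch for f0_io, one of the two parameters plotted above (the lowess smoother is illustrative):
# Restore default plotting margins after the legend-only image.plot
par(mfrow = c(1, 1), mar = c(5, 4, 4, 2) + 0.1)
# Predicted probability of running (class RAN) against f0_io
plot(xpred[, 'f0_io'], p4[, 'RAN'], pch = 19, cex = 0.5, col = 'grey',
     xlab = 'f0_io', ylab = 'Probability that the model runs')
lines(lowess(xpred[, 'f0_io'], p4[, 'RAN']), col = 'red', lwd = 2)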
plot(rf)
# Tune mtry using the out-of-bag error (the response y is the final column, 33)
t <- tuneRF(train[, -33], train[, 33],
            stepFactor = 0.3,
            plot = TRUE,
            ntreeTry = 200,
            trace = TRUE,
            improve = 0.05)
## mtry = 5 OOB error = 7.52%
## Searching left ...
## mtry = 17 OOB error = 7.02%
## 0.06666667 0.05
## Warning in randomForest.default(x, y, mtry = mtryCur, ntree = ntreeTry, :
## invalid mtry: reset to within valid range
## mtry = 57 OOB error = 7.52%
## -0.07142857 0.05
## Searching right ...
## mtry = 1 OOB error = 18.8%
## -1.678571 0.05
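The tuneRF result can be used to pick the mtry with the lowest out-of-bag error and refit. A sketch (there is no guarantee this improves on the stratified forest above):
# First column of the tuneRF result is mtry, second is the OOB error
best_mtry <- t[which.min(t[, 2]), 1]
rf_tuned <- randomForest(y ~ ., data = train, mtry = best_mtry,
                         strata = train$y, sampsize = c(10, 10, 10))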
hist(treesize(rf),
main = "No. of Nodes for the Trees",
col = "grey")
#Variable Importance
varImpPlot(rf,
sort = T,
n.var = 10,
main = "Top 10 - Variable Importance")
importance(rf)
## MeanDecreaseGini
## alpha_io 0.4744212
## a_wl_io 0.4672476
## bio_hum_cn 0.8078075
## b_wl_io 1.7535666
## dcatch_dlai_io 0.3403083
## dqcrit_io 0.7014155
## dz0v_dh_io 0.4423858
## f0_io 3.2470792
## fd_io 0.4312871
## g_area_io 0.4137554
## g_root_io 0.4642400
## g_wood_io 0.4419942
## gs_nvg_io 0.4457127
## hw_sw_io 0.5695595
## kaps_roth 0.6141545
## knl_io 0.4595796
## lai_max_io 0.4852672
## lai_min_io 0.4840112
## lma_io 0.4944905
## n_inorg_turnover 0.4231661
## nmass_io 0.3910695
## nr_io 0.5204867
## retran_l_io 0.4619497
## retran_r_io 0.5417704
## r_grow_io 0.3719335
## rootd_ft_io 0.8503770
## sigl_io 0.4537905
## sorp 0.5984660
## tleaf_of_io 0.3701894
## tlow_io 0.5260909
## tupp_io 0.6705525
## l_vg_soil 0.2818741
partialPlot(rf, train, f0_io)
partialPlot(rf, train, b_wl_io)
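By default partialPlot focuses on the first factor level (CRASHED here); to look at the dependence for a successful run instead, the class can be named explicitly. A sketch:
# Partial dependence of the RAN (model runs) class on f0_io
partialPlot(rf, train, 'f0_io', which.class = 'RAN')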
MDSplot(rf, train$y)
# This function does a simple bootstrap test of the randomForest algorithm
# for a set number of ensemble members (ntrain). It splits the data into a training
# set of ntrain rows and a test set of ntest rows, trains the model, predicts the
# test set and records the misclassification rate for each rep in the output.
# boot_rf <- function(data, ntrain, ntest, nreps){
#
# outvec <- rep(NA, nreps)
#
# for(i in 1:nreps){
#
# n_train_and_test <- ntrain+ntest
#
# all_samp_ix <- sample(1:nrow(data), n_train_and_test)
# train_ix <- all_samp_ix[1:ntrain]
# test_ix <- all_samp_ix[(ntrain+1):n_train_and_test]
#
# train <- data[train_ix, ]
# test <- data[test_ix ,]
#
#
# rf <- randomForest(y~., data=train, proximity=TRUE)
#
#
# pred_test <- predict(rf, test)
# conf_test <- confusionMatrix(pred_test, test$y)
#
# out <- (1 - conf_test$overall['Accuracy'])
# outvec[i] <- out
#
# }
#
# outvec
#
# }
# test the above
# test_boot <- boot_rf(data = data_df, ntrain = 100, ntest = 10, nreps = 10)
# nreps <- 50
#
# nens_vec <- seq(from = 100, to = 400, by = 20)
#
# outmat <- matrix(nrow = nreps, ncol = length(nens_vec))
#
# for(j in 1:length(nens_vec)){
#
# boot_rf_out <- boot_rf(data = data_df, ntrain = nens_vec[j], ntest = 50, nreps = nreps)
#
# outmat[, j] <- boot_rf_out
#
# }
# Mean misclassification rate (1 - accuracy) for each training-set size
# oma_mean <- apply(outmat,2,mean)
#
# plot(nens_vec, oma_mean, type = 'b' )
# boxplot( outmat)