Abstract
This Gaussian process case study is an extension of the StanCon talk, Failure prediction in hierarchical equipment system: spline fitting naval ship failure. Many comparison criteria exist, but in terms of prediction accuracy, the Gaussian process model outperformed the spline model. However, this accuracy comes at the cost of a more detailed and iterative checking process. This case study shows how identification and underfitting problems, diagnosed through pushforward and predictive checks, are addressed by reparameterization and by adding variables. Our data are highly unbalanced per category, with lots of missing data. Also, because the system has a hierarchical structure (for example, shared engine types), a hierarchical model is applicable. For a detailed explanation of the data and the spline model, please refer to this notebook.

\(t\): age index
\(j\): ship index
\(k[j]\): engine index
\[y_{t,j} \sim N(\mu_{t,j},\sigma)\] \[\mu_{t,j} = \mu + \theta_t^{\ age}+\theta_j^{\ ship}+\theta_{k[j]}^{\ engine}+\gamma_{t,j}+\delta_{t,\ k[j]}\]
\[\gamma_{j} \sim \text{MultiNormal}(0,K_{l^{\gamma},{\alpha}^{\gamma}})\] \[\delta_{k} \sim \text{MultiNormal}(0,K_{l^{\delta},{\alpha}^{\delta}})\] \[\theta_t^{\ age} \sim N(0, {\sigma^{age}}^2)\] \[\theta_j^{\ ship} \sim N(0,{\sigma^{ ship}}^2\ )\]
\[\theta_k^{\ engine} \sim N(0,{\sigma^{engine}}^2\ )\]
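To make the model concrete, here is a small R sketch (not part of the case study; every hyperparameter value is made up for illustration) that simulates one draw from this generative model:

# Hypothetical prior simulation of the model above; values are illustrative.
set.seed(1)
N_ages <- 31; N_ships <- 8; N_engines <- 3
engine_of_ship <- sample(1:N_engines, N_ships, replace = TRUE)  # k[j]
sq_exp <- function(x, alpha, rho)
  outer(x, x, function(a, b) alpha^2 * exp(-(a - b)^2 / (2 * rho^2)))
ages <- 1:N_ages
L_gamma <- t(chol(sq_exp(ages, 0.5, 3) + diag(1e-6, N_ages)))   # ship-level GP factor
L_delta <- t(chol(sq_exp(ages, 0.5, 8) + diag(1e-6, N_ages)))   # engine-level GP factor
gamma <- L_gamma %*% matrix(rnorm(N_ages * N_ships), N_ages)    # gamma_j, one column per ship
delta <- L_delta %*% matrix(rnorm(N_ages * N_engines), N_ages)  # delta_k, one column per engine
theta_age    <- rnorm(N_ages, 0, 0.3)
theta_ship   <- rnorm(N_ships, 0, 0.3)
theta_engine <- rnorm(N_engines, 0, 0.3)
mu <- 0.5; sigma <- 0.2
mu_tj <- mu + outer(theta_age, theta_ship, `+`) +
  matrix(theta_engine[engine_of_ship], N_ages, N_ships, byrow = TRUE) +
  gamma + delta[, engine_of_ship]
y_sim <- mu_tj + rnorm(length(mu_tj), 0, sigma)   # y_{t,j}, an age x ship matrix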
data {
  int<lower=1> N;
  int<lower=1> N_engines;
  int<lower=1> N_ships;
  int<lower=1> N_ages_obs;
  int<lower=1> N_ages;
  int<lower=1> ship_engine_ind[N_ships];
  int<lower=1,upper=99> ship_ind[N];
  int<lower=1> age_ind[N];
  vector[N] y;
}
transformed data {
  real ages[N_ages];
  int N_comp = 10;
  for (t in 1:N_ages)
    ages[t] = t;
}
parameters {
  matrix[N_ages,N_engines] GP_engine_std;
  matrix[N_ages,N_ships] GP_ship_std;
  vector[N_ages_obs] age_std;
  vector[N_ships] ship_std;
  vector[N_engines] engine_std;
  real<lower=0> tot_var;
  simplex[N_comp] prop_var;
  real mu;
  real<lower=0> length_GP_engine;
  real<lower=0> length_GP_ship;
  real<lower=0> length_engine_scale;
  real<lower=0> length_ship_scale;
  real<lower=0> length_engine_shape;
  real<lower=0> length_ship_shape;
}
GP_engine_std, GP_ship_std, age_std, ship_std, and engine_std follow standard normal distributions. They are the building blocks for giving normal priors via scaling and Cholesky decomposition (a non-centered parameterization).
tot_var and the simplex prop_var together determine all the \(\sigma\)s in the model through a single vector of variances (vars in the code).
mu is \(\mu\) above (\(\mu_{t,j} = \mu + \theta_t^{\ age}+\theta_j^{\ ship}+\theta_{k[j]}^{\ engine}+\gamma_{t,j}+\delta_{t,\ k[j]}\)).
length_GP_ship and length_GP_engine are the length-scales of the covariance matrices, i.e. \(l^{\gamma}\) and \(l^{\delta}\).
transformed parameters {
  matrix[N_ages,N_engines] GP_engine;
  matrix[N_ages,N_ships] GP_ship;
  vector[N_ages_obs] age_re;
  vector[N_ships] ship_re;
  vector[N_engines] engine_re;
  vector[N_comp] vars;
  real sigma_age;
  real sigma_engine;
  real sigma_ship;
  real sigma_error_ship;
  real sigma_GP_engine;
  real sigma_GP_ship;

  vars = N_comp * prop_var * tot_var;
  sigma_age = sqrt(vars[1]);
  sigma_engine = sqrt(vars[2]);
  sigma_ship = sqrt(vars[3]);
  sigma_GP_engine = sqrt(vars[4]);
  sigma_GP_ship = sqrt(vars[5]);
  sigma_error_ship = sqrt(vars[6]);
  engine_re = sigma_engine * engine_std;
  age_re = sigma_age * age_std;
  ship_re = sigma_ship * ship_std;
  {
    matrix[N_ages, N_ages] cov_engine;
    matrix[N_ages, N_ages] cov_ship;
    matrix[N_ages, N_ages] L_cov_engine;
    matrix[N_ages, N_ages] L_cov_ship;
    cov_engine = cov_exp_quad(ages, sigma_GP_engine, length_GP_engine);
    cov_ship = cov_exp_quad(ages, sigma_GP_ship, length_GP_ship);
    for (age in 1:N_ages) {
      cov_engine[age, age] = cov_engine[age, age] + 1e-6;
      cov_ship[age, age] = cov_ship[age, age] + 1e-6;
    }
    L_cov_engine = cholesky_decompose(cov_engine);
    L_cov_ship = cholesky_decompose(cov_ship);
    GP_engine = L_cov_engine * GP_engine_std; // f_engine
    GP_ship = L_cov_ship * GP_ship_std; // f_ship
  }
}
sigma_age, sigma_engine, and sigma_ship are \(\sigma^{age}\), \(\sigma^{engine}\), \(\sigma^{ship}\) above, as in \(\theta_t^{\ age} \sim N(0, {\sigma^{age}}^2)\).
sigma_GP_ship and sigma_GP_engine are the marginal standard deviations of the covariance matrices of \(\gamma_{j}\) and \(\delta_{k}\), i.e. \({\alpha}^{\gamma}\) and \({\alpha}^{\delta}\).
sigma_error_ship is the \(\sigma\) in \(y_{t,j} \sim N(\mu_{t,j},\sigma)\).
engine_re = sigma_engine * engine_std means \(\theta_k^{\ engine} = \sigma^{engine} \cdot N(0,1)\), so that \(\theta_k^{\ engine} \sim N(0,{\sigma^{engine}}^2)\).
cov_engine and L_cov_engine are \(K_{l^{\delta},{\alpha}^{\delta}}\) and its Cholesky factor \(L_{\delta}\), so that \(L_{\delta}L_{\delta}^T = K_{l^{\delta},{\alpha}^{\delta}}\).
GP_engine = L_cov_engine * GP_engine_std means \(\delta_{k} = L_{\delta} \cdot N(0,1)\), so that \(\delta_{k} \sim \text{MultiNormal}(0,K_{l^{\delta},{\alpha}^{\delta}})\).
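As a quick R check (illustrative, not part of the case study) that multiplying i.i.d. standard normal draws by a Cholesky factor reproduces the target covariance:

# L %*% z with z ~ N(0,1) i.i.d. has covariance L %*% t(L) = K.
ages <- 1:31
K <- outer(ages, ages, function(a, b) exp(-(a - b)^2 / (2 * 5^2))) +
  diag(1e-6, length(ages))     # squared-exponential kernel plus jitter
L <- t(chol(K))                # R's chol() gives the upper factor, so transpose
z <- matrix(rnorm(length(ages) * 5000), nrow = length(ages))
draws <- L %*% z               # each column is one MultiNormal(0, K) draw
max(abs(cov(t(draws)) - K))    # small, and shrinks as the number of draws grows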
model {
  vector[N] obs_mu;
  for (n in 1:N) {
    obs_mu[n] = mu
                + age_re[age_ind[n]]                                  // fixed effects
                + engine_re[ship_engine_ind[ship_ind[n]]]
                + ship_re[ship_ind[n]]
                + GP_engine[age_ind[n],ship_engine_ind[ship_ind[n]]]  // f_engine
                + GP_ship[age_ind[n],ship_ind[n]];                    // f_ship
  }
  y ~ normal(obs_mu, sigma_error_ship);
  to_vector(GP_engine_std) ~ normal(0, 1);
  to_vector(GP_ship_std) ~ normal(0, 1);
  age_std ~ normal(0, 1);
  ship_std ~ normal(0, 1);
  engine_std ~ normal(0, 1);
  mu ~ normal(.5, .5);
  tot_var ~ normal(0, 1);
  length_engine_shape ~ normal(30, 1);
  length_engine_scale ~ normal(8, 1);
  length_ship_shape ~ normal(30, 1);
  length_ship_scale ~ normal(3, 1);
  length_GP_engine ~ weibull(length_engine_shape, length_engine_scale); // (30,8)
  length_GP_ship ~ weibull(length_ship_shape, length_ship_scale);       // (30,3)
}
The obs_mu loop above implements \(\mu_{t,j} = \mu + \theta_t^{\ age}+\theta_j^{\ ship}+\theta_{k[j]}^{\ engine}+\gamma_{t,j}+\delta_{t,\ k[j]}\).
y ~ normal(obs_mu, sigma_error_ship) means \(y_{t,j} \sim N(\mu_{t,j},\sigma)\).
GP_engine_std, GP_ship_std, age_std, ship_std, and engine_std are given standard normal priors because they are used to construct normal distributions via scaling and the Cholesky decomposition.
Priors are also given for \(\mu\), for the total variance \(\text{tot_var}\) (roughly the average of all the \(\sigma^2\)s), and for the length-scales \(l^{\gamma}\) and \(l^{\delta}\).
library(rstan); library(cmdstanr); library(parallel); library(tidyverse); library(dplyr)
setwd(getwd())
set.seed(1954)
.libPaths("~/Rlib")
options(mc.cores = parallel::detectCores())
util <- new.env()
source('stan_utility.R', local=util)
source('gp_utility.R', local=util)
c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")
c_light_trans <- c("#DCBCBC80")
c_light_highlight_trans <- c("#C7999980")
c_mid_trans <- c("#B97C7C80")
c_mid_highlight_trans <- c("#A2505080")
c_dark_trans <- c("#8F272780")
c_dark_highlight_trans <- c("#7C000080")
c_light_teal <- c("#6B8E8E")
c_mid_teal <- c("#487575")
c_dark_teal <- c("#1D4F4F")
c_green_trans <- c("#00FF0080")
c_superfine <- c("#8F272705")
println <- function(msg) { cat(msg); cat("\n") }
printf <- function(pattern, ...) println(sprintf(pattern, ...))
print_file <- function(file) println(readLines(file))
scriptDir <- getwd()
modelDir <- file.path(scriptDir, "models")
dataDir <- file.path(scriptDir, "data")
gp_fit <- function(modelName, data){
  chains <- 4
  parallel_chains <- min(chains, detectCores())
  scriptDir <- getwd()
  delivDir <- file.path(scriptDir, "deliv", modelName)
  prefit <- file.path(delivDir, paste0(modelName, ".rda"))
  stanfile <- file.path(modelDir, modelName, paste0(modelName, ".stan"))
  if (file.exists(prefit)){
    # reuse a cached fit if one exists
    fit <- readRDS(prefit)
  } else {
    mod <- cmdstan_model(stanfile, quiet = FALSE)
    fit <- mod$sample(data, chains = chains, iter_warmup = 500, iter_sampling = 500,
                      parallel_chains = parallel_chains, save_warmup = FALSE)
    dir.create(delivDir, recursive = TRUE, showWarnings = FALSE)
    fit$save_object(file = prefit)
  }
  fit
}
div_detect <- function(stanfit){
  # split draws into divergent and non-divergent transitions
  partition <- util$partition_div(stanfit)
  div_samples <- partition[[1]]
  nondiv_samples <- partition[[2]]
  par(mfrow=c(1, 3))
  plot(nondiv_samples$length_GP_engine, nondiv_samples$sigma_GP_engine, log="xy",
       col=c_dark_trans, pch=16, cex=0.8, xlab="length_GP_engine", ylab="sigma_GP_engine")
  points(div_samples$length_GP_engine, div_samples$sigma_GP_engine,
         col=c_green_trans, pch=16, cex=0.8)
  plot(nondiv_samples$length_GP_engine, nondiv_samples$sigma_error_ship,
       col=c_dark_trans, pch=16, cex=0.8, xlab="length_GP_engine", ylab="sigma_error_ship")
  points(div_samples$length_GP_engine, div_samples$sigma_error_ship,
         col=c_green_trans, pch=16, cex=0.8)
  plot(nondiv_samples$length_engine_scale, nondiv_samples$length_GP_engine,
       col=c_dark_trans, pch=16, cex=0.8, xlab="length_engine_scale", ylab="length_GP_engine")
  points(div_samples$length_engine_scale, div_samples$length_GP_engine,
         col=c_green_trans, pch=16, cex=0.8)
}
Data preparation
N_engines <- 5
N_ships <- 99
N_ages <- 31
N_ages_obs <- 31
ship_engine_ind <- read.csv("data/engine.csv")$engine
ship_ind <- read.csv("data/ship_index.csv")$X0
age_ind <- read.csv("data/x_age.csv", header = FALSE)[-1,1]
y <- read.csv("data/y_count_pwr.csv", header = FALSE)[,1]
data <- list(N = length(y), N_engines = N_engines, N_ships = N_ships, N_ages = N_ages,
             N_ages_obs = N_ages_obs, ship_engine_ind = ship_engine_ind,
             ship_ind = ship_ind, age_ind = age_ind, y = y)
mseNplot <- function(x, y){
  # posterior means of y_new_pred, reshaped to an age x ship matrix
  yhat <- x %>%
    dplyr::filter(str_detect(variable, "y_new_pred")) %>%
    pull(mean)
  yhat <- matrix(yhat, nrow = 31, ncol = 99)
  y_hat <- rep(NA, length(y))
  for (i in 1:length(y)){
    y_hat[i] <- yhat[age_ind[i], ship_ind[i]]
  }
  par(mfrow=c(1,1))
  plot(1, type="n", xlim=c(0,31), ylim=c(-3,4), xlab="age", ylab="scaled failure count")
  for (n in 1:length(y)){
    points(age_ind[n], y[n], col="black", pch=16)
    points(age_ind[n], y_hat[n], col="red")
  }
  mean((y - y_hat)^2)   # return the MSE
}
We started from this reference code and modified it based on the following steps: 1. domain knowledge, 2. prior pushforward checks, 3. prior predictive checks. The last two differ in where they operate: pushforward checks live in parameter space, while predictive checks live in observation (\(y\)) space.
The first modifications were made to simplify the model. The GP prior hyperparameters \(\alpha\) and \(\rho\) could be tuned with algebraic solvers to match one's domain knowledge (such as the INLA example from Talts, 2018, or Michael Betancourt's GP case study), but as we did not have domain knowledge on length_GP_engine and length_GP_ship, we decided to put priors on the shape and scale parameters of their Weibull priors.
length_GP_engine ~ weibull(length_engine_shape, length_engine_scale);
length_GP_ship ~ weibull(length_ship_shape, length_ship_scale);
gp_hp.stan is our baseline code.
Hyperparameterizing removed the need for domain knowledge, but in exchange we had to set reasonable priors for length_GP_engine and length_GP_ship. Our first attempt was simply to experiment with the parameters given in the original model, which ended up with divergences.
> length_engine_shape ~ normal(30, 1);
> length_engine_scale ~ normal(8, 1);
> length_ship_shape ~ normal(30, 1);
> length_ship_scale ~ normal(3, 1);
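A prior pushforward sketch for these hyperpriors (a hypothetical check, not the case study's code): with a shape near 30, the Weibull concentrates tightly just below its scale, so this prior is in fact quite informative.

# Pushforward of length_GP_engine implied by the quoted hyperpriors.
n_sims <- 4000
shape  <- rnorm(n_sims, 30, 1)
scale  <- rnorm(n_sims, 8, 1)
len    <- rweibull(n_sims, shape = shape, scale = scale)
quantile(len, c(0.01, 0.5, 0.99))   # mass piles up near the scale (~8)
hist(len, breaks = 50, main = "pushforward of length_GP_engine")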
modelName <- "gp_hp"
sf_hp <- gp_fit(modelName, data)
rsf_hp <- read_stan_csv(sf_hp$output_files())
util$check_all_diagnostics(rsf_hp)
## [1] "n_eff / iter looks reasonable for all parameters"
## [1] "Rhat looks reasonable for all parameters"
## [1] "4 of 2000 iterations ended with a divergence (0.2%)"
## [1] " Try running with larger adapt_delta to remove the divergences"
## [1] "0 of 2000 iterations saturated the maximum tree depth of 10 (0%)"
## [1] "E-FMI indicated no pathological behavior"
partition <- util$partition_div(rsf_hp)
div_samples <- partition[[1]]
nondiv_samples <- partition[[2]]
par(mfrow=c(1, 3))
plot(nondiv_samples$length_GP_engine, nondiv_samples$sigma_GP_engine, log="xy",
col=c_dark_trans, pch=16, cex=0.8, xlab="length_GP_engine", ylab="sigma_GP_engine")
points(div_samples$length_GP_engine, div_samples$sigma_GP_engine,
col=c_green_trans, pch=16, cex=0.8)
plot(nondiv_samples$length_GP_engine, nondiv_samples$sigma_error_ship,
col=c_dark_trans, pch=16, cex=0.8, xlab="length_GP_engine", ylab="sigma_error_ship")
points(div_samples$length_GP_engine, div_samples$sigma_error_ship,
col=c_green_trans, pch=16, cex=0.8)
plot(nondiv_samples$length_GP_engine, nondiv_samples$length_engine_scale,
col=c_dark_trans, pch=16, cex=0.8, xlab="length_GP_engine", ylab="length_engine_scale")
points(div_samples$length_GP_engine, div_samples$length_engine_scale,
col=c_green_trans, pch=16, cex=0.8)
We observed two problems from the results.
As can be seen from the result, the posterior values do not move far from the given prior values, which could indicate identification issues (see Betancourt's "identity crisis" case study). This might be because the data are scaled for security reasons and therefore have a relatively weak impact on the prior update. To investigate the parameters' true range, we gave weak priors. We chose \(\text{lognormal}(0,3)\).
length_engine_scale ~ lognormal(0, 3);
length_ship_scale ~ lognormal(0, 3);
length_engine_shape ~ lognormal(0, 3);
length_ship_shape ~ lognormal(0, 3);
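For intuition, lognormal(0, 3) spans several orders of magnitude, which is what we mean by "weak" here (a quick R check):

# 5% / 50% / 95% quantiles of lognormal(0, 3): roughly 0.007, 1, 140.
exp(qnorm(c(0.05, 0.5, 0.95), mean = 0, sd = 3))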
modelName <- "gp_hp_lognorm"
data$hp_scale <- 3
sf_hp_ln <- gp_fit(modelName, data)
rsf_hp_ln <- read_stan_csv(sf_hp_ln$output_files())
util$check_all_diagnostics(rsf_hp_ln)
## [1] "n_eff / iter looks reasonable for all parameters"
## [1] "Rhat for parameter length_GP_ship is 1.13027917437218!"
## [1] " Rhat above 1.1 indicates that the chains very likely have not mixed"
## [1] "271 of 2000 iterations ended with a divergence (13.55%)"
## [1] " Try running with larger adapt_delta to remove the divergences"
## [1] "349 of 2000 iterations saturated the maximum tree depth of 10 (17.45%)"
## [1] " Run again with max_treedepth set to a larger value to avoid saturation"
## [1] "E-FMI indicated no pathological behavior"
div_detect(rsf_hp_ln)
emp_le_shape <- sf_hp_ln$summary("length_engine_shape") %>% pull(mean)
emp_le_scale <- sf_hp_ln$summary("length_engine_scale") %>% pull(mean)
emp_ls_shape <- sf_hp_ln$summary("length_ship_shape") %>% pull(mean)
emp_ls_scale <- sf_hp_ln$summary("length_ship_scale") %>% pull(mean)
print(c(emp_le_shape, emp_le_scale, emp_ls_shape,emp_ls_scale))
## [1] 18.725967 5.773572 11.603859 8.269962
Restricting the range of length_engine_* and length_ship_* is needed to address the divergence problems, so we decided to plug the empirical posterior means in as the prior means.
modelName <- "gp_hp_norm"
data$hp_scale <- 1
data$emp_le_shape <- emp_le_shape
data$emp_le_scale <- emp_le_scale
data$emp_ls_shape <- emp_ls_shape
data$emp_ls_scale <- emp_ls_scale
sf_hp_n <- gp_fit(modelName, data)
rsf_hp_n <- read_stan_csv(sf_hp_n$output_files())
util$check_all_diagnostics(rsf_hp_n)
## [1] "n_eff / iter looks reasonable for all parameters"
## [1] "Rhat looks reasonable for all parameters"
## [1] "0 of 2000 iterations ended with a divergence (0%)"
## [1] "0 of 2000 iterations saturated the maximum tree depth of 10 (0%)"
## [1] "E-FMI indicated no pathological behavior"
div_detect(rsf_hp_n)
emp_le_shape_2 <- sf_hp_n$summary("length_engine_shape") %>% pull(mean)
emp_le_scale_2 <- sf_hp_n$summary("length_engine_scale") %>% pull(mean)
emp_ls_shape_2 <- sf_hp_n$summary("length_ship_shape") %>% pull(mean)
emp_ls_scale_2 <- sf_hp_n$summary("length_ship_scale") %>% pull(mean)
print(c(emp_le_shape_2, emp_le_scale_2, emp_ls_shape_2,emp_ls_scale_2))
## [1] 18.707524 4.120272 11.289554 6.821292
As this resulted in fewer divergences, we set these values as our prior parameters for the further experiments.
data$emp_le_shape <- emp_le_shape_2
data$emp_le_scale <- emp_le_scale_2
data$emp_ls_shape <- emp_ls_shape_2
data$emp_ls_scale <- emp_ls_scale_2
From the plot below, our \(\tilde{y}\) looks different from the real data \(y\). Below is how we updated the model iteratively to bring \(\tilde{y}\) closer to \(y\).
sf_hp_n_sm <- sf_hp_n$summary()
mse_hp_n <- mseNplot(sf_hp_n_sm,y)
mse_hp_n
## [1] 0.3511693
Over the lifetimes, the standard deviation for each engine looks quite different: for example, engines 4 and 5 show large variance while engine 3 shows small variance in the given data. We attempted to give a different error term to each engine. \[y_{t,j} \sim N(\mu_{t,j},\sigma_{k[j]})\]
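A quick way to see this in the data (a sketch using the objects built in the data-preparation step above):

# Empirical SD of the scaled failure counts by engine type.
engine_of_obs <- ship_engine_ind[ship_ind]   # engine index for each observation
tapply(y, engine_of_obs, sd)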
transformed parameters {
  real sigma_error_ship;   // ->  real sigma_error_ship[5];
  ...
  for (i in 1:5)
    sigma_error_ship[i] = sqrt(vars[i + 5]);   // replaces sigma_error_ship = sqrt(vars[6]);
}
model {
  ...
  y[n] ~ normal(obs_mu[n], sigma_error_ship[ship_engine_ind[ship_ind[n]]]);
}
modelName <- "gp_hp_n_5var"
sf_gp_hp_n_5var <- gp_fit(modelName, data)
rsf_gp_hp_n_5var <- read_stan_csv(sf_gp_hp_n_5var$output_files())
util$check_all_diagnostics(rsf_gp_hp_n_5var)
## [1] "n_eff / iter looks reasonable for all parameters"
## [1] "Rhat for parameter length_GP_ship is 1.12088734806734!"
## [1] " Rhat above 1.1 indicates that the chains very likely have not mixed"
## [1] "0 of 2000 iterations ended with a divergence (0%)"
## [1] "0 of 2000 iterations saturated the maximum tree depth of 10 (0%)"
## [1] "E-FMI indicated no pathological behavior"
div_detect(rsf_gp_hp_n_5var)
sf_gp_hp_n_5var_sm <- sf_gp_hp_n_5var$summary()
mse_res_n_5var <- mseNplot(sf_gp_hp_n_5var_sm,y)
mse_res_n_5var
## [1] 0.3133182
As the MSE did not improve enough to justify engine-specific error terms, we decided to use the same variance for each engine.
As the predictive check shows, the variance of the observed values is larger than the variance of the fitted values, i.e. the model underfits. To prevent underfitting, we want the latent variable \(\mu\) to fit the observed data \(y\) more closely, so we decided to give a smaller prior to sigma_error_ship (\(\because y \sim N(\mu, \sigma_{error\ ship})\)). Recall how sigma_error_ship is constructed:
tot_var ~ normal(0,1);
vars = N_comp * prop_var * tot_var;
sigma_age = sqrt(vars[1]);
sigma_engine = sqrt(vars[2]);
sigma_ship = sqrt(vars[3]);
sigma_GP_engine = sqrt(vars[4]);
sigma_GP_ship = sqrt(vars[5]);
sigma_error_ship = sqrt(vars[6]);
The prior for sigma_error_ship in the above models is roughly half-normal(0,1): with a uniform simplex, \(\text{vars}[6] = \text{N_comp} \cdot \text{prop_var}[6] \cdot \text{tot_var} = \text{tot_var}\), and \(\text{tot_var} \sim \text{half-normal}(0,1)\).
Since the prior of sigma_error_ship is half-normal, giving it a smaller variance is equivalent to giving it a smaller expected value. This is because the expectation of a half-normal distribution \(X \sim \text{half-}N(0,\sigma^2)\) is \(\sigma\sqrt{2/\pi}\).
sigma_error_ship ~ normal(0,1) -> sigma_error_ship ~ normal(0,0.1)
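A quick Monte Carlo check of that expectation (illustrative):

# E|X| for X ~ N(0, sigma^2) is sigma * sqrt(2 / pi), about 0.8 * sigma.
sigma <- 0.1
mean(abs(rnorm(1e6, 0, sigma)))   # close to the analytic value below
sigma * sqrt(2 / pi)              # 0.07978846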
modelName <- "gp_underfit"
sf_underfit <- gp_fit(modelName, data)
rsf_underfit <- read_stan_csv(sf_underfit$output_files())
util$check_all_diagnostics(rsf_underfit)
## [1] "n_eff / iter looks reasonable for all parameters"
## [1] "Rhat for parameter length_GP_ship is 1.1048461924472!"
## [1] " Rhat above 1.1 indicates that the chains very likely have not mixed"
## [1] "0 of 2000 iterations ended with a divergence (0%)"
## [1] "0 of 2000 iterations saturated the maximum tree depth of 10 (0%)"
## [1] "E-FMI indicated no pathological behavior"
div_detect(rsf_underfit)
sf_underfit_sm <- sf_underfit$summary()
mse_res_underfit <- mseNplot(sf_underfit_sm,y)
mse_res_underfit
## [1] 0.3226397
The MSE decreased, so we decided to keep the smaller prior on sigma_error_ship.
To summarize the modifications: we did not have domain knowledge for length_GP_engine and length_GP_ship, and therefore used hyperparameters. length_engine_scale, length_ship_scale, length_engine_shape, and length_ship_shape were calibrated based on divergences and pair plots: normal priors whose means were learned under a weak lognormal prior were given. We gave a smaller prior on sigma_error_ship to prevent underfitting. This model gave the smallest MSE.

Further developments are as follows:
Trangucci, R. (2017). Hierarchical Gaussian Processes in Stan. StanCon 2017.
Margossian, C. C., Vehtari, A., Simpson, D., & Agrawal, R. (2020). Hamiltonian Monte Carlo using an adjoint-differentiated Laplace approximation. arXiv preprint arXiv:2004.12550.