The Sum-to-Zero Constraint in Stan

Author

Mitzi Morris

Load libraries and set up notebook.

Show Code
# libraries used in this notebook
import os
import numpy as np
import pandas as pd
import geopandas as gpd
import plotnine as p9
import libpysal
from splot.libpysal import plot_spatial_weights 
from random import randint

from cmdstanpy import CmdStanModel, cmdstan_path, set_cmdstan_path
import logging
cmdstanpy_logger = logging.getLogger("cmdstanpy")
cmdstanpy_logger.setLevel(logging.FATAL)
set_cmdstan_path(os.path.join('/Users', 'mitzi', 'github', 'stan-dev', 'cmdstan'))

import warnings
warnings.filterwarnings('ignore')

import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline

from utils_html import *

# notebook display options
plt.rcParams['figure.figsize'] = (7, 7)
plt.rcParams['figure.dpi'] = 100

np.set_printoptions(precision=2)
np.set_printoptions(suppress=True)
pd.set_option('display.precision', 2)
pd.options.display.float_format = '{:.2f}'.format

Introducing the sum_to_zero_vector Constrained Parameter Type

The sum_to_zero_vector constrained parameter type was introduced in the Stan 2.36 release.

The parameter declaration:

  sum_to_zero_vector[K] beta;

produces a vector of size K such that sum(beta) = 0. The unconstrained representation requires only K - 1 values because the last is determined by the first K - 1.

If you put a normal prior on a sum-to-zero vector, the resulting marginal variance will be smaller than the intended normal variance. To recover the intended variance, scale the prior as follows:

parameters {
  sum_to_zero_vector[N] z;
}
model {
  z ~ normal(0, sqrt(N / (N - 1.0)) * sigma);
}

where sigma is the intended standard deviation. (Note: it is slightly more efficient to pre-compute sqrt(N / (N - 1.0)) in the transformed data block.)
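
As a quick simulation check (not part of the model code), note that a sum_to_zero_vector with an elementwise normal(0, sigma) prior has the same distribution as an i.i.d. normal vector with its mean subtracted, so the scaling can be verified directly with numpy:

# simulation check of the variance scaling (illustration only)
rng = np.random.default_rng(seed=1234)
N, sigma, n_draws = 5, 1.0, 200_000

# centering i.i.d. normal(0, sigma) draws yields marginal variance sigma^2 * (N - 1) / N
z = rng.normal(0.0, sigma, size=(n_draws, N))
z -= z.mean(axis=1, keepdims=True)
print(z.var(axis=0))                          # approx 0.8 for N = 5, sigma = 1

# scaling the prior scale by sqrt(N / (N - 1)) restores the intended marginal variance
z = rng.normal(0.0, np.sqrt(N / (N - 1.0)) * sigma, size=(n_draws, N))
z -= z.mean(axis=1, keepdims=True)
print(z.var(axis=0))                          # approx 1.0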

Prior to Stan 2.36, a sum-to-zero constraint could be implemented either

  • as a “hard” sum to zero constraint, where the parameter is declared to be an \(N-1\) length vector with a corresponding \(N\)-length transformed parameter whose first \(N-1\) elements are the same as the corresponding parameter vector, and the \(N^{th}\) element is the negative sum of the \(N-1\) elements.

  • as a “soft” sum to zero constraint, where the parameter is an \(N\)-length vector with a prior on the sum of the elements that keeps it within \(\epsilon\) of \(0\).

The choice between the two was largely an empirical one. As a general rule, for small vectors, the hard sum-to-zero constraint is more efficient; for larger vectors, the soft sum-to-zero constraint is faster, but much depends on the specifics of the model and the data.

The hard sum-to-zero constraint is often problematic. For small \(N\) and models with sensible priors, the hard sum-to-zero is usually satisfactory. But as the size of the vector grows, it distorts the marginal variance of the \(N^{th}\) element. Given a parameter vector \[ x_1, x_2, \dots, x_{N-1} \sim \text{i.i.d. } N(0, \sigma^2), \] each of the free elements \(x_1, \ldots, x_{N-1}\) has variance \(\sigma^2\). However, the \(N^{th}\) element is defined deterministically as \[ x_N = -\sum_{i=1}^{N-1} x_i \] and, by the properties of independent normal variables, its variance is inflated by a factor of \(N-1\): \[ \operatorname{Var}(x_N) = \operatorname{Var}\Bigl(-\sum_{i=1}^{N-1} x_i\Bigr) = \sum_{i=1}^{N-1} \operatorname{Var}(x_i) = (N-1)\sigma^2. \] For large vectors, MCMC samplers struggle with the hard sum-to-zero constraint, as every change to any of the \(N-1\) free elements also requires a corresponding change to the \(N^{th}\) element; balancing these changes introduces potential non-identifiabilities.
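
A small simulation (illustration only) makes the inflation concrete: build the hard-constrained vector from N - 1 free standard-normal draws and inspect the marginal variances.

# marginal variances under the hard sum-to-zero construction (illustration only)
rng = np.random.default_rng(seed=1234)
N, sigma, n_draws = 10, 1.0, 200_000

x_free = rng.normal(0.0, sigma, size=(n_draws, N - 1))   # the N - 1 free elements
x_last = -x_free.sum(axis=1, keepdims=True)              # deterministic Nth element
x = np.hstack([x_free, x_last])
print(x.var(axis=0))   # first N - 1 entries near sigma^2; last entry near (N - 1) * sigma^2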

The soft sum-to-zero constraint is problematic for the following reasons.

  • The tolerance \(\epsilon\) (the scale of the penalty) must be chosen by the analyst. Too large, and the result is too far from zero to be effective; too small, and the sampler cannot satisfy the constraint.
  • The soft constraint only penalizes deviations from zero, leading to weaker identifiability of the parameters. This can lead to slow convergence and mixing, as the sampler explores nearly non-identified regions.
  • The marginal variances may not reflect the intended prior.

The sum_to_zero_vector transform ensures that each element of the resulting constrained vector has the same variance. This improves the sampler performance, providing fast computation and good effective sample size. This becomes increasingly noticeable as models increase in size and complexity. To demonstrate this, in this notebook we consider two different classes of models:

  • Multi-level regressions for binomial data with group-level categorical predictors.
  • Spatial models for areal data.

For these models, we provide three implementations which differ only in the implementation of the sum-to-zero constraint: the built-in sum_to_zero_vector, and the hard and soft sum-to-zero implementations. We fit each model to the same dataset, using the same random seed, and then compare the summary statistics for the constrained parameter values. Since the models are nearly equivalent except in the implied prior distributions on the vector, we expect that all three implementations should produce nearly equivalent estimates; what differs is the speed of computation, as measured by effective samples per second.

Note

The spatial models are taken from a set of notebooks available from the GitHub repo https://github.com/mitzimorris/geomed_2024.

Multi-level Models with Group-level Categorical Predictors

In this section we consider a model which estimates per-demographic disease prevalence rates for a population. The model is taken from Gelman and Carpenter, 2020, Bayesian Analysis of Tests with Unknown Specificity and Sensitivity. It combines a model for multilevel regression and post-stratification with a likelihood that accounts for test sensitivity and specificity.

The data consists of:

  • A set of per-demographic aggregated outcomes of a diagnostic test procedure, with unequal numbers of tests per demographic.

  • A corresponding set of demographic descriptors encoded as a vector of categorical values. In this example these are named sex, age, eth, and edu, but there can be any number of demographic predictors with any number of categories.

  • The specified test sensitivity and specificity.

Putting a sum-to-zero constraint on the categorical coefficient vectors allows us to model all categories as offsets from the mean. We prefer this to the alternatives of dropping a category as a reference level or using some other contrast-coding strategy.

The Stan model

The full model is in file binomial_4preds_ozs.stan. It provides an estimate of the true prevalence based on binary tests with a given (or unknown) test sensitivity and specificity, and uses a sum_to_zero_vector to constrain the group-level parameters age, eth, and edu.

In order to put the intended normal priors on beta_age, beta_eth, and beta_edu, we need to scale the prior standard deviations, as described above. The scaling factors are pre-computed in the transformed data block, and applied as part of the prior.

Relevant excerpts from binomial_4preds_ozs.stan

transformed data {
  // scaling factors for marginal variances of sum_to_zero_vectors
  real s_age = sqrt(N_age / (N_age - 1.0));
  real s_eth = sqrt(N_eth / (N_eth - 1.0));
  real s_edu = sqrt(N_edu / (N_edu - 1.0));
}
parameters {
  real beta_0;
  real beta_sex;
  real<lower=0> sigma_age, sigma_eth, sigma_edu;
  sum_to_zero_vector[N_age] beta_age;
  sum_to_zero_vector[N_eth] beta_eth;
  sum_to_zero_vector[N_edu] beta_edu;
}
transformed parameters {
  // true prevalence
  vector[N] p = inv_logit(beta_0 + beta_sex * sex_c + beta_age[age]
              + beta_eth[eth] + beta_edu[edu]);
  // incorporate test sensitivity and specificity.
  vector[N] p_sample = p * sens + (1 - p) * (1 - spec);
}
model {
  pos_tests ~ binomial(tests, p_sample);  // likelihood
  // scale normal priors on sum_to_zero_vectors
  beta_age ~ normal(0, s_age * sigma_age);
  beta_eth ~ normal(0, s_eth * sigma_eth);
  beta_edu ~ normal(0, s_edu * sigma_edu);
  ...
}
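
As a small numeric illustration (not part of the model) of the adjustment in the transformed parameters block: for the test sensitivity and specificity used later in this notebook, the apparent positive-test rate p_sample is noticeably lower than the true prevalence p. The prevalence values here are hypothetical.

# apparent positive-test rate implied by the true prevalence (illustration only)
sens, spec = 0.75, 0.9995                     # values used for the simulated data below
p = np.array([0.01, 0.05, 0.10])              # hypothetical true prevalences
p_sample = p * sens + (1 - p) * (1 - spec)    # same formula as the transformed parameters block
print(p_sample)                               # approx [0.008, 0.038, 0.075]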

Simulating Data

To investigate the predictive behavior of this model at different timepoints in a pandemic, we have written a data-generating program to create datasets given the baseline disease prevalence, test specificity and sensitivity, and the desired number of diagnostic tests. In the generated quantities block we use Stan’s PRNG functions to populate the true weights for the categorical coefficient vectors and the relative percentages of per-category observations. Then we use a set of nested loops to generate the data for each demographic, using the PRNG equivalent of the model likelihood.

The full data-generating program is in file gen_binomial_4_preds.stan. The helper function simulate_data in file utils.py sets up the data-generating program according to the specified number of demographic categories, approximate average number of observations per category, and baseline disease prevalence, test specificity and sensitivity. This allows us to create datasets for large and small populations and for finer or more coarse-grained sets of categories. The larger the number of strata overall, the more observations are needed to get good coverage. Because the modeled data pos_tests is generated according to the Stan model’s likelihood, the model is a priori well-specified with respect to the data.

Note

Because the data-generating parameters and percentage of observations per category are generated at random, some datasets may have very low overall disease rates and/or many unobserved strata. While this is informative for understanding what is consistent with the generating distributions, it results in estimates of varying effect parameters with near-zero values, which are on the boundary of parameter space, and hence very challenging to fit. Because we wish to compare baseline performance, we have cherry-picked a seed which generates datasets that avoid this issue.

Generate several datasets of different sizes.

Show Code
# specify data shapes and test characteristics
N_eth = 3
N_edu = 5
N_age = 9
baseline = -3.5
sens = 0.75
spec = 0.9995


# generate datasets
from utils import simulate_data
data_tiny = simulate_data(N_eth, N_edu, N_age, baseline, sens, spec, 4, seed=56789)
data_small = simulate_data(N_eth, N_edu, N_age, baseline, sens, spec, 17, seed=data_tiny['seed'])
data_large = simulate_data(N_eth, N_edu, N_age, baseline, sens, spec, 200, seed=data_tiny['seed'])

The data generating program creates unequal numbers of tests per demographic. The following tables summarize the resulting distribution.

Show Code
print("Small Dataset")
print("tests total " + str(np.sum(data_small['tests'])))
small_df = pd.DataFrame({'tests per demographic': data_small['tests']}).describe()
small_df.drop('count', inplace=True)
small_df
Small Dataset
tests total 4451
tests per demographic
mean 16.49
std 18.51
min 0.00
25% 5.00
50% 10.00
75% 21.75
max 119.00
Show Code
print("Large Dataset")
print("tests total " + str(np.sum(data_large['tests'])))
large_df = pd.DataFrame({'tests per demographic': data_large['tests']}).describe()
large_df.drop('count', inplace=True)
large_df
Large Dataset
tests total 53863
tests per demographic
mean 199.49
std 218.00
min 4.00
25% 61.25
50% 123.00
75% 258.50
max 1401.00
Show Code
print("Tiny Dataset")
print("tests total " + str(np.sum(data_tiny['tests'])))
tiny_df = pd.DataFrame({'tests per demographic': data_tiny['tests']}).describe()
tiny_df.drop('count', inplace=True)
tiny_df
Tiny Dataset
tests total 947
tests per demographic
mean 3.51
std 4.38
min 0.00
25% 1.00
50% 2.00
75% 5.00
max 28.00

Model Fitting

We fit each model to the large and small datasets. We record the seed used for the first run and use it for all subsequent fits; this ensures that the models have the same set of initial parameter values for all parameters except the constrained params beta_eth, beta_edu, and beta_age.

Model 1: sum_to_zero_vector

This model is in file binomial_4preds_ozs.stan.

Compile and fit the model to data.

Show Code
binomial_ozs_mod = CmdStanModel(stan_file=os.path.join('stan', 'binomial_4preds_ozs.stan'))
print("Model 1:  sum_to_zero_vector, small dataset")
binomial_ozs_fit = binomial_ozs_mod.sample(data=data_small, parallel_chains=4)

a_seed = binomial_ozs_fit.metadata.cmdstan_config['seed']
print("Model 1:  sum_to_zero_vector, large dataset")
binomial_ozs_fit_lg = binomial_ozs_mod.sample(data=data_large, parallel_chains=4, seed=a_seed)

Model 2: Hard sum-to-zero constraint

This model is in file binomial_4preds_hard.stan.

Compile and fit the model to data.

Show Code
binomial_hard_mod = CmdStanModel(stan_file=os.path.join('stan', 'binomial_4preds_hard.stan'))
print("\nModel 2:  hard sum-to-zero constraint, small dataset")
binomial_hard_fit = binomial_hard_mod.sample(data=data_small, parallel_chains=4, seed=a_seed)
print("Model 2:  hard sum-to-zero constraint, large dataset")
binomial_hard_fit_lg = binomial_hard_mod.sample(data=data_large, parallel_chains=4, seed=a_seed)

Model 3: Soft sum-to-zero constraint

This model is in file binomial_4preds_soft.stan.

Compile and fit the model to data.

Show Code
binomial_soft_mod = CmdStanModel(stan_file=os.path.join('stan', 'binomial_4preds_soft.stan'))
print("\nModel 3:  soft sum-to-zero constraint, small dataset")
binomial_soft_fit = binomial_soft_mod.sample(data=data_small, parallel_chains=4, seed=a_seed)
print("Model 3:  soft sum-to-zero constraint, large dataset")
binomial_soft_fit_lg = binomial_soft_mod.sample(data=data_large, parallel_chains=4, seed=a_seed)

Model Checking and Comparison

Check implementations

If all three implementations are correct, we expect that they will produce the same estimates for all parameters. As always, we check the R-hat values and effective sample sizes. To do this, we tabulate the summary statistics from each of the three models, fit to the large and small datasets.

Tabulate summary statistics from all fits.

Show Code
# small dataset
ozs_fit_summary = binomial_ozs_fit.summary(sig_figs=2)
ozs_fit_summary.index =  ozs_fit_summary.index.astype(str) + "  - ozs"


hard_fit_summary = binomial_hard_fit.summary(sig_figs=2)
hard_fit_summary.index = hard_fit_summary.index.astype(str) + " - hard"

soft_fit_summary = binomial_soft_fit.summary(sig_figs=2)
soft_fit_summary.index = soft_fit_summary.index.astype(str) + " - soft"

small_data_fits_summary = pd.concat([ozs_fit_summary, hard_fit_summary, soft_fit_summary])

# large dataset
ozs_fit_lg_summary = binomial_ozs_fit_lg.summary(sig_figs=2)
ozs_fit_lg_summary.index =  ozs_fit_lg_summary.index.astype(str) + "  - ozs"

hard_fit_lg_summary = binomial_hard_fit_lg.summary(sig_figs=2)
hard_fit_lg_summary.index = hard_fit_lg_summary.index.astype(str) + " - hard"

soft_fit_lg_summary = binomial_soft_fit_lg.summary(sig_figs=2)
soft_fit_lg_summary.index = soft_fit_lg_summary.index.astype(str) + " - soft"

large_data_fits_summary = pd.concat([ozs_fit_lg_summary, hard_fit_lg_summary, soft_fit_lg_summary])

Eth

Fitted estimates

Show Code
beta_eth_summary = summarize_predictor(small_data_fits_summary, 'beta_eth\[')
beta_eth_summary_lg = summarize_predictor(large_data_fits_summary, 'beta_eth\[')
labels = expand_true(data_small['beta_eth'], "Datagen<br>values")
small_html = style_dataframe(beta_eth_summary, 3).to_html()
large_html = style_dataframe(beta_eth_summary_lg, 3).to_html()
display_side_by_side(small_html, large_html, labels)

sigma_eth_summary = summarize_predictor(small_data_fits_summary, 'sigma_eth')
sigma_eth_summary_lg = summarize_predictor(large_data_fits_summary, 'sigma_eth')
small_html = style_dataframe(sigma_eth_summary, 4).to_html()
large_html = style_dataframe(sigma_eth_summary_lg, 4).to_html()
display_side_by_side(small_html, large_html)
 
Datagen values (beta_eth): 1.97, -0.80, 0.75

Small Dataset
  Mean StdDev ESS_bulk/s R_hat
beta_eth[1] - ozs 1.40 0.10 2800.00 1.00
beta_eth[1] - hard 1.40 0.10 1900.00 1.00
beta_eth[1] - soft 1.40 0.10 98.00 1.00
beta_eth[2] - ozs -1.50 0.08 3100.00 1.00
beta_eth[2] - hard -1.50 0.08 1900.00 1.00
beta_eth[2] - soft -1.50 0.08 100.00 1.00
beta_eth[3] - ozs 0.10 0.08 3200.00 1.00
beta_eth[3] - hard 0.10 0.08 5300.00 1.00
beta_eth[3] - soft 0.10 0.08 130.00 1.00
Large Dataset
  Mean StdDev ESS_bulk/s R_hat
beta_eth[1] - ozs 1.30 0.03 2700.00 1.00
beta_eth[1] - hard 1.30 0.03 1900.00 1.00
beta_eth[1] - soft 1.30 0.03 460.00 1.00
beta_eth[2] - ozs -1.50 0.02 3200.00 1.00
beta_eth[2] - hard -1.50 0.02 1800.00 1.00
beta_eth[2] - soft -1.50 0.02 530.00 1.00
beta_eth[3] - ozs 0.17 0.02 3600.00 1.00
beta_eth[3] - hard 0.18 0.02 5100.00 1.00
beta_eth[3] - soft 0.18 0.02 570.00 1.00
 
Small Dataset
  Mean StdDev ESS_bulk/s R_hat
sigma_eth - ozs 1.10 0.40 3600.00 1.00
sigma_eth - hard 1.30 0.44 3100.00 1.00
sigma_eth - soft 1.20 0.41 180.00 1.00
Large Dataset
  Mean StdDev ESS_bulk/s R_hat
sigma_eth - ozs 1.00 0.38 3500.00 1.00
sigma_eth - hard 1.30 0.45 3300.00 1.00
sigma_eth - soft 1.20 0.40 880.00 1.00

Edu

Fitted estimates

Show Code
beta_edu_summary = summarize_predictor(small_data_fits_summary, 'beta_edu\[')
beta_edu_summary_lg = summarize_predictor(large_data_fits_summary, 'beta_edu\[')
labels = expand_true(data_small['beta_edu'], "Datagen<br>values")
small_html = style_dataframe(beta_edu_summary, 3).to_html()
large_html = style_dataframe(beta_edu_summary_lg, 3).to_html()
display_side_by_side(small_html, large_html, labels)

sigma_edu_summary = summarize_predictor(small_data_fits_summary, 'sigma_edu')
sigma_edu_summary_lg = summarize_predictor(large_data_fits_summary, 'sigma_edu')
small_html = style_dataframe(sigma_edu_summary, 4).to_html()
large_html = style_dataframe(sigma_edu_summary_lg, 4).to_html()
display_side_by_side(small_html, large_html)
 
Datagen values (beta_edu): 0.24, -0.22, 0.87, -0.85, 1.50

Small Dataset
  Mean StdDev ESS_bulk/s R_hat
beta_edu[1] - ozs -0.18 0.12 3600.00 1.00
beta_edu[1] - hard -0.19 0.12 2700.00 1.00
beta_edu[1] - soft -0.18 0.13 160.00 1.00
beta_edu[2] - ozs -0.36 0.12 3900.00 1.00
beta_edu[2] - hard -0.37 0.12 2500.00 1.00
beta_edu[2] - soft -0.36 0.12 150.00 1.00
beta_edu[3] - ozs 0.63 0.11 3100.00 1.00
beta_edu[3] - hard 0.62 0.11 2000.00 1.00
beta_edu[3] - soft 0.63 0.10 130.00 1.00
beta_edu[4] - ozs -1.20 0.21 2500.00 1.00
beta_edu[4] - hard -1.20 0.21 1600.00 1.00
beta_edu[4] - soft -1.20 0.21 95.00 1.00
beta_edu[5] - ozs 1.10 0.12 2700.00 1.00
beta_edu[5] - hard 1.10 0.11 2100.00 1.00
beta_edu[5] - soft 1.10 0.11 120.00 1.00
Large Dataset
  Mean StdDev ESS_bulk/s R_hat
beta_edu[1] - ozs -0.05 0.04 4100.00 1.00
beta_edu[1] - hard -0.05 0.04 2700.00 1.00
beta_edu[1] - soft -0.05 0.04 680.00 1.00
beta_edu[2] - ozs -0.53 0.03 4000.00 1.00
beta_edu[2] - hard -0.53 0.03 2400.00 1.00
beta_edu[2] - soft -0.53 0.03 750.00 1.00
beta_edu[3] - ozs 0.60 0.03 3400.00 1.00
beta_edu[3] - hard 0.60 0.03 2300.00 1.00
beta_edu[3] - soft 0.60 0.03 550.00 1.00
beta_edu[4] - ozs -1.20 0.06 3100.00 1.00
beta_edu[4] - hard -1.20 0.06 1700.00 1.00
beta_edu[4] - soft -1.20 0.06 410.00 1.00
beta_edu[5] - ozs 1.20 0.03 3100.00 1.00
beta_edu[5] - hard 1.20 0.03 1900.00 1.00
beta_edu[5] - soft 1.20 0.03 530.00 1.00
 
Small Dataset
  Mean StdDev ESS_bulk/s R_hat
sigma_edu - ozs 0.84 0.29 2900.00 1.00
sigma_edu - hard 0.85 0.32 2300.00 1.00
sigma_edu - soft 0.92 0.31 130.00 1.00
Large Dataset
  Mean StdDev ESS_bulk/s R_hat
sigma_edu - ozs 0.89 0.31 3200.00 1.00
sigma_edu - hard 0.88 0.31 2700.00 1.00
sigma_edu - soft 0.96 0.30 860.00 1.00

Age

Fitted estimates

Show Code
beta_age_summary = summarize_predictor(small_data_fits_summary, 'beta_age\[')
beta_age_summary_lg = summarize_predictor(large_data_fits_summary, 'beta_age\[')
labels = expand_true(data_small['beta_age'], "Datagen<br>values")
small_html = style_dataframe(beta_age_summary, 3).to_html()
large_html = style_dataframe(beta_age_summary_lg, 3).to_html()
display_side_by_side(small_html, large_html, labels)

sigma_age_summary = summarize_predictor(small_data_fits_summary, 'sigma_age')
sigma_age_summary_lg = summarize_predictor(large_data_fits_summary, 'sigma_age')
small_html = style_dataframe(sigma_age_summary, 4).to_html()
large_html = style_dataframe(sigma_age_summary_lg, 4).to_html()
display_side_by_side(small_html, large_html)
 
Datagen values (beta_age): -0.74, -1.43, -1.67, 1.00, -0.52, -0.77, 0.96, 1.31, 0.25

Small Dataset
  Mean StdDev ESS_bulk/s R_hat
beta_age[1] - ozs -0.78 0.21 3900.00 1.00
beta_age[1] - hard -0.78 0.21 3100.00 1.00
beta_age[1] - soft -0.78 0.20 150.00 1.00
beta_age[2] - ozs -1.20 0.56 1200.00 1.00
beta_age[2] - hard -1.20 0.54 1200.00 1.00
beta_age[2] - soft -1.10 0.53 78.00 1.00
beta_age[3] - ozs -1.30 0.25 4500.00 1.00
beta_age[3] - hard -1.30 0.24 2400.00 1.00
beta_age[3] - soft -1.30 0.25 150.00 1.00
beta_age[4] - ozs 1.20 0.15 2000.00 1.00
beta_age[4] - hard 1.20 0.14 1700.00 1.00
beta_age[4] - soft 1.20 0.15 110.00 1.00
beta_age[5] - ozs 0.04 0.27 3400.00 1.00
beta_age[5] - hard 0.03 0.28 2600.00 1.00
beta_age[5] - soft 0.03 0.28 160.00 1.00
beta_age[6] - ozs -0.79 0.19 2600.00 1.00
beta_age[6] - hard -0.79 0.19 2300.00 1.00
beta_age[6] - soft -0.79 0.19 140.00 1.00
beta_age[7] - ozs 0.94 0.13 1700.00 1.00
beta_age[7] - hard 0.95 0.13 1600.00 1.00
beta_age[7] - soft 0.94 0.13 100.00 1.00
beta_age[8] - ozs 1.20 0.16 1900.00 1.00
beta_age[8] - hard 1.20 0.16 1900.00 1.00
beta_age[8] - soft 1.20 0.16 110.00 1.00
beta_age[9] - ozs 0.69 0.18 2400.00 1.00
beta_age[9] - hard 0.72 0.19 2300.00 1.00
beta_age[9] - soft 0.69 0.19 140.00 1.00
Large Dataset
  Mean StdDev ESS_bulk/s R_hat
beta_age[1] - ozs -0.59 0.06 4400.00 1.00
beta_age[1] - hard -0.59 0.06 2500.00 1.00
beta_age[1] - soft -0.59 0.06 610.00 1.00
beta_age[2] - ozs -1.30 0.16 1200.00 1.00
beta_age[2] - hard -1.30 0.16 1100.00 1.00
beta_age[2] - soft -1.30 0.16 290.00 1.00
beta_age[3] - ozs -1.50 0.08 4400.00 1.00
beta_age[3] - hard -1.50 0.08 2700.00 1.00
beta_age[3] - soft -1.50 0.08 730.00 1.00
beta_age[4] - ozs 1.20 0.04 2400.00 1.00
beta_age[4] - hard 1.20 0.04 1600.00 1.00
beta_age[4] - soft 1.20 0.04 410.00 1.00
beta_age[5] - ozs -0.28 0.09 3200.00 1.00
beta_age[5] - hard -0.28 0.09 2500.00 1.00
beta_age[5] - soft -0.28 0.09 730.00 1.00
beta_age[6] - ozs -0.62 0.05 2800.00 1.00
beta_age[6] - hard -0.62 0.06 2700.00 1.00
beta_age[6] - soft -0.62 0.05 620.00 1.00
beta_age[7] - ozs 1.10 0.04 1900.00 1.00
beta_age[7] - hard 1.10 0.04 1300.00 1.00
beta_age[7] - soft 1.10 0.04 370.00 1.00
beta_age[8] - ozs 1.50 0.05 2100.00 1.00
beta_age[8] - hard 1.50 0.05 1700.00 1.00
beta_age[8] - soft 1.50 0.05 430.00 1.00
beta_age[9] - ozs 0.45 0.06 2400.00 1.00
beta_age[9] - hard 0.45 0.06 2200.00 1.00
beta_age[9] - soft 0.45 0.06 580.00 1.00
 
Small Dataset
  Mean StdDev ESS_bulk/s R_hat
sigma_age - ozs 1.00 0.27 2400.00 1.00
sigma_age - hard 1.10 0.29 2200.00 1.00
sigma_age - soft 1.00 0.27 120.00 1.00
Large Dataset
  Mean StdDev ESS_bulk/s R_hat
sigma_age - ozs 1.10 0.25 3400.00 1.00
sigma_age - hard 1.20 0.28 3200.00 1.00
sigma_age - soft 1.10 0.26 760.00 1.00

These comparisons demonstrate that

  • All models have R-hat values of 1.00 for all group-level parameters and high effective sample sizes.

  • The models recover the sign of each parameter, but not its exact value. With more observations and fewer categories, they do a better job of recovering the true parameters.

  • In almost all cases, estimates for each parameter are the same across implementations to 2 significant figures. In a few cases they are off by 0.01; where they are off, the percentage of observations for that parameter is correspondingly low.

  • The sum_to_zero_vector implementation has the highest number of effective samples per second, excepting a few individual parameters for which the hard sum-to-zero performs as well or better.

Model efficiency

Model efficiency is often measured in iterations per second; however, because the draws from an MCMC sampler may be correlated, the more meaningful measure is the number of effective samples across all chains divided by the total sampling time: this is ESS_bulk/s, the number of effective samples per second. The following table shows the average runtime over 100 runs of each of the three models on the large and small datasets. This data was generated by the script eval_efficiencies.py.
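
The script eval_efficiencies.py is not reproduced here; the following is a minimal sketch of the kind of timing loop it runs, assuming repeated calls to CmdStanModel.sample and wall-clock timing. The function name average_runtime and the default of 10 runs are illustrative, not the script's actual settings.

# minimal sketch of a timing loop (illustrative; the actual eval_efficiencies.py may differ)
import time

def average_runtime(model, data, n_runs=10, seed=None):
    """Time repeated sampler runs; return mean and std dev of wall-clock time in seconds."""
    times = []
    for _ in range(n_runs):
        tic = time.perf_counter()
        model.sample(data=data, parallel_chains=4, seed=seed)
        times.append(time.perf_counter() - tic)
    return np.mean(times), np.std(times)

# hypothetical usage: mean_t, sd_t = average_runtime(binomial_ozs_mod, data_small)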

Tabulate summary statistics from all fits.

Show Code
import json

with open(os.path.join('binomial_runtimes_small.json'), 'r') as file:
    data = json.load(file)
df_small = pd.DataFrame(data)

with open(os.path.join('binomial_runtimes_large.json'), 'r') as file:
    data = json.load(file)
df_large = pd.DataFrame(data)

small_html = style_dataframe(df_small, 3).to_html()
large_html = style_dataframe(df_large, 3).to_html()
display_side_by_side(small_html, large_html)
 
Small Dataset
  avg runtime  runtime std dev  ESS_bulk/s
ozs small 2.21 0.05 1327.11
hard small 3.07 0.08 988.47
soft small 54.49 2.54 63.14
Large Dataset
  avg runtime  runtime std dev  ESS_bulk/s
ozs large 2.41 0.08 1305.86
hard large 3.25 0.09 981.53
soft large 17.41 0.50 181.15

The sum_to_zero_vector is far more efficient than either the hard or soft sum-to-zero constraint.

Prior predictive checks

Prior predictive checks simulate data directly from the prior, in the absence of any observed data. Here we use the simulated dataset to examine the marginal variances of the elements of the sum-to-zero constrained parameters under the hard sum-to-zero constraint and under the built-in sum_to_zero_vector transform.

We have written two models: binomial_4preds_ozs_ppc.stan and binomial_4preds_hard_ppc.stan. We generate a sample from each, and compare the marginal variances of the elements of sum-to-zero constrained parameter beta_age.

Marginal variances of the hard sum-to-zero constraint, sample from the prior

Generate sample from the prior of the hard sum-to-zero model.

Show Code
binomial_hard_ppc_mod = CmdStanModel(stan_file=os.path.join('stan', 'binomial_4preds_hard_ppc.stan'))
binomial_hard_ppc_fit = binomial_hard_ppc_mod.sample(data=data_small, parallel_chains=4)

Get marginal variances of elements of parameter beta_age from the prior.

Show Code
np.var(binomial_hard_ppc_fit.beta_age, axis=0)
array([1.04, 1.19, 1.03, 1.22, 1.23, 1.07, 0.96, 1.02, 9.06])

The above summaries show how the hard sum-to-zero constraint distorts the variance of the \(N^{th}\) element. This is only a problem for very sparse datasets, where the prior swamps the data. With data, this problem goes away; to see this we fit the hard sum-to-zero constraint model on the tiny and small datasets.

Marginal variances of the hard sum-to-zero constraint, tiny dataset

Fit hard sum-to-zero model to the tiny dataset.

Show Code
binomial_hard_fit_tiny = binomial_hard_mod.sample(data=data_tiny, parallel_chains=4)

Get marginal variances of elements of parameter beta_age from the posterior.

Show Code
np.var(binomial_hard_fit_tiny.beta_age, axis=0)
array([0.18, 0.86, 0.25, 0.09, 0.31, 0.15, 0.07, 0.11, 0.16])

Marginal variances of the hard sum-to-zero constraint, small dataset

Fit hard sum-to-zero model to the small dataset.

Show Code
binomial_hard_fit_small = binomial_hard_mod.sample(data=data_small, parallel_chains=4)

Get marginal variances of elements of parameter beta_age from the posterior.

Show Code
np.var(binomial_hard_fit_small.beta_age, axis=0)
array([0.04, 0.31, 0.06, 0.02, 0.08, 0.03, 0.02, 0.03, 0.04])

In contrast, the marginal variances of the sum_to_zero_vector elements are roughly equal in both the prior and the posterior, as the following fits show.

Marginal variances of the built-in sum_to_zero_vector, sample from the prior

Generate sample from the prior of the sum_to_zero_vector model.

Show Code
binomial_ozs_ppc_mod = CmdStanModel(stan_file=os.path.join('stan', 'binomial_4preds_ozs_ppc.stan'))
binomial_ozs_ppc_fit = binomial_ozs_ppc_mod.sample(data=data_small, parallel_chains=4)

Get marginal variances of elements of parameter beta_age from the prior.

Show Code
np.var(binomial_ozs_ppc_fit.beta_age, axis=0)
array([0.08, 0.08, 0.17, 0.12, 0.25, 0.11, 0.12, 0.11, 0.12])

Marginal variances of the built-in sum_to_zero_vector, tiny dataset

Fit the sum_to_zero_vector model to the tiny dataset.

Show Code
binomial_ozs_fit_tiny = binomial_ozs_mod.sample(data=data_tiny, parallel_chains=4)

Get marginal variances of elements of parameter beta_age from the posterior.

Show Code
np.var(binomial_ozs_fit_tiny.beta_age, axis=0)
array([0.18, 0.81, 0.24, 0.09, 0.33, 0.15, 0.07, 0.1 , 0.16])

Discussion

Comparison of these three models with different data regimes demonstrates the following.

  • For a multi-level model with group-level categorical predictors the sum_to_zero_vector provides fast results and good effective sample sizes for both datasets.

  • Model binomial_4preds_ozs.stan. shows how to properly scale the variance of a sum_to_zero_vector constrained parameter in order to put a standard normal prior on it.

  • Prior predictive checks demonstrate the difference between the marginal variances of the sum_to_zero_vector and hard sum-to-zero implementations.

Spatial Models with an ICAR component

Spatial auto-correlation is the tendency for adjacent areas to share similar characteristics. Conditional Auto-Regressive (CAR) and Intrinsic Conditional Auto-Regressive (ICAR) models, first introduced by Besag (1974), account for this by pooling information from neighboring regions. The BYM model (Besag, York, and Mollié, 1991) extends a lognormal Poisson model with an ICAR component for spatial auto-correlation by adding an ordinary random-effects component for non-spatial heterogeneity. The BYM2 model (Riebler et al., 2016) builds on this model and subsequent refinements.

The ICAR, BYM2, and BYM2_multicomp models are explained more fully in the series of notebooks available from the GitHub repo https://github.com/mitzimorris/geomed_2024.

ICAR Models

The key element of the BYM2 model is the ICAR component: a multivariate normal random vector \(\mathbf{\phi}\) whose conditional specification defines each \({\phi}_i\) conditional on the values of its neighbors.

The joint specification rewrites to a pairwise-difference form:

\[ p(\phi) \propto \exp \left\{ {- \frac{1}{2}} \sum_{i \sim j}{({\phi}_i - {\phi}_j)}^2 \right\} \]

Each \({({\phi}_i - {\phi}_j)}^2\) contributes a penalty term based on the distance between the values of neighboring regions. However, \(\phi\) is non-identifiable: a constant added to \(\phi\) washes out of the differences \({\phi}_i - {\phi}_j\). Therefore, a sum-to-zero constraint is needed both to identify and to center \(\phi\).

The Stan implementation of the ICAR component computes the sum of the squared pairwise differences by representing the spatial adjacency matrix as an array of pairs of neighbor indices.

data {
  ...
  // spatial structure
  int<lower = 0> N_edges;  // number of neighbor pairs
  array[2, N_edges] int<lower = 1, upper = N> neighbors;  // columnwise adjacent

The ICAR prior comes into the model as parameter phi.

model {
  ...
  target += -0.5 * dot_self(phi[neighbors[1]] - phi[neighbors[2]]);  // ICAR prior
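
As a quick illustration (not part of the model code) of the pairwise-difference computation and the non-identifiability noted above, the following numpy snippet shows that the ICAR penalty is unchanged when a constant is added to phi. The toy neighbor array is hypothetical, and uses 0-based indexing rather than Stan's 1-based indexing.

# pairwise-difference ICAR penalty for a toy 4-node line graph (illustration only)
neighbors = np.array([[0, 1, 2],        # first node of each edge
                      [1, 2, 3]])       # second node of each edge
phi = np.array([0.3, -1.2, 0.4, 0.5])

def icar_penalty(phi, neighbors):
    diffs = phi[neighbors[0]] - phi[neighbors[1]]
    return -0.5 * np.dot(diffs, diffs)

print(icar_penalty(phi, neighbors))           # some negative number
print(icar_penalty(phi + 100.0, neighbors))   # identical: adding a constant washes out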

The ICAR model requires that the neighbor graph be fully connected, for two reasons:

  • The joint distribution is computed from the pairwise differences between a node and its neighbors; for singleton nodes, which have no neighbors, this is undefined.

  • Even if the graph doesn’t have any singleton nodes, when the graph has multiple connected components a sum-to-zero constraint on the entire vector fails to properly identify the model.

Because the BYM2 model includes an ICAR component, it too requires a fully connected neighbor graph. In order to apply the BYM2 model to the full NYC dataset, it is necessary to extend the BYM2 model to account for disconnected components and singleton nodes.

First we use the BYM2 model for fully connected graphs to compare implementations of the sum-to-zero constraint on phi; then we extend the BYM2 model following the approach developed for INLA by Freni-Sterrantino et al. (2018) and presented in A note on intrinsic Conditional Autoregressive models for disconnected graphs.

The dataset we’re using is the one from the analysis published in 2019 as Bayesian Hierarchical Spatial Models: Implementing the Besag York Mollié Model in Stan. The data consists of motor vehicle collisions in New York City, as recorded by the NYC Department of Transportation between 2005 and 2014, restricted to collisions involving school-age children (5-18 years of age) as pedestrians. Each crash was localized to the US Census tract in which it occurred, using the 2010 Census block map for New York City. New York City comprises several large and small islands, a chunk of the mainland (the Bronx), and the western end of Long Island (Brooklyn and Queens). Analyzing the full New York City dataset requires the extended BYM2 model, which we do in the second part of this section.

Model 1: The BYM2 model, Riebler et al. 2016

We compare three implementations of the BYM2 model which differ only in their implementation of phi, which is constrained to sum to zero.

BYM2 v1: sum_to_zero_vector

Parameter phi is declared as a sum_to_zero_vector. See file bym2_ozs.stan.

BYM2 v2: Hard sum-to-zero constraint

Parameter phi_raw is a vector of size N-1, and the N-length vector phi is computed in the transformed parameters block. See file bym2_hard.stan.

BYM2 v3: Soft sum-to-zero constraint

Parameter phi is a vector of size N and its sum is soft-centered on \(0\). See file bym2_soft.stan.

Data assembly

The inputs to the BYM2 model are

  • The Poisson regression data

    • int<lower=0> N - number of regions
    • array[N] int<lower=0> y - per-region count outcome
    • vector<lower=0>[N] E - the population of each region (a.k.a. “exposure”),
    • int<lower=1> K - the number of predictors
    • matrix[N, K] xs - the design matrix
  • The spatial structure

    • int<lower = 0> N_edges - the number of neighbor pairs
    • array[2, N_edges] int<lower = 1, upper = N> neighbors - the graph structure
    • real tau - the scaling factor introduced by the BYM2 model (see the note below)
Note

The scaling factor tau was introduced by Riebler et al. so that the variances of the spatial and ordinary random effects are both approximately equal to 1, allowing for a straightforward estimate of the amount of spatial and non-spatial variance. We have written a helper function, get_scaling_factor, in file utils_bym2.py, which takes the neighbor graph as its argument and computes the scaling factor from the corresponding adjacency matrix.
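
The implementation of get_scaling_factor is not shown in this notebook. As a rough sketch of the underlying idea (an assumption on our part, following the Riebler et al. construction rather than the notebook's actual code), the scaling factor can be computed as the geometric mean of the marginal variances implied by the graph Laplacian of a connected component. The function name scaling_factor_from_adjacency and the toy adjacency matrix below are hypothetical.

# rough sketch of the scaling-factor idea (not the notebook's get_scaling_factor;
# assumes the Riebler et al. construction via the graph Laplacian)
def scaling_factor_from_adjacency(A):
    """A: dense symmetric 0/1 adjacency matrix of a connected graph."""
    Q = np.diag(A.sum(axis=1)) - A          # graph Laplacian = ICAR precision (up to scale)
    Sigma = np.linalg.pinv(Q)               # covariance on the sum-to-zero subspace
    return np.exp(np.mean(np.log(np.diag(Sigma))))   # geometric mean of marginal variances

# example on a toy 4-node cycle (hypothetical data)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
print(scaling_factor_from_adjacency(A))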

To assemble the spatial data from the US Census data maps we use a set of utilities described more fully in notebooks 2 and 5 of the geomed_2024 Spatial Data Processing in Stan workshop. The neighbors graph of New York City consists of several large connected components, notably Brooklyn and Queens, the Bronx, Manhattan, and Staten Island, plus a few smaller components and singletons. The helper function nyc_sort_by_comp_size adds component info to the geodataframe and sorts the data by component size.

Get NYC dataset, pre-process as needed for BYM2 model, display neighbors graph.

Show Code
from utils_nyc_map import nyc_sort_by_comp_size
nyc_geodata = gpd.read_file(os.path.join('data', 'nyc_study.geojson'))
(nyc_nbs, nyc_gdf, nyc_comp_sizes) = nyc_sort_by_comp_size(nyc_geodata)

from splot.libpysal import plot_spatial_weights 
print("New York City neighbors graph")
plot_spatial_weights(nyc_nbs, nyc_gdf)
New York City neighbors graph

The BYM2 model requires that the neighbors graph be fully connected; here we use the largest connected component, Brooklyn and Queens.

Show Code
from libpysal.weights import Queen
brklyn_qns_gdf = nyc_gdf[nyc_gdf['comp_id']==0].reset_index(drop=True)
brklyn_qns_nbs = Queen.from_dataframe(brklyn_qns_gdf , geom_col='geometry')
from utils_bym2 import get_scaling_factor, nbs_to_adjlist
brklyn_qns_nbs_adj = nbs_to_adjlist(brklyn_qns_nbs)
tau = get_scaling_factor(brklyn_qns_nbs)

Put all NYC predictors on the same scale - log transform columns as needed.

Show Code
design_vars = np.array(['pct_pubtransit','med_hh_inc', 'traffic', 'frag_index'])
design_mat = brklyn_qns_gdf[design_vars].to_numpy()
design_mat[:, 1] = np.log(design_mat[:, 1])
design_mat[:, 2] = np.log(design_mat[:, 2])

Assemble all inputs into dictionary brklyn_qns_data.

Show Code
brklyn_qns_data = {"N":brklyn_qns_gdf.shape[0],
            "y":brklyn_qns_gdf['count'].astype('int'),
            "E":brklyn_qns_gdf['kid_pop'].astype('int'),
            "K":4,
            "xs":design_mat,
            "N_edges": brklyn_qns_nbs_adj.shape[1],
            "neighbors": brklyn_qns_nbs_adj,
            "tau": tau
}

Model fitting

The BYM2 model requires many warmup iterations in order to reach convergence for all parameters, including hyperparameters rho and sigma. We run all three models using the same seed, in order to make the initial parameters as similar as possible.

Compile and fit all three models to the Brooklyn-Queens data.

Show Code
bym2_ozs_mod = CmdStanModel(stan_file=os.path.join('stan', 'bym2_ozs.stan'))
brklyn_qns_ozs_fit = bym2_ozs_mod.sample(data=brklyn_qns_data, iter_warmup=7000)
a_seed = brklyn_qns_ozs_fit.metadata.cmdstan_config['seed']

bym2_hard_mod = CmdStanModel(stan_file=os.path.join('stan', 'bym2_hard.stan'))
brklyn_qns_hard_fit = bym2_hard_mod.sample(data=brklyn_qns_data, iter_warmup=7000, seed=a_seed)

bym2_soft_mod = CmdStanModel(stan_file=os.path.join('stan', 'bym2_soft.stan'))
brklyn_qns_soft_fit = bym2_soft_mod.sample(data=brklyn_qns_data, iter_warmup=7000, seed=a_seed)

Check implementations

If all three implementations are correct, we expect that they will produce the same estimates for all parameters. As always, we check the R-hat values and effective sample sizes. To do this, we tabulate the summary statistics from each of the three models for the hierarchical parameters beta, sigma, and rho.

Tabulate summary statistics from all fits.

Show Code
brklyn_qns_ozs_summary = brklyn_qns_ozs_fit.summary()
brklyn_qns_ozs_summary.index = brklyn_qns_ozs_summary.index.astype(str) + "  - ozs"

brklyn_qns_hard_summary = brklyn_qns_hard_fit.summary()
brklyn_qns_hard_summary.index = brklyn_qns_hard_summary.index.astype(str) + " - hard"

brklyn_qns_soft_summary = brklyn_qns_soft_fit.summary()
brklyn_qns_soft_summary.index = brklyn_qns_soft_summary.index.astype(str) + " - soft"

brklyn_qns_fits_summary = pd.concat([brklyn_qns_ozs_summary, brklyn_qns_hard_summary, brklyn_qns_soft_summary])
intercept_summary = summarize_predictor(brklyn_qns_fits_summary, 'beta_intercept')
betas_summary = summarize_predictor(brklyn_qns_fits_summary, 'betas')
sigma_summary = summarize_predictor(brklyn_qns_fits_summary, 'sigma')
rho_summary = summarize_predictor(brklyn_qns_fits_summary, 'rho')

brklyn_qns_summary = pd.concat([intercept_summary, betas_summary, sigma_summary, rho_summary])
display(HTML(style_dataframe(brklyn_qns_summary, 3).to_html()))
  Mean StdDev ESS_bulk/s R_hat
beta_intercept - ozs -8.03 0.96 24.30 1.01
beta_intercept - hard -7.95 0.95 7.32 1.01
beta_intercept - soft -8.02 0.94 21.44 1.01
betas[1] - ozs 1.23 0.25 17.26 1.01
betas[1] - hard 1.24 0.26 3.07 1.01
betas[1] - soft 1.22 0.25 12.11 1.01
betas[2] - ozs 0.17 0.08 30.08 1.01
betas[2] - hard 0.16 0.08 8.01 1.01
betas[2] - soft 0.17 0.08 23.70 1.01
betas[3] - ozs 0.09 0.03 69.32 1.00
betas[3] - hard 0.09 0.03 18.11 1.00
betas[3] - soft 0.10 0.03 52.17 1.00
betas[4] - ozs 0.04 0.02 38.91 1.00
betas[4] - hard 0.03 0.02 8.13 1.00
betas[4] - soft 0.04 0.02 33.27 1.00
sigma - ozs 0.75 0.03 7.62 1.03
sigma - hard 0.75 0.03 2.10 1.02
sigma - soft 0.75 0.03 4.46 1.03
rho - ozs 0.40 0.07 5.29 1.04
rho - hard 0.39 0.07 1.46 1.02
rho - soft 0.39 0.07 3.29 1.04

The sum_to_zero_vector implementation is the most efficient. In contrast to the outcomes for the binomial model above, the soft sum-to-zero constraint is more efficient than the hard sum-to-zero constraint.

Model 2: The BYM2_multicomp model, Freni-Sterrantino et al. 2018

In order to apply the BYM2 model to the full NYC dataset, it is necessary to extend the BYM2 model to account for disconnected components and singleton nodes. (The alternative is to add edges to the neighbor graphs so that the neighbors graph is a single connected component. This is problematic, see Notebook 6 The BYM2_multicomp model in Stan for details.)

In A note on intrinsic Conditional Autoregressive models for disconnected graphs (2018), Freni-Sterrantino et al. show how to extend the BYM2 model to handle disconnected components and singleton nodes. They provide the following recommendations:

  • Non-singleton nodes are given the BYM2 prior.
  • Singleton nodes (islands) are given a standard normal prior.
  • The scaling factor is computed per connected component.
  • A sum-to-zero constraint is imposed on each connected component.

We have followed these recommendations and implemented this model in Stan. The full model is in file bym2_multicomp.stan.

For this case study, we compare two implementations of the BYM2_multicomp model: one which uses the sum_to_zero_vector and one which implements the soft sum-to-zero constraint.

It is necessary to constrain the elements of the spatial effects vector phi on a component-by-component basis. Stan’s slicing with range indexes provides a way to efficiently access each component. The helper function nyc_sort_by_comp_size both sorts the study data by component and adds the component index to the geodataframe.

In the BYM2 model for a fully connected graph the sum-to-zero constraint on phi is implemented directly by declaring phi to be a sum_to_zero_vector, which is a constrained parameter type. The declaration:

  sum_to_zero_vector[N] phi;  // spatial effects

creates a constrained variable of length \(N\), with a corresponding unconstrained variable of length \(N-1\).

In order to constrain slices of the parameter vector phi, we do the following:

  • In the parameters block, we declare the unconstrained parameter phi_raw as a regular vector (instead of a sum_to_zero_vector).
    • For a fully connected graph of size \(N\), the size of the unconstrained sum-to-zero vector is \(N-1\). For a disconnected graph with \(M\) non-singleton nodes, the size of phi_raw is \(M\) minus the number of connected components.
  vector[N_connected - N_components] phi_raw;  // spatial effects
  • In the functions block, we implement the constraining transform.

  • In the transformed parameters block, we apply the constraining transform.

  vector[N_connected] phi = zero_sum_components(phi_raw, component_idxs, component_sizes);

The constraining transform is broken into two functions:

  • function zero_sum_constrain, the actual constraining transform, which corresponds directly to the built-in sum_to_zero_vector transform.

  • function zero_sum_components, which handles the slicing and calls zero_sum_constrain on each component.

  /**
   * Constrain sum-to-zero vector
   *
   * @param y unconstrained zero-sum parameters
   * @return vector z, the constrained vector, which sums to zero
   */
  vector zero_sum_constrain(vector y) {
    int N = num_elements(y);
    vector[N + 1] z = zeros_vector(N + 1);
    real sum_w = 0;
    for (ii in 1:N) {
      int i = N - ii + 1; 
      real n = i;
      real w = y[i] * inv_sqrt(n * (n + 1));
      sum_w += w;
      z[i] += sum_w;     
      z[i + 1] -= w * n;    
    }
    return z;
  }
  • zero_sum_components: slices vector phi by component, applies constraining transform to each.
  /**
   * Component-wise constrain sum-to-zero vectors
   *
   * @param phi unconstrained vector of zero-sum slices
   * @param idxs component start and end indices
   * @param sizes component sizes
   * @return vector phi_ozs, the vector whose slices sum to zero
   */
  vector zero_sum_components(vector phi,
                                array[ , ] int idxs,
                                array[] int sizes) {
    vector[sum(sizes)] phi_ozs;
    int idx_phi = 1;
    int idx_ozs = 1;
    for (i in 1:size(sizes)) {
      phi_ozs[idx_ozs : idx_ozs + sizes[i] - 1] =
        zero_sum_constrain(segment(phi, idx_phi, sizes[i] - 1));
      idx_phi += sizes[i] - 1;
      idx_ozs += sizes[i];
    }
    return phi_ozs;
  }
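
As a check on the Stan functions above, here is a direct numpy port (for illustration and testing only; it is not used by the notebook). The unused idxs argument of the Stan version is dropped, and the component sizes in the check are hypothetical.

# numpy port of the Stan constraining transforms above (for checking only)
def zero_sum_constrain(y):
    """Map an unconstrained vector of length N to a length N+1 vector that sums to zero."""
    N = len(y)
    z = np.zeros(N + 1)
    sum_w = 0.0
    for i in range(N, 0, -1):                      # i = N, N-1, ..., 1, as in the Stan loop
        w = y[i - 1] / np.sqrt(i * (i + 1.0))
        sum_w += w
        z[i - 1] += sum_w
        z[i] -= w * i
    return z

def zero_sum_components(phi_raw, sizes):
    """Apply zero_sum_constrain to consecutive slices of phi_raw, one slice per component."""
    out, idx = [], 0
    for size in sizes:
        out.append(zero_sum_constrain(phi_raw[idx : idx + size - 1]))
        idx += size - 1
    return np.concatenate(out)

# check that each component slice of the constrained vector sums to zero
rng = np.random.default_rng(seed=1234)
sizes = [4, 3, 5]                                  # hypothetical component sizes
phi = zero_sum_components(rng.normal(size=sum(sizes) - len(sizes)), sizes)
starts = np.concatenate(([0], np.cumsum(sizes)[:-1]))
print([round(phi[s : s + n].sum(), 12) for s, n in zip(starts, sizes)])   # all zeros
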
Note

The constraining transform is a linear operation, leading to a constant Jacobian determinant which is therefore not included. As of Stan 2.36, transforms which include a Jacobian adjustment can do so with the jacobian += statement and must have names ending in _jacobian. See section functions implementing change-of-variable adjustments in the Stan User’s Guide chapter on user-defined functions for details.

Data Assembly

Get NYC dataset, pre-process as needed for BYM2_multicomp model

Show Code
from utils_nyc_map import nyc_sort_by_comp_size
from utils_bym2 import nbs_to_adjlist, get_scaling_factors
nyc_geodata = gpd.read_file(os.path.join('data', 'nyc_study.geojson'))
(nyc_nbs, nyc_gdf, nyc_comp_sizes) = nyc_sort_by_comp_size(nyc_geodata)
nyc_nbs_adj = nbs_to_adjlist(nyc_nbs)
component_sizes = [x for x in nyc_comp_sizes if x > 1]
scaling_factors = get_scaling_factors(len(component_sizes), nyc_gdf)

Put all NYC predictors on the same scale - log transform columns as needed.

Show Code
design_vars = np.array(['pct_pubtransit','med_hh_inc', 'traffic', 'frag_index'])
design_mat = nyc_gdf[design_vars].to_numpy()
design_mat[:, 1] = np.log(design_mat[:, 1])
design_mat[:, 2] = np.log(design_mat[:, 2])

Assemble all inputs into dictionary bym2_multicomp_data.

Show Code
bym2_multicomp_data = {
    "N":nyc_gdf.shape[0],
    "y":nyc_gdf['count'].astype('int'),
    "E":nyc_gdf['kid_pop'].astype('int'),
    "K":4,
    "xs":design_mat,
    "N_edges": nyc_nbs_adj.shape[1],
    "neighbors": nyc_nbs_adj,
    "N_components": len(component_sizes),
    "component_sizes": component_sizes,
    "scaling_factors": scaling_factors
}

Model Fitting

The BYM2_multicomp model requires many warmup iterations in order to reach convergence for all parameters, including hyperparameters rho and sigma, and a larger set of draws from the posterior in order to obtain satisfactory R-hat values.

Compile and fit the BYM2_multicomp model using the sum_to_zero_vector constraining transform.

Show Code
bym2_multicomp_ozs_file = os.path.join('stan', 'bym2_multicomp.stan')
bym2_multicomp_ozs_mod = CmdStanModel(stan_file=bym2_multicomp_ozs_file)
bym2_multicomp_ozs_fit = bym2_multicomp_ozs_mod.sample(data=bym2_multicomp_data, iter_warmup=9000)

Tabulate summary statistics.

Show Code
bym2_multicomp_ozs_summary = bym2_multicomp_ozs_fit.summary()
bym2_multicomp_ozs_summary[['Mean', 'StdDev', 'ESS_bulk/s', 'R_hat']].round(2).loc[
  ['beta_intercept', 'betas[1]', 'betas[2]', 'betas[3]', 'betas[4]', 'sigma', 'rho']]
Mean StdDev ESS_bulk/s R_hat
beta_intercept -5.78 0.55 10.51 1.01
betas[1] 0.82 0.16 9.86 1.01
betas[2] 0.04 0.05 9.22 1.01
betas[3] 0.03 0.02 23.19 1.00
betas[4] 0.04 0.01 13.38 1.01
sigma 0.78 0.02 3.71 1.01
rho 0.39 0.06 2.70 1.02

Model Comparison

We can compare the sum_to_zero_vector implementation to the corresponding soft sum-to-zero constraint. The full model is in file bym2_multicomp_soft.stan.

In model bym2_soft.stan the soft sum-to-zero constraint is combined directly with the ICAR prior:

  target += (-0.5 * dot_self(phi[neighbors[1]] - phi[neighbors[2]])
         + normal_lupdf(sum(phi) | 0, 0.001 * rows(phi)));

For the BYM2_multicomp model, this operation is carried out in two steps. First the ICAR prior is applied to phi; next we iterate through the components, applying the sum-to-zero constraint to each in turn.

  target += -0.5 * dot_self(phi[neighbors[1]] - phi[neighbors[2]]);  // ICAR
  for (n in 1:N_components) {   // component-wise sum-to-zero constraint
    sum(phi[node_idxs[n, 1] : node_idxs[n, 2]]) ~ normal(0,
                             0.001 * component_sizes[n]);
  }

The data inputs are the same. To ensure (roughly) the same initialization, we reuse the seed from bym2_multicomp_ozs_fit. This model fits very slowly and requires increasing the max_treedepth; at the default setting, all iterations hit this limit.

Compile and fit the BYM2_multicomp model using the soft sum-to-zero constraint.

Show Code
a_seed = bym2_multicomp_ozs_fit.metadata.cmdstan_config['seed']
bym2_multicomp_soft_file = os.path.join('stan', 'bym2_multicomp_soft.stan')
bym2_multicomp_soft_mod = CmdStanModel(stan_file=bym2_multicomp_soft_file)
bym2_multicomp_soft_fit = bym2_multicomp_soft_mod.sample(data=bym2_multicomp_data, max_treedepth=13, iter_warmup=9000, seed=a_seed)

The following table compares the results for the hierarchical parameters beta, rho, and sigma.

Tabulate summary statistics from both fits.

Show Code
bym2_multicomp_ozs_summary.index =  bym2_multicomp_ozs_summary.index.astype(str) + "  a) ozs"
bym2_multicomp_soft_summary = bym2_multicomp_soft_fit.summary()
bym2_multicomp_soft_summary.index =  bym2_multicomp_soft_summary.index.astype(str) + "  b) soft"
bym2_multicomp_summary = pd.concat([bym2_multicomp_ozs_summary, bym2_multicomp_soft_summary])

intercept_summary = summarize_predictor(bym2_multicomp_summary, 'beta_intercept')
betas_summary = summarize_predictor(bym2_multicomp_summary, 'betas')
sigma_summary = summarize_predictor(bym2_multicomp_summary, 'sigma')
rho_summary = summarize_predictor(bym2_multicomp_summary, 'rho')
nyc_summary = pd.concat([intercept_summary, betas_summary, sigma_summary, rho_summary])

display(HTML(style_dataframe(nyc_summary, 2).to_html()))
  Mean StdDev ESS_bulk/s R_hat
beta_intercept a) ozs -5.78 0.55 10.51 1.01
beta_intercept b) soft -5.78 0.57 0.60 1.01
betas[1] a) ozs 0.82 0.16 9.86 1.01
betas[1] b) soft 0.82 0.15 0.48 1.01
betas[2] a) ozs 0.04 0.05 9.22 1.01
betas[2] b) soft 0.04 0.05 0.46 1.01
betas[3] a) ozs 0.03 0.02 23.19 1.00
betas[3] b) soft 0.03 0.02 1.10 1.00
betas[4] a) ozs 0.04 0.01 13.38 1.01
betas[4] b) soft 0.04 0.01 0.93 1.00
sigma a) ozs 0.78 0.02 3.71 1.01
sigma b) soft 0.78 0.02 0.19 1.01
rho a) ozs 0.39 0.06 2.70 1.02
rho b) soft 0.38 0.06 0.12 1.01

Discussion

The BYM2 model has a relatively complex multilevel structure, and the BYM2_multicomp model is even more complex. Before Stan 2.36, the recommended implementation was to use the soft sum-to-zero constraint for the BYM2 model.

For the BYM2_multicomp model, model bym2_multicomp.stan shows how to implement the sum_to_zero_vector constraining transform as a Stan function. Both implementations produce essentially the same estimates, confirming that both models are correctly implemented. Fitting the extended BYM2_multicomp model with a soft sum-to-zero constraint is painfully slow. Here we show that the sum_to_zero_vector greatly improves the run time, which is critical because fitting this particular dataset requires a very large number of warmup iterations.

Conclusion: the sum_to_zero_vector just works!

The more complex the model, the greater the need for the sum_to_zero_vector. When considering effective sample size, remember that the relevant measure is effective samples per second. In these experiments, the sum_to_zero_vector models consistently have the highest ESS/sec.

Key takeaways:

  • In the first part of this case study, we show how to use the sum_to_zero_vector for categorical predictors in a hierarchical model, and how to adjust the marginal variances for a normal prior.

  • In the second part of this case study we see how to implement the constraining transform used for the sum_to_zero_vector as a Stan function in order to handle the case where a vector is the concatenation of several sum-to-zero vectors.

References