The sum_to_zero_vector[K] type produces a vector of size K such that sum(beta) = 0. The unconstrained representation requires only K - 1 values, because the last element is determined by the first K - 1.
If you put a normal prior on the zero-sum vector, the resulting marginal variance will be less than the intended normal variance. To get the same variance as the intended normal prior, scale the prior's standard deviation,
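for example (a minimal sketch, assuming beta is declared as sum_to_zero_vector[N]):

beta ~ normal(0, sigma * sqrt(N / (N - 1.0)));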
where sigma is the intended standard deviation. (Note: it is slightly more efficient to pre-compute sqrt(N / (N - 1.0)) in the transformed data block.)
Prior to Stan 2.36, a sum-to-zero constraint could be implemented either
as a “hard” sum-to-zero constraint, where the parameter is declared as an \(N-1\)-length vector, with a corresponding \(N\)-length transformed parameter whose first \(N-1\) elements equal the parameter vector and whose \(N^{th}\) element is the negative sum of the first \(N-1\) elements.
as a “soft” sum-to-zero constraint, where the parameter is an \(N\)-length vector with a prior on the sum of its elements that keeps the sum within \(\epsilon\) of \(0\). Both patterns are sketched below.
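A minimal sketch of both patterns for a generic parameter x of length N; the tolerance in the soft version is illustrative, not a recommended value.

// hard sum-to-zero constraint
parameters {
  vector[N - 1] x_raw;
}
transformed parameters {
  vector[N] x = append_row(x_raw, -sum(x_raw));  // last element balances the rest
}

// soft sum-to-zero constraint
parameters {
  vector[N] x;
}
model {
  sum(x) ~ normal(0, 0.001 * N);  // keeps the sum within a small tolerance of zero
}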
The choice between the two was largely an empirical one. As a general rule, for small vectors, the hard sum-to-zero constraint is more efficient; for larger vectors, the soft sum-to-zero constraint is faster, but much depends on the specifics of the model and the data.
The hard sum-to-zero constraint is often problematic. For small \(N\) and models with sensible priors, it is usually satisfactory. But as the size of the vector grows, it distorts the marginal variance of the \(N^{th}\) element. Given a parameter vector: \[
x_1, x_2, \dots, x_{N-1} \sim \text{i.i.d. } N(0, \sigma^2)
\] by the properties of independent normal variables, each of the free elements \(x_1, \ldots, x_{N-1}\) has variance \(\sigma^2\). However, the \(N^{th}\) element is defined deterministically as: \[
x_N = -\sum_{i=1}^{N-1} x_i
\] and its variance is inflated by a factor of \(N-1\). \[
\operatorname{Var}(x_N) = \operatorname{Var}\Bigl(-\sum_{i=1}^{N-1} x_i\Bigr)
= \sum_{i=1}^{N-1} \operatorname{Var}(x_i)
= (N-1)\sigma^2.
\] For large vectors, MCMC samplers struggle with the hard sum-to-zero constraint, as every change to any of the \(N-1\) elements also requires a corresponding change to the \(N^{th}\) element; balancing these changes introduces potential non-identifiabilities.
The soft sum-to-zero constraint is problematic for the following reasons.
The tolerance \(\epsilon\) (the scale of the penalty) must be chosen by the analyst. If it is too large, the sum can drift too far from zero for the constraint to be effective; if it is too small, the sampler struggles to satisfy the constraint.
The soft constraint only penalizes deviations from zero, leading to weaker identifiability of the parameters. This can lead to slow convergence and mixing, as the sampler explores nearly non-identified regions.
The marginal variances may not reflect the intended prior.
The sum_to_zero_vector transform ensures that each element of the resulting constrained vector has the same marginal variance. This improves sampler performance, providing fast computation and good effective sample sizes, and the improvement becomes increasingly noticeable as models grow in size and complexity. To demonstrate this, in this notebook we consider two different classes of models:
Multi-level regressions for binomial data with group-level categorical predictors.
Spatial models for areal data.
For these models, we provide three implementations which differ only in the implementation of the sum-to-zero constraint: the built-in sum_to_zero_vector, and the hard and soft sum-to-zero implementations. We fit each model to the same dataset, using the same random seed, and then compare the summary statistics for the constrained parameter values. Since the models are nearly equivalent except in the implied prior distributions on the vector, we expect that all three implementations should produce nearly equivalent estimates; what differs is the speed of computation, as measured by effective samples per second.
Multi-level Models with Group-level Categorical Predictors
In this section we consider a model which estimates per-demographic disease prevalence rates for a population. The model is taken from Gelman and Carpenter's 2020 paper, Bayesian Analysis of Tests with Unknown Specificity and Sensitivity. It combines a model for multilevel regression and post-stratification with a likelihood that accounts for test sensitivity and specificity.
The data consists of:
A set of per-demographic aggregated outcomes of a diagnostic test procedure, with an unequal number of tests per demographic.
A corresponding set of demographic descriptors encoded as a vector of categorical values. In this example these are named sex, age, eth, and edu, but there can be any number of demographic predictors with any number of categories.
The specified test sensitivity and specificity.
Putting a sum-to-zero constraint on the coefficients of a categorical predictor allows us to model all categories as offsets from the mean. We prefer this to the alternative of dropping a reference category or using some other contrast-coding strategy.
The Stan model
The full model is in file binomial_4_preds_ozs.stan. It provides an estimate of the true prevalence based on binary tests with a given (or unknown) test sensitivity and specificity, and uses a sum_to_zero_vector to constrain the group-level parameters age, eth, and edu.
In order to put a standard normal prior on beta_age, beta_eth, and beta_edu, we need to scale the variance, as suggested above. The scaling factors are pre-computed in the transformed data block, and applied as part of the prior.
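A sketch of what this looks like, using hypothetical dimension names (N_age, N_eth, N_edu); the actual variable names in binomial_4_preds_ozs.stan may differ.

transformed data {
  // scale factors so each zero-sum coefficient has unit marginal variance (standard normal prior)
  real s_age = sqrt(N_age / (N_age - 1.0));
  real s_eth = sqrt(N_eth / (N_eth - 1.0));
  real s_edu = sqrt(N_edu / (N_edu - 1.0));
}
parameters {
  sum_to_zero_vector[N_age] beta_age;
  sum_to_zero_vector[N_eth] beta_eth;
  sum_to_zero_vector[N_edu] beta_edu;
}
model {
  beta_age ~ normal(0, s_age);
  beta_eth ~ normal(0, s_eth);
  beta_edu ~ normal(0, s_edu);
}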
To investigate the predictive behavior of this model at different timepoints in a pandemic, we have written a data-generating program to create datasets given the baseline disease prevalence, test specificity and sensitivity, and the desired number of diagnostic tests. In the generated quantities block we use Stan’s PRNG functions to populate the true weights for the categorical coefficient vectors and the relative percentages of per-category observations. Then we use a set of nested loops to generate the data for each demographic, using the PRNG equivalent of the model likelihood.
The full data-generating program is in file gen_binomial_4_preds.stan. The helper function simulate_data in file utils.py sets up the data-generating program according to the specified number of demographic categories, the approximate average number of observations per category, and the baseline disease prevalence, test specificity, and sensitivity. This allows us to create datasets for large and small populations and for finer- or coarser-grained sets of categories. The larger the number of strata overall, the more observations are needed to get good coverage. Because the modeled data pos_tests is generated according to the Stan model’s likelihood, the model is a priori well-specified with respect to the data.
Note
Because the data-generating parameters and percentage of observations per category are generated at random, some datasets may have very low overall disease rates and/or many unobserved strata. While this is informative for understanding what is consistent with the generating distributions, it results in estimates of varying effect parameters with near-zero values, which are on the boundary of parameter space, and hence very challenging to fit. Because we wish to compare baseline performance, we have cherry-picked a seed which generates datasets that avoid this issue.
The data generating program creates unequal numbers of tests per demographic. The following tables summarize the resulting distribution.
print("Small Dataset")print("tests total "+str(np.sum(data_small['tests'])))small_df = pd.DataFrame({'tests per demographic': data_small['tests']}).describe()small_df.drop('count', inplace=True)small_df
Small Dataset
tests total 4451
tests per demographic
  mean     16.49
  std      18.51
  min       0.00
  25%       5.00
  50%      10.00
  75%      21.75
  max     119.00
print("Large Dataset")print("tests total "+str(np.sum(data_large['tests'])))large_df = pd.DataFrame({'tests per demographic': data_large['tests']}).describe()large_df.drop('count', inplace=True)large_df
Large Dataset
tests total 53863
tests per demographic
  mean     199.49
  std      218.00
  min        4.00
  25%       61.25
  50%      123.00
  75%      258.50
  max     1401.00
print("Tiny Dataset")print("tests total "+str(np.sum(data_tiny['tests'])))tiny_df = pd.DataFrame({'tests per demographic': data_tiny['tests']}).describe()tiny_df.drop('count', inplace=True)tiny_df
Tiny Dataset
tests total 947
tests per demographic
  mean      3.51
  std       4.38
  min       0.00
  25%       1.00
  50%       2.00
  75%       5.00
  max      28.00
Model Fitting
We fit each model to the large and small datasets. We record the seed used for the first run and use it for all subsequent fits; this ensures that the models have the same set of initial parameter values for all parameters except the constrained parameters beta_eth, beta_edu, and beta_age.
If all three implementations are correct, we expect that they will produce the same estimates for all parameters. As always, we check the R-hat values and effective sample sizes. To do this, we tabulate the summary statistics from each of the three models, fit to the large and small datasets.
All models have R-hat values of 1.00 for all group-level parameters and high effective sample sizes.
The models recover the sign of each parameter, but not the exact value. With many observations and only a few categories, they do a better job of recovering the true parameters.
In almost all cases, estimates for each parameter are the same across implementations to 2 significant figures. In a few cases they are off by 0.01; where they are off, the percentage of observations for that parameter is correspondingly low.
The sum_to_zero_vector implementation has the highest number of effective samples per second, excepting a few individual parameters for which the hard sum-to-zero performs equally well.
Model efficiency
Model efficiency is often reported as iterations per second; however, because draws from the MCMC sampler may be correlated, the more meaningful measure is the number of effective samples across all chains divided by the total sampling time, i.e., ESS_bulk/s, the effective samples per second. The following table shows the average runtime for 100 runs of each of the three models on the large and small datasets. This data was generated by the script eval_efficiencies.py.
The sum_to_zero_vector is far more efficient than either the hard or soft sum-to-zero constraint.
Prior predictive checks
Prior predictive checks simulate data directly from the prior, in the absence of any observed data. Here we use the simulated dataset to examine the prior marginal variances of the elements of the sum-to-zero constrained parameters under the hard sum-to-zero constraint and the built-in sum_to_zero_vector transform.
The above summaries show how the hard sum-to-zero constraint distorts the variance of the \(N^{th}\) element. This is only a problem for very sparse datasets, where the prior swamps the data; with enough data the problem goes away. To see this, we fit the hard sum-to-zero constraint model on the tiny and small datasets.
Marginal variances of the hard sum-to-zero constraint, tiny dataset
Comparison of these three models with different data regimes demonstrates the following.
For a multi-level model with group-level categorical predictors the sum_to_zero_vector provides fast results and good effective sample sizes for both datasets.
Model binomial_4_preds_ozs.stan shows how to properly scale the variance of a sum_to_zero_vector constrained parameter in order to put a standard normal prior on it.
Prior predictive checks demonstrate the difference between the marginal variances of the sum_to_zero_vector and hard sum-to-zero implementations.
Spatial Models with an ICAR component
Spatial auto-correlation is the tendency for adjacent areas to share similar characteristics. Conditional Auto-Regressive (CAR) and Intrinsic Conditional Auto-Regressive (ICAR) models, first introduced by Besag, 1974, account for this by pooling information from neighboring regions. The BYM model (Besag, York, Mollié, 1991) extends a lognormal Poisson model with an ICAR component for spatial auto-correlation by adding an ordinary random-effects component for non-spatial heterogeneity. The BYM2 model builds on this model and subsequent refinements.
The ICAR, BYM2, and BYM2_multicomp models are explained more fully in a series of notebooks available from the GitHub repo https://github.com/mitzimorris/geomed_2024.
The key element of the BYM2 model is the ICAR component. Its conditional specification is a multivariate normal random vector \(\mathbf{\phi}\) where each \({\phi}_i\) is conditional on the values of its neighbors.
The joint specification can be rewritten in pairwise-difference form,
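which, up to a constant of proportionality (written here for unit scale, a standard form of the ICAR density), is \[
p(\phi) \propto \exp\Bigl(-\frac{1}{2}\sum_{i \sim j} \bigl(\phi_i - \phi_j\bigr)^2\Bigr),
\] where \(i \sim j\) ranges over all pairs of neighboring regions.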
Each \({({\phi}_i - {\phi}_j)}^2\) contributes a penalty term based on the distance between the values of neighboring regions. However, \(\phi\) is non-identifiable: a constant added to every element of \(\phi\) washes out of the differences \({\phi}_i - {\phi}_j\). Therefore, a sum-to-zero constraint is needed to both identify and center \(\phi\).
The Stan implementation of the ICAR component computes the sum of the pairwise distances by representing the spatial adjacency matrix as an array of pairs of neighbor indices.
data {
  ...
  // spatial structure
  int<lower = 0> N_edges;                                 // number of neighbor pairs
  array[2, N_edges] int<lower = 1, upper = N> neighbors;  // columnwise adjacent
The ICAR prior comes into the model as parameter phi.
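In pairwise-difference form, this amounts to a single vectorized statement in the model block; a sketch, using phi and the neighbors array from the data block above:

// ICAR prior: sum of squared differences between neighboring regions
target += -0.5 * dot_self(phi[neighbors[1]] - phi[neighbors[2]]);
// plus a sum-to-zero constraint on phi (built-in, hard, or soft)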
The ICAR model requires that the neighbor graph is fully connected for two reasons:
The joint distribution is computed from the pairwise differences between a node and its neighbors; singleton nodes have no neighbors and are therefore undefined.
Even if the graph doesn’t have any singleton nodes, when the graph has multiple connected components a sum-to-zero constraint on the entire vector fails to properly identify the model.
Because the BYM2 model includes an ICAR component, it too requires a fully connected neighbor graph. In order to apply the BYM2 model to the full NYC dataset, it is necessary to extend the BYM2 model to account for disconnected components and singleton nodes.
First we use the BYM2 model for fully connected graphs to compare implementations of the sum-to-zero vector on phi, and then we extend the BYM2 model following the INLA implementation developed by Freni-Sterrantino et al. in 2018 and presented in A note on intrinsic Conditional Autoregressive models for disconnected graphs.
The dataset we’re using is the one analyzed in the 2019 paper Bayesian Hierarchical Spatial Models: Implementing the Besag York Mollié Model in Stan. The data consists of motor vehicle collisions in New York City, as recorded by the NYC Department of Transportation between the years 2005-2014, restricted to collisions involving school-age children 5-18 years of age as pedestrians. Each crash was localized to the US Census tract in which it occurred, using tract boundaries from the 2010 United States Census block map for New York City. New York City is comprised of several large and small islands, a chunk of the mainland (the Bronx), and the western end of Long Island (Brooklyn and Queens). Analyzing the full New York City dataset requires the extended BYM2 model, which we do in the second part of this section.
Model 1: The BYM2 model, Riebler et al. 2016
We compare three implementations of the BYM2 model which differ only in their implementation of phi, which is constrained to sum to zero.
BYM2 v1: sum_to_zero_vector
Parameter phi is declared as a sum_to_zero_vector. See file bym2_ozs.stan.
BYM2 v2: Hard sum-to-zero constraint
Parameter phi_raw is a vector of size N-1, and the N-length vector phi is computed in the transformed parameters block. See file bym2_hard.stan.
BYM2 v3: soft sum-to-zero constraint
Parameter phi is a vector of size N and its sum is soft-centered on \(0\). See file bym2_soft.stan.
Data assembly
The inputs to the BYM2 model are as follows; a sketch of the corresponding data block appears after the list.
The Poisson regression data
int<lower=0> N - number of regions
array[N] int<lower=0> y - per-region count outcome
vector<lower=0>[N] E - the population of each region (a.k.a. “exposure”),
int<lower=1> K - the number of predictors
matrix[N, K] xs - the design matrix
The spatial structure
int<lower = 0> N_edges - the number of neighbor pairs
real tau - the scaling factor introduced in the BYM2 model (see the note below)
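Collected into a Stan data block, a sketch of these inputs (the neighbors declaration matches the excerpt shown earlier):

data {
  // Poisson regression data
  int<lower=0> N;                       // number of regions
  array[N] int<lower=0> y;              // per-region count outcome
  vector<lower=0>[N] E;                 // exposure (population of each region)
  int<lower=1> K;                       // number of predictors
  matrix[N, K] xs;                      // design matrix
  // spatial structure
  int<lower=0> N_edges;                 // number of neighbor pairs
  array[2, N_edges] int<lower=1, upper=N> neighbors;  // neighbor pairs, stored columnwise
  real tau;                             // BYM2 scaling factor
}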
Note
The scaling factor tau was introduced by Riebler et al. so that the variances of the spatial and ordinary random-effects components are both approximately equal to 1, thus allowing a straightforward estimate of the relative amounts of spatial and non-spatial variance. We have written a helper function, get_scaling_factor in file utils_bym2.py, which takes the neighbor graph as its argument and computes the geometric mean of the marginal variances of the ICAR model defined by the corresponding adjacency matrix.
To assemble the spatial data from the US Census data maps we use a set of utilities described more fully in notebooks 2 and 5 of the geomed_2024 Spatial Data Processing in Stan workshop. The neighbors graph of New York City consists of several large connected components, notably Brooklyn and Queens, the Bronx, Manhattan, and Staten Island, plus a few smaller components and singletons. The helper function nyc_sort_by_comp_size adds component information to the geodataframe and sorts the data by component size.
Get NYC dataset, pre-process as needed for BYM2 model, display neighbors graph.
from utils_nyc_map import nyc_sort_by_comp_size
from splot.libpysal import plot_spatial_weights

nyc_geodata = gpd.read_file(os.path.join('data', 'nyc_study.geojson'))
(nyc_nbs, nyc_gdf, nyc_comp_sizes) = nyc_sort_by_comp_size(nyc_geodata)

print("New York City neighbors graph")
plot_spatial_weights(nyc_nbs, nyc_gdf)
New York City neighbors graph
The BYM2 model requires that the neighbors graph be fully connected; here we use the largest connected component, Brooklyn and Queens.
The BYM2 model requires many warmup iterations in order to reach convergence for all parameters, including hyperparameters rho and sigma. We run all three models using the same seed, in order to make the initial parameters as similar as possible.
Compile and fit all three models to the Brooklyn-Queens data.
If all three implementations are correct, we expect that they will produce the same estimates for all parameters. As always, we check the R-hat values and effective sample sizes. To do this, we tabulate the summary statistics from each of the three models for the hierarchical parameters beta, sigma, and rho.
The sum_to_zero_vector implementation is the most efficient. In contrast to the outcomes for the binomial model above, the soft sum-to-zero constraint is more efficient than the hard sum-to-zero constraint.
Model 2: The BYM2_multicomp model, Freni-Sterrantino et al, 2018
In order to apply the BYM2 model to the full NYC dataset, it is necessary to extend the BYM2 model to account for disconnected components and singleton nodes. (The alternative is to add edges to the neighbor graph so that it forms a single connected component; this is problematic, see Notebook 6, The BYM2_multicomp model in Stan, for details.) Freni-Sterrantino et al. make the following recommendations:
Singleton nodes (islands) are given a standard Normal prior
Compute per-connected component scaling factor
Impose a sum-to-zero constraint on each connected component
We have followed these recommendations and implemented this model in Stan. The full model is in file bym2_multicomp.stan.
For this case study, we compare two implementations of the BYM2_multicomp model: one which uses the sum_to_zero_vector and one which implements the soft sum-to-zero constraint.
It is necessary to constrain the elements of the spatial effects vector phi on a component-by-component basis. Stan’s slicing with range indexes provides a way to efficiently access each component. The helper function nyc_sort_by_comp_size both sorts the study data by component and adds the component index to the geodataframe.
In the BYM2 model for a fully connected graph the sum-to-zero constraint on phi is implemented directly by declaring phi to be a sum_to_zero_vector, which is a constrained parameter type. The declaration:
sum_to_zero_vector[N] phi; // spatial effects
creates a constrained variable of length \(N\), with a corresponding unconstrained variable of length \(N-1\).
In order to constrain slices of the parameter vector phi, we do the following:
In the parameters block, we declare the unconstrained parameter phi_raw as a regular vector (instead of a sum_to_zero_vector); a sketch of the declarations appears after the transform functions below.
For a fully connected graph of size \(N\), the size of the unconstrained sum-to-zero vector is \(N-1\). For a disconnected graph with \(M\) non-singleton nodes, the size of phi_raw is \(M\) minus the number of connected components.
The constraining transform is broken into two functions:
function zero_sum_constrain, the actual constraining transform, which corresponds directly to the transform used by the built-in sum_to_zero_vector type.
function zero_sum_components, which handles the slicing and calls zero_sum_constrain on each component.
/**
 * Constrain sum-to-zero vector
 *
 * @param y unconstrained zero-sum parameters
 * @return vector z, the vector whose slices sum to zero
 */
vector zero_sum_constrain(vector y) {
  int N = num_elements(y);
  vector[N + 1] z = zeros_vector(N + 1);
  real sum_w = 0;
  for (ii in 1:N) {
    int i = N - ii + 1;
    real n = i;
    real w = y[i] * inv_sqrt(n * (n + 1));
    sum_w += w;
    z[i] += sum_w;
    z[i + 1] -= w * n;
  }
  return z;
}
zero_sum_components: slices vector phi by component, applies constraining transform to each.
/**
 * Component-wise constrain sum-to-zero vectors
 *
 * @param phi unconstrained vector of zero-sum slices
 * @param idxs component start and end indices
 * @param sizes component sizes
 * @return vector phi_ozs, the vector whose slices sum to zero
 */
vector zero_sum_components(vector phi,
                           array[ , ] int idxs,
                           array[] int sizes) {
  vector[sum(sizes)] phi_ozs;
  int idx_phi = 1;
  int idx_ozs = 1;
  for (i in 1:size(sizes)) {
    phi_ozs[idx_ozs : idx_ozs + sizes[i] - 1] =
      zero_sum_constrain(segment(phi, idx_phi, sizes[i] - 1));
    idx_phi += sizes[i] - 1;
    idx_ozs += sizes[i];
  }
  return phi_ozs;
}
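A sketch of how these functions might be wired together, assuming the component index array idxs and the component sizes sizes are passed in as data; this is illustrative, see bym2_multicomp.stan for the actual declarations.

parameters {
  // one unconstrained element fewer per connected component
  vector[sum(sizes) - size(sizes)] phi_raw;
}
transformed parameters {
  vector[sum(sizes)] phi = zero_sum_components(phi_raw, idxs, sizes);
}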
Note
The constraining transform is a linear operation, leading to a constant Jacobian determinant which is therefore not included. As of Stan 2.36, transforms which include a Jacobian adjustment can do so with the jacobian += statement and must have names ending in _jacobian. See section functions implementing change-of-variable adjustments in the Stan User’s Guide chapter on user-defined functions for details.
Data Assembly
Get NYC dataset, pre-process as needed for BYM2_multicomp model
from utils_nyc_map import nyc_sort_by_comp_size
from utils_bym2 import nbs_to_adjlist, get_scaling_factors

nyc_geodata = gpd.read_file(os.path.join('data', 'nyc_study.geojson'))
(nyc_nbs, nyc_gdf, nyc_comp_sizes) = nyc_sort_by_comp_size(nyc_geodata)
nyc_nbs_adj = nbs_to_adjlist(nyc_nbs)
component_sizes = [x for x in nyc_comp_sizes if x > 1]
scaling_factors = get_scaling_factors(len(component_sizes), nyc_gdf)
Put all NYC predictors on the same scale - log transform columns as needed.
The BYM2_multicomp model requires many warmup iterations in order to reach convergence for all parameters, including hyperparameters rho and sigma, and a larger set of draws from the posterior in order to obtain satisfactory R-hat values.
Compile and fit the BYM2_multicomp model using the sum_to_zero_vector constraining transform.
We can compare the sum_to_zero_vector implementation to the corresponding soft sum-to-zero constraint. The full model is in file bym2_multicomp_soft.stan.
In model bym2_soft.stan the soft sum-to-zero constraint is combined directly with the ICAR prior:
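A sketch of the combined statement (the tolerance shown is illustrative; the exact value used in bym2_soft.stan may differ):

// ICAR prior via pairwise differences, plus a soft sum-to-zero constraint on phi
target += -0.5 * dot_self(phi[neighbors[1]] - phi[neighbors[2]]);
sum(phi) ~ normal(0, 0.001 * N);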
For the BYM2_multicomp model, this operation is carried out in two steps. First the ICAR prior is applied to phi, next we iterate through the components, applying the sum-to-zero constraint to each in turn.
The data inputs are the same. To ensure (roughly) the same initialization, we reuse the seed from bym2_multicomp_ozs_fit. This model fits very slowly and requires increasing the max_treedepth; at the default setting, all iterations hit this limit.
Compile and fit the BYM2_multicomp model using the soft sum-to-zero constraint.
The BYM2 model has a relatively complex multilevel structure, and the BYM2_multicomp model is even more complex. Before Stan 2.36, the recommended implementation was to use the soft sum-to-zero constraint for the BYM2 model.
For the BYM2_multicomp model, model bym2_multicomp.stan shows how to implement the sum_to_zero_vector constraining transform as a Stan function. Both implementations get the same estimates, confirming that both models are correctly implemented. Fitting the extended BYM2_multicomp model with a soft sum-to-zero constraint is painfully slow. Here we show that the sum_to_zero_vector greatly improves the run time, which is critical because fitting this particular dataset requires a very large number of warmup iterations.
Conclusion: the sum_to_zero_vector just works!
The more complex the model, the greater the need for the sum_to_zero_vector. When considering the effective sample size, it is important to remember that what is most important is effective samples per second. In these experiments, the sum_to_zero_vector models consistently have the highest ESS/sec.
Key takeaways:
In the first part of this case study, we show how to use the sum_to_zero_vector for categorical predictors in a hierarchical model, and how to adjust the marginal variances for a normal prior.
In the second part of this case study we see how to implement the constraining transform used for the sum_to_zero_vector as a Stan function in order to handle the case where a vector is the concatenation of several sum-to-zero vectors.