Baby steps in Bayes: Incorporating reliability estimates in regression models
Researchers sometimes calculate reliability indices such as Cronbach’s $\alpha$ or Revelle’s $\omega_T$, but their statistical models rarely take these reliability indices into account. Here I want to show you how you can incorporate information about the reliability of your measurements in a statistical model so as to obtain more honest and more readily interpretable parameter estimates.
Reliability and measurement error
This blog post is the sequel to the previous one, where I demonstrated how imperfectly measured control variables undercorrect for the actual confounding in observational studies (also see Berthele & Vanhove 2017; Brunner & Austin 2009; Westfall & Yarkoni 2016). A model that doesn’t account for measurement error on the confounding variable—and hence implicitly assumes that the confound was measured perfectly—may confidently conclude that the variable of actual interest is related to the outcome even when taking into account the confound. From such a finding, researchers typically infer that the variable of actual interest is causally related to the outcome even in absence of the confound. But once this measurement error is duly accounted for, you may find that the evidence for a causal link between the variable of interest and the outcome is more tenuous than originally believed.
So especially in observational studies, where confounds abound, it behooves researchers to account for the measurement error in their variables so that they don’t draw unwarranted conclusions from their data too often. The amount of measurement error on your variables is usually unknown. But if you’ve calculated some reliability estimate such as Cronbach’s $\alpha$ for your variables, you can use this to obtain an estimate of the amount of measurement error.
To elaborate, in classical test theory, the reliability $\rho_{XX'}$ of a measure is equal to the ratio of the variance of the (error-free) true scores to the variance of the observed scores. The latter is the sum of the variance of the true scores and the error variance:
\[\rho_{XX'} = \frac{\textrm{true score variance}}{\textrm{true score variance} + \textrm{measurement error variance}} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}\]

Rearranging, we get

\[\sigma^2_E = \sigma^2_T\left(\frac{1}{\rho_{XX'}} - 1\right)\]

None of these values are known, but they can be estimated based on the sample. Specifically, $\rho_{XX'}$ can be estimated by a reliability index such as Cronbach’s $\alpha$, and the sum $\sigma^2_T + \sigma^2_E$ can be estimated by computing the variable’s sample variance.
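As a quick numerical illustration of this rearrangement (the numbers here are hypothetical, not from the post):

```python
# Estimate the error variance from a reliability estimate and the sample variance.
# Since sigma_T^2 = rho * (sigma_T^2 + sigma_E^2), the identity
# sigma_E^2 = sigma_T^2 * (1/rho - 1) simplifies to sample_var * (1 - rho).
alpha = 0.80        # hypothetical Cronbach's alpha (reliability estimate)
sample_var = 2.25   # hypothetical sample variance of the observed scores

error_var = sample_var * (1 - alpha)   # estimated measurement error variance, ~0.45
true_var = sample_var * alpha          # estimated true score variance, ~1.80
```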
A simulated example
Let’s first deal with a simulated dataset. The main advantage of analysing simulated data is that you can check that what comes out of the model corresponds to what went into the data. In preparing this blog post, I was able to detect an arithmetic error in my model code in this way, as one parameter was consistently underestimated. Had I applied the model immediately to the real data set, I wouldn’t have noticed anything wrong. But we’ll deal with real data afterwards.
Run these commands to follow along:
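The original post presumably used R; as a stand-in, the sketches below use a minimal Python/NumPy setup (seed and library choice are my assumptions):

```python
# Minimal setup assumed by the Python sketches in this post (NumPy only).
import numpy as np

rng = np.random.default_rng(2019)  # arbitrary seed, for reproducibility
```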
Generating the construct scores
The scenario we’re going to simulate is one in which you have two correlated predictor variables (A and B) and one outcome variable (Z). Unbeknownst to the analyst, Z is causally affected by A but not by B. Moreover, the three variables are measured with some degree of error, but we’ll come to that later. Figure 1 depicts the scenario for which we’re going to simulate data.
Figure 1. Graphical causal model for the simulated data. A set of unobserved factors $U$ gives rise to a correlation between A and B. Even though only A causally affects Z, an association between B and Z will be found unless A is controlled for.
The first things we need are two correlated predictor variables. I’m going to generate these from a bivariate normal distribution. A has a mean of 3 units and a standard deviation of 1.5 units, and B has a mean of 4 and a standard deviation of 0.8 units. The correlation between them is $\rho = 0.73$. To generate a sample from this bivariate normal distribution, you need to construct the variance-covariance matrix from the standard deviations and the correlation, which I do in the code below:
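A self-contained Python/NumPy sketch of this step (the means, SDs and correlation come from the text; the sample size and seed are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(2019)  # seed assumed
n = 300                            # sample size assumed

mean_A, sd_A = 3.0, 1.5
mean_B, sd_B = 4.0, 0.8
rho = 0.73

# Variance-covariance matrix built from the SDs and the correlation
Sigma = np.array([
    [sd_A**2,           rho * sd_A * sd_B],
    [rho * sd_A * sd_B, sd_B**2],
])

# One draw of n observations from the bivariate normal
AB = rng.multivariate_normal([mean_A, mean_B], Sigma, size=n)
A, B = AB[:, 0], AB[:, 1]
```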
Next, we need to generate the outcome. In this simulation, Z depends linearly on A but not on B (hence ‘$0 \times B_i$’). The error term $\varepsilon$ is drawn from a normal distribution with standard deviation 1.3. Importantly, this error term does not express the measurement error on Z; it is the part of the true score variance in Z that isn’t related to either A or B:
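A self-contained sketch of this step (the intercept of 2 and the slope of 0.7 are the true values reported later in Table 1; n and seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2019)
n = 300

# Correlated predictors, as specified in the text above
Sigma = np.array([[1.5**2, 0.73 * 1.5 * 0.8],
                  [0.73 * 1.5 * 0.8, 0.8**2]])
AB = rng.multivariate_normal([3.0, 4.0], Sigma, size=n)
A, B = AB[:, 0], AB[:, 1]

# Z depends on A but not on B; epsilon is residual true-score
# variance, NOT measurement error
Z = 2.0 + 0.7 * A + 0.0 * B + rng.normal(0.0, 1.3, size=n)
```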
Even though B isn’t causally related to Z, we find that B and Z are correlated thanks to B’s correlation with A.
Figure 2. B is causally unrelated to Z, but these two variables are still correlated because B correlates with A, and A and Z are causally related.
A multiple regression model is able to tease apart the effects of A and B on Z:
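A least-squares sketch of this regression (the post presumably used R’s lm; n and seed are assumptions):

```python
import numpy as np

rng = np.random.default_rng(2019)
n = 300

# Simulate the true scores as specified in the text
Sigma = np.array([[1.5**2, 0.73 * 1.5 * 0.8],
                  [0.73 * 1.5 * 0.8, 0.8**2]])
AB = rng.multivariate_normal([3.0, 4.0], Sigma, size=n)
A, B = AB[:, 0], AB[:, 1]
Z = 2.0 + 0.7 * A + rng.normal(0.0, 1.3, size=n)

# Multiple regression of Z on A and B via ordinary least squares
X = np.column_stack([np.ones(n), A, B])
intercept_hat, slope_A_hat, slope_B_hat = np.linalg.lstsq(X, Z, rcond=None)[0]
```

With the error-free true scores, slope_A_hat lands near its true value of 0.7 and slope_B_hat near 0.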
The confound A is significantly related to Z and its estimated regression parameter is close to its true value of 0.70 ($\widehat{\beta_A} = 0.69 \pm 0.07$). The variable of interest B, by contrast, isn’t significantly related to Z when A has been accounted for ($\widehat{\beta_B} = 0.02 \pm 0.13$). This is all as it should be.
Adding measurement error
Now let’s add measurement error to all variables.
The A values that we actually observe will then be distorted versions of the true A values:

\[\textrm{obs\_A}_i = A_i + \varepsilon_{Ai}\]

The ‘noise’ $\varepsilon_{A}$ is commonly assumed to be normally distributed:

\[\varepsilon_{Ai} \sim \textrm{Normal}(0, \tau_A)\]

That said, you can easily imagine situations where the noise likely has a different distribution. For instance, when measurements are taken to the nearest integer (e.g., body weights in kilograms), the noise is likely uniformly distributed (e.g., a reported body weight of 66 kg means that the true body weight lies between 65.5 and 66.5 kg).
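A tiny illustration of the rounding case (the weights here are made up):

```python
import numpy as np

rng = np.random.default_rng(2019)

# Hypothetical true body weights in kg, reported to the nearest kilogram
true_weight = rng.uniform(50.0, 90.0, size=1000)
reported = np.round(true_weight)

# The measurement error is uniform on [-0.5, 0.5], not normal
error = reported - true_weight
```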
To make the link with the analysis more transparent, I will express the amount of noise in terms of the variables’ reliabilities. For the confound A, I set the reliability at 0.70. Since A’s standard deviation was 1.5 units, this means that the standard deviation of the noise is $\sqrt{1.5^2\left(\frac{1}{0.70} - 1\right)} = 0.98$ units. I set the reliability for B, the variable of actual interest, at 0.90. Its standard deviation is 0.8, so the standard deviation of the noise is $\sqrt{0.8^2\left(\frac{1}{0.90} - 1\right)} = 0.27$ units.
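These noise SDs can be checked numerically:

```python
import math

def noise_sd(true_sd, reliability):
    """SD of the measurement noise implied by a true-score SD and a reliability."""
    return true_sd * math.sqrt(1 / reliability - 1)

sd_noise_A = noise_sd(1.5, 0.70)   # ~0.98
sd_noise_B = noise_sd(0.8, 0.90)   # ~0.27
```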
The same logic applies to adding measurement noise to Z. The difficulty here lies in obtaining the population standard deviation (or variance) of Z. I don’t want to just plug in Z’s sample standard deviation, since I want to have exact knowledge of the population parameters. While we specified a sigma_Z.AB above, this is not the total population standard deviation of Z: it’s the population standard deviation of Z once Z has been controlled for A and B. To obtain the total standard deviation of Z (here admittedly confusingly labelled $\sigma_Z$), we need to add in the variance in Z due to A and B:

\[\sigma_Z = \sqrt{(\beta_A\sigma_A)^2 + (\beta_B\sigma_B)^2 + 2\beta_A\beta_B\rho\sigma_A\sigma_B + \sigma_{Z.AB}^2}\]
Since $\beta_B = 0$, this simplifies to $\sigma_Z = \sqrt{(\beta_A\sigma_A)^2 + \sigma_{Z.AB}^2}$, but if you want to simulate your own datasets, the full formula may be useful.
The population standard deviation of Z is thus $\sqrt{(0.7 \times 1.5)^2 + 1.3^2} = 1.67$. Setting Z’s reliability to 0.70, we find that the standard deviation of the noise is $\sqrt{1.67^2\left(\frac{1}{0.70} - 1\right)} = 1.09$.
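Numerically, using the full formula:

```python
import math

beta_A, beta_B = 0.7, 0.0
sd_A, sd_B, rho = 1.5, 0.8, 0.73
sigma_Z_AB = 1.3

# Total population SD of Z: variance due to A and B plus residual true-score variance
sd_Z = math.sqrt((beta_A * sd_A)**2 + (beta_B * sd_B)**2
                 + 2 * beta_A * beta_B * rho * sd_A * sd_B
                 + sigma_Z_AB**2)          # ~1.67

# Noise SD implied by a reliability of 0.70
error_Z = sd_Z * math.sqrt(1 / 0.70 - 1)   # ~1.09
```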
Figure 3 shows the causal diagram for the actually observed simulated data.
Figure 3. Graphical causal model for the simulated data. A, B and Z are now unobserved (latent) variables, but so-called ‘descendants’ of them were measured. These reflect their latent constructs imperfectly. Crucially, controlling for obs_A will not entirely control for the confound: the path between obs_B and obs_Z stays open.
As Figure 4 shows, the observed variables are all correlated with each other.
Figure 4.
Crucially, controlling for obs_A in a regression model doesn’t entirely eradicate the confound, and we find that obs_B is significantly related to obs_Z even after controlling for obs_A ($\widehat{\beta_{B_{\textrm{obs}}}} = 0.40 \pm 0.15$). Moreover, the regression model on the observed variables underestimates the strength of the relationship between the true construct scores ($\widehat{\beta_{A_{\textrm{obs}}}} = 0.40 \pm 0.07$, whereas $\beta_A = 0.7$).
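The whole phenomenon can be sketched end to end (population values from the text; I’ve enlarged n so the bias stands out clearly from sampling noise, and the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2019)
n = 5000  # larger than in the post, to make the bias obvious

# True scores
Sigma = np.array([[1.5**2, 0.73 * 1.5 * 0.8],
                  [0.73 * 1.5 * 0.8, 0.8**2]])
AB = rng.multivariate_normal([3.0, 4.0], Sigma, size=n)
A, B = AB[:, 0], AB[:, 1]
Z = 2.0 + 0.7 * A + rng.normal(0.0, 1.3, size=n)

# Observed scores: true scores plus measurement noise (SDs derived above)
obs_A = A + rng.normal(0.0, 0.98, size=n)
obs_B = B + rng.normal(0.0, 0.27, size=n)
obs_Z = Z + rng.normal(0.0, 1.09, size=n)

# Regression on the observed scores
X = np.column_stack([np.ones(n), obs_A, obs_B])
_, slope_A_obs, slope_B_obs = np.linalg.lstsq(X, obs_Z, rcond=None)[0]
# slope_A_obs is attenuated (well below 0.7) and slope_B_obs is pulled away from 0
```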
Statistical model
The statistical model, written below in Stan code, corresponds to the data generating mechanism above and tries to infer its parameters from the observed data and some prior information.
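The Stan program itself isn’t reproduced above. Below is a sketch of what a model along the lines described next might look like; the block structure and parameter names follow the text, but the details (and the priors, which are elided here) are my reconstruction, not the author’s actual code.

```stan
data {
  int<lower=1> N;
  vector[N] obs_A;
  vector[N] obs_B;
  vector[N] obs_Z;
}
parameters {
  // regression among the latent true scores
  real intercept;
  real slope_A;
  real slope_B;
  real<lower=0> sigma_Z_AB;
  // latent true scores and the parameters that generate them
  vector[N] A;
  vector[N] B;
  vector[N] Z;
  real mean_A;
  real mean_B;
  real<lower=0> sd_A;
  real<lower=0> sd_B;
  real<lower=-1, upper=1> rho;
  // reliabilities of the observed scores
  real<lower=0, upper=1> reliability_A;
  real<lower=0, upper=1> reliability_B;
  real<lower=0, upper=1> reliability_Z;
}
transformed parameters {
  cov_matrix[2] Sigma;
  // total SD of Z, combining regression and residual true-score variance
  real<lower=0> sd_Z = sqrt(square(slope_A * sd_A) + square(slope_B * sd_B)
                            + 2 * slope_A * slope_B * rho * sd_A * sd_B
                            + square(sigma_Z_AB));
  // measurement-error SDs implied by the reliabilities
  real<lower=0> error_A = sd_A * sqrt(1 / reliability_A - 1);
  real<lower=0> error_B = sd_B * sqrt(1 / reliability_B - 1);
  real<lower=0> error_Z = sd_Z * sqrt(1 / reliability_Z - 1);
  Sigma[1, 1] = square(sd_A);
  Sigma[2, 2] = square(sd_B);
  Sigma[1, 2] = rho * sd_A * sd_B;
  Sigma[2, 1] = Sigma[1, 2];
}
model {
  // latent predictors from a bivariate normal
  for (i in 1:N)
    [A[i], B[i]]' ~ multi_normal([mean_A, mean_B]', Sigma);
  // latent regression
  Z ~ normal(intercept + slope_A * A + slope_B * B, sigma_Z_AB);
  // measurement model: observed score = true score + noise
  obs_A ~ normal(A, error_A);
  obs_B ~ normal(B, error_B);
  obs_Z ~ normal(Z, error_Z);
  // (priors on the remaining parameters go here)
}
```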
The data block specifies the input that the model should handle. I think this is self-explanatory. Note that the latent variable scores A, B and Z aren’t part of the input, as we wouldn’t have directly observed these.
The parameters block first defines the parameters needed for the regression model with the unobserved latent variables (the one we used to generate Z). It then defines the parameters needed to generate the true variable scores for A and B as well as the parameters needed to generate the observed scores from the true scores (viz., the true scores themselves and the reliabilities). Note that it is crucial to allow the model to estimate a correlation between A and B; otherwise it won’t ‘know’ that A confounds the relationship between B and Z.
The transformed parameters block contains, well, transformations of these parameters. For instance, the standard deviations of A and B and the correlation between A and B are used to generate a variance-covariance matrix. Moreover, the standard deviations of the measurement noise are computed using the latent variables’ standard deviations and their reliabilities.
The model block, finally, specifies how we think the observed data and the (transformed or untransformed) parameters fit together and what plausible a priori values for the (transformed or untransformed) parameters are. These prior distributions are pretty abstract in this example: we generated context-free data ourselves, so it’s not clear what motivates these priors. The real example to follow will hopefully make more sense in this respect. You’ll notice that I’ve also specified a prior for the reliabilities. The reason is that you typically don’t know the reliability of an observed variable with perfect precision, but rather have a sample estimate with some inherent uncertainty. The priors reflect this uncertainty. Again, this will become clearer in the real example to follow.
Put the data in a list and fit the model:
Results
The model recovers the true parameter values pretty well (Table 1) and, on the basis of this model, you wouldn’t erroneously conclude that B is causally related to Z (see the parameter estimate for slope_B).
Table 1. Comparison of the true parameter values and their estimates. Parameters prefixed with ‘*’ are transformations of other parameters. The estimated reliability parameters aren’t really estimated from the data: they just reflect the prior distributions put on them.
| Parameter | True value | Estimate |
|---|---|---|
| intercept | 2.00 | $0.67 \pm 1.53$ |
| slope_A | 0.70 | $0.86 \pm 0.17$ |
| slope_B | 0.00 | $0.20 \pm 0.26$ |
| sigma_Z.AB | 1.30 | $1.17 \pm 0.12$ |
| sd_A | 1.50 | $1.40 \pm 0.07$ |
| sd_B | 0.80 | $0.80 \pm 0.03$ |
| mean_A | 3.00 | $3.06 \pm 0.10$ |
| mean_B | 4.00 | $4.02 \pm 0.05$ |
| rho | 0.73 | $0.78 \pm 0.05$ |
| reliability_A | 0.70 | $0.69 \pm 0.04$ |
| reliability_B | 0.90 | $0.90 \pm 0.03$ |
| reliability_Z | 0.70 | $0.70 \pm 0.04$ |
| *sd_Z | 1.67 | $1.59 \pm 0.07$ |
| *error_A | 0.98 | $0.92 \pm 0.07$ |
| *error_B | 0.27 | $0.26 \pm 0.04$ |
| *error_Z | 1.09 | $1.05 \pm 0.08$ |
In the previous blog post, I showed that such a model also estimates the latent true variable scores and that these estimates correspond more closely to the actual true variable scores than the observed variable scores do. I’ll skip this step here.
Real-life example: Interdependence
Satisfied that our model can recover the actual parameter values in scenarios such as the one depicted in Figure 3, we now turn to a real-life example of such a situation. The example was already described in the previous blog post; here I’ll just draw the causal model that reflects the null hypothesis: a child’s Portuguese skills at T2 (PT.T2) don’t contribute to their French skills at T3 (FR.T3), but, due to common factors such as intelligence, form on the day, etc. ($U$), French and Portuguese skills at T2 are correlated across children. What is observed are test scores, not the children’s actual skills.
Figure 5.
Data
The command below is pig-ugly, but it allows you to easily read in the data.
Statistical model
The only things that have changed in the statistical model compared to the example with the simulated data are that I’ve renamed the parameters and that the prior distributions are better motivated. Let’s consider each prior distribution in turn:

- intercept ~ normal(0.2, 0.1): The intercept is the average true French skill score at T3 for children whose true French and Portuguese skill scores at T2 are 0. This is the lowest possible score (the theoretical range of the data is [0, 1]), so we’d expect such children to perform poorly at T3, too. A normal(0.2, 0.1) distribution puts 95% probability on such children having a true French score at T3 between 0 and 0.4.
- slope_FR ~ normal(0.5, 0.25): This parameter expresses the difference between the average true French skill score at T3 for children with a true French skill score of 1 at T2 (the theoretical maximum) vs. those with a true French skill score of 0 at T2 (the theoretical minimum). This is obviously some value between −1 and 1, and presumably it’s going to be positive. A normal(0.5, 0.25) prior puts 95% probability on this difference lying between 0 and 1, which I think is reasonable.
- slope_PT ~ normal(0, 0.25): The slope for Portuguese is bound to be smaller than the one for French. Moreover, it’s not a given that it will be appreciably different from zero. Hence a prior centred on 0 that still gives the data a chance to pull the estimate in either direction.
- sigma ~ normal(0.15, 0.08): If neither of the T2 variables predicts T3, uncertainty is highest when the mean T3 score is 0.5. Since these scores are bounded between 0 and 1, the standard deviation could not be much higher than 0.20. But French at T2 is bound to be a predictor, so let us choose a slightly lower value (0.15).
- latent_means ~ normal(0.5, 0.1): These are the prior expectations for the true score means of the T2 variables. 0.5 lies in the middle of the scale; the normal(0.5, 0.1) prior puts 95% probability on these means lying between 0.3 and 0.7.
- sigma_lat_FR_T2 ~ normal(0, 0.25), sigma_lat_PT_T2 ~ normal(0, 0.25): The standard deviations of the latent T2 variables. These truncated normal distributions put 95% probability on the latent standard deviations being lower than 0.50.
- latent_rho ~ normal(0.4, 0.3): The a priori expected correlation between the latent T2 variables. These are bound to be positively correlated.
- reliability_FR_T2 ~ beta(100, 100*0.27/0.73): The prior distribution for the reliability of the French T2 variable. Cronbach’s $\alpha$ for this variable was 0.73 (95% CI: [0.65, 0.78]). This roughly corresponds to a beta(100, 100*0.27/0.73) distribution.
- reliability_PT_T2 ~ beta(120, 120*0.21/0.79): Similarly, Cronbach’s $\alpha$ for the Portuguese T2 variable was 0.79 (95% CI: [0.72, 0.84]), which roughly corresponds to a beta(120, 120*0.21/0.79) distribution.
- reliability_FR_T3 ~ beta(73, 27): The estimated reliability for the French T3 data was similar to that of the T2 data, so I used the same prior.
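You can check that these beta priors roughly match the reported confidence intervals by simulation (a NumPy sketch; the seed and number of draws are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2019)

# Draw from the two beta priors and inspect their central 95% intervals
draws_FR = rng.beta(100, 100 * 0.27 / 0.73, size=200_000)
draws_PT = rng.beta(120, 120 * 0.21 / 0.79, size=200_000)

lo_FR, hi_FR = np.percentile(draws_FR, [2.5, 97.5])  # roughly [0.65, 0.80]
lo_PT, hi_PT = np.percentile(draws_PT, [2.5, 97.5])  # roughly [0.72, 0.85]
```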
Results
Unsurprisingly, the model confidently finds a link between French skills at T2 and at T3, even on the level of the unobserved true scores ($\widehat{\beta}_{\textrm{French}} = 0.71 \pm 0.16$). But more importantly, the evidence for an additional effect of Portuguese skills at T2 on French skills at T3 is flimsy ($\widehat{\beta}_{\textrm{Portuguese}} = 0.11 \pm 0.14$). The latent T2 variables are estimated to correlate strongly at $\widehat{\rho} = 0.81 \pm 0.08$. These results don’t change much when a flat prior on $\rho$ is specified (this can be accomplished by not specifying any prior at all for $\rho$).
Compared to the model in the previous blog post (Table 2), little has changed. The only appreciable difference is that the estimate for sigma is lower. The reason is that, unlike the previous model, the current model partitions the variance in the French T3 scores into true score variance and measurement error variance. In this model, sigma captures the true score variance that isn’t accounted for by T2 skills, whereas in the previous model, sigma captured the total variance that wasn’t accounted for by T2 skills. But other than that, the current model doesn’t represent a huge change from the previous one.
Table 2. Comparison of parameter estimates between the model fitted in this blog post and the model fitted in the previous one. In the previous model, the outcome was assumed to be perfectly reliable.
| Parameter | Current estimate | Previous estimate |
|---|---|---|
| intercept | $0.19 \pm 0.05$ | $0.19 \pm 0.05$ |
| slope_FR | $0.71 \pm 0.16$ | $0.71 \pm 0.16$ |
| slope_PT | $0.11 \pm 0.14$ | $0.10 \pm 0.14$ |
| sigma | $0.07 \pm 0.02$ | $0.12 \pm 0.01$ |
| latent_rho | $0.81 \pm 0.08$ | $0.81 \pm 0.08$ |
A couple of things still remain to be done. First, the French test at T3 was the same as the one at T2, so it’s likely that the measurement errors on both scores won’t be completely independent of one another. I’d like to find out how correlated measurement errors affect the parameter estimates. Second, I’d like to get started with prior and posterior predictive checks: the former to check if the priors give rise to broadly plausible data patterns, and the latter to check if the full model tends to generate data sets similar to the one actually observed.
References
Berthele, Raphael and Jan Vanhove. 2017. What would disprove interdependence? Lessons learned from a study on biliteracy in Portuguese heritage language speakers in Switzerland. International Journal of Bilingual Education and Bilingualism.
Brunner, Jerry and Peter C. Austin. 2009. Inflation of Type I error rate in multiple regression when independent variables are measured with error. Canadian Journal of Statistics 37(1). 33–46.
Westfall, Jacob and Tal Yarkoni. 2016. Statistically controlling for confounding constructs is harder than you think. PLOS ONE 11(3). e0152719.