Causal Inference: Sodium intake and Blood Pressure Simulation and ATE Estimation
Last updated: Jan 12, 2022
A recent XKCD comic took me back to my Causal Inference course and gives off Goldilocks vibes.
I recently was going through Brady Neal’s excellent Introduction to Causal Inference Course. He goes through an excellent simulated dataset to demonstrate how different covariate adjustment approaches can influence your coefficient estimates. This simulation comes from LuqueFernandez et al.’s 2018 paper, Educational Note: Paradoxical collider effect in the analysis of noncommunicable disease epidemiological data: a reproducible illustration and web application with an accompanying shiny app. Their example estimates the effect of 24h dietary sodium intake on systolic blood pressure with the covariates of age (a confounder) and 24h urinary protein excretion (acts as a collider). Brady’s Python implementation is available on GitHub.
I implemented the code myself to illustrate the importance of drawing DAGs to lay out your assumptions and selecting the appropriate variables for adjustment. Unlike with observational data where the truth is unknown, here we use simulated data and define the “true” treatment effect. Andrew Heiss’s course material is great to work with dagitty
, ggdag
, and causal inference methods more generally. This post is heavily influenced by his work as well.
Generate the data as outlined in the LuqueFernandez et al. paper:
library(dplyr, warn.conflicts = FALSE)
library(ggplot2)
library(ggdag, warn.conflicts = FALSE)
library(dagitty)
# Function to generate data like LuqueFernandez et al. (2018)
generate_data < function(N=100, binary_cutoff=3.5, binary_treatment=TRUE, beta1=1.05, alpha1=0.4, alpha2=0.3) {
set.seed(42)
age < rnorm(N, mean = 65, sd = 5)
sodium < age/18 + rnorm(N)
# If binary_treatment=TRUE > binarize to 1 if Na intake > 3.5/day
ifelse(binary_treatment == TRUE,
sodium < ifelse(sodium > binary_cutoff, 1L, 0L),
sodium
)
blood_pressure < beta1 * sodium + 2 * age + rnorm(N)
proteinuria < alpha1 * sodium + alpha2 * blood_pressure + rnorm(N)
return(data.table::data.table(age, sodium, blood_pressure, proteinuria))
}
df < generate_data(N = 5e4) %>%
mutate(sodium_num = sodium) %>%
mutate(sodium = factor(sodium))
Working with DAGs
ðŸ’¡Conditioning on \(Z\) removes the dependency between \(X\) and \(Y\) for pipes and forks. Conditioning on \(Z\) opens the dependency between \(X\) and \(Y\) for colliders, i.e. for a collider relationship you are closed until you condition on \(Z\).
DAGs are both causal models and statistical models. Richard McElreath breaks DAGs down into the Four Elemental Confounds: Forks, Chains (Pipes), Colliders, and Descendants. By contrast, Miguel Hernan makes the distinction between common causes and common effects. Conditioning on common causes, such as forks and chains, creates an association, whereas conditioning on common effects (e.g. colliders) does not create an association.

Forks: \(X \leftarrow Z \rightarrow Y\)
 transmits a noncausal association
 conditioning on \(Z\) blocks transmission
 aka confounding

Chains (aka Pipes): \(X \rightarrow Z \rightarrow Y\)
 transmits a causal association
 conditioning on (controlling for) \(Z\) blocks transmission
 aka mediation

Colliders (aka Inverted fork, aka Immoralities): \(X \rightarrow Z \leftarrow Y\)
 does not transmit any association
 conditioning on \(Z\) opens transmission of a noncausal association
 i.e., controlling for \(Z\) introduces a spurious association
 aka collider bias
The presence of a common cause (e.g. forks, chains) between \(X\) and \(Y\) makes us expect an association between \(X\) and \(Y\) even if \(X\) does not cause \(Y\). For example, \(X \leftarrow Z \rightarrow Y\), where \(X\) is yellow fingers, \(Y\) is lung cancer and \(Z\) is smoking. Smoking causes yellow fingers and smoking also can cause lung cancer, but yellow fingers doesn’t cause lung cancer. Yet, per graph theory, we cannot exclude an association between \(X\) and \(Y\) when \(X\) and \(Y\) have a common cause, \(Z\) in this case, even if there is no arrow between \(X\) and \(Y\).
In the above example, we can break this unconditional/marginal association by conditioning on \(Z\), i.e. \(X \leftarrow \boxed{Z} \rightarrow Y\). This says that \(X\) and \(Y\) are conditionally independent, i.e. \(X\) and \(Y\) are independent conditional on \(Z\). In other words, the flow of association between \(X\) and \(Y\) is interrupted when we condition on their common cause \(Z\).
ðŸ’¡ Notation: a square box \(\boxed{Z}\) is accepted notation indicating that you are conditioning on this variable.
Draw the DAG
Draw the DAG with the dagitty
package ðŸ“¦
example_dag < dagify(
bp ~ tx + age,
tx ~ age,
urine_protein ~ tx + bp,
exposure = "tx",
outcome = "bp",
labels = c(
tx = "Sodium Intake",
bp = "Blood Pressure",
age = "Age",
urine_protein = "Urinary Protein"
),
coords = list(
x = c(tx = 1, age = 2, urine_protein = 2, bp = 3),
y = c(tx = 2, age = 3, urine_protein = 1, bp = 2)
)
)
ggdag_status(example_dag) + #, use_labels = "label", text = FALSE
theme_dag()
See the Path(s)
Using dagitty
: Find all the paths between \(X\) and \(Y\) using the paths()
function from the dagitty
package ðŸ“¦
paths(example_dag)
## $paths
## [1] "tx > bp" "tx > urine_protein < bp"
## [3] "tx < age > bp"
##
## $open
## [1] TRUE FALSE TRUE
Notice how the output also tells you if those paths are open/closed.
Using ggdag
: Find all the paths between \(X\) and \(Y\) using the ggdag_paths()
function from the ggdag
package ðŸ“¦
ggdag_paths(example_dag) +
theme_dag()
What to adjust for?
Instead of listing out all the possible paths and identifying backdoors by hand, you can use the adjustmentSets()
function in the dagitty
ðŸ“¦ to programmatically find all the nodes that need to be adjusted.
dagitty::adjustmentSets(example_dag)
## { age }
The output of adjustmentSets()
tells us that we should control for the age variable.
The ggdag
way to identify which variables to control for is as to use ggdag_adjustment_set()
:
ggdag_adjustment_set(example_dag, shadow = TRUE) +
theme_dag()
The result of either approach is the minimally sufficient adjustment set(s), which includes the fewest number of adjustments needed to close all backdoors between treatment \(T\), Na intake, and \(Y\), blood pressure.
ATE Estimation
The Average Treatment Effect (or ATE) is the average of the Individual Treatment Effects (ITE) across the population. The ITE is the treatment effect on an individual unit. A discussion on ITE, ATE, ATT, CATE, etc. probably warrants another post, but Lucy D’Agastino McGowan has an excellent post on propensity scores that features ATE, ATT, etc. worth checking out.
ITE:
$$ \text{ITE} = \tau_i \triangleq Y_i(1)  Y_i(0) $$
ATE:
$$ \tau_{\text{ATE}} \triangleq \mathbb{E}[Y_i(1)  Y_i(0)] = \mathbb{E}[Y(1)  Y(0)] $$
Now let’s try to estimate the ATE, which we already know is \(1.05\) in this simulation study by adjusting for different variables. Before we do, let’s just visualize the effect of high and low Na intake  \(T = 1\) and \(T = 0\), respectively  on SBP.
df %>%
ggplot(aes(x = sodium, y = blood_pressure)) +
geom_boxplot() +
cowplot::theme_cowplot(font_size = 11, font_family = "Fira Code")
Naive approach  “too little”
For our naive unadjusted model, we’ll only control for the treatment variable, sodium intake:
model_naive < lm(blood_pressure ~ sodium,
data = df)
model_naive %>%
broom::tidy()
## # A tibble: 2 Ã— 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 128. 0.0650 1963. 0
## 2 sodium1 5.29 0.0886 59.8 0
Our ATE estimate from our naive model is 5.29, but the “truth” from our simulation is \(1.05\). That means we’re overestimating the effect by 404.1%!
Overdoing it  “too extra”
Now let us see what happens if we adjust for all observed covariates:
model_all < lm(blood_pressure ~ sodium + age + proteinuria,
data = df)
model_all %>%
broom::tidy()
## # A tibble: 4 Ã— 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.0637 0.0558 1.14 0.254
## 2 sodium1 0.861 0.00924 93.2 0
## 3 age 1.83 0.00259 709. 0
## 4 proteinuria 0.274 0.00406 67.5 0
Now our ATE estimate for the treatment is 0.86. So we’re getting closer to the true ATE, but still off the mark by 18%.
“Just Right”
As we saw in our DAG above, proteinuria is a collider variable so it’d be wise in this case not to adjust for proteinuria. The adjustment set computed from our DAG in 0.1.3 only contained the age variable, so let’s try that now.
model_lm < lm(blood_pressure ~ sodium + age,
data = df)
model_lm %>%
broom::tidy()
## # A tibble: 3 Ã— 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.0601 0.0583 1.03 0.303
## 2 sodium1 1.06 0.00916 116. 0
## 3 age 2.00 0.000908 2202. 0
For this model, our ATE estimate is 1.06 , which is much closer to our true ATE value of \(1.05\).
Comparing models
huxtable::huxreg(
"Naive model" = model_naive,
"Full model" = model_all,
"Causal model" = model_lm
)
Naive model  Full model  Causal model  

(Intercept)  127.683 ***  0.064  0.060 
(0.065)  (0.056)  (0.058)  
sodium1  5.293 ***  0.861 ***  1.058 *** 
(0.089)  (0.009)  (0.009)  
age  1.834 ***  1.999 ***  
(0.003)  (0.001)  
proteinuria  0.274 ***  
(0.004)  
N  50000  50000  50000 
R2  0.067  0.991  0.990 
logLik  185440.908  68646.013  70828.860 
AIC  370887.816  137302.027  141665.720 
*** p < 0.001; ** p < 0.01; * p < 0.05. 