
6. The Haunted DAG and The Causal Terror

# Install packages that are not installed in colab
try:
  import google.colab
  %pip install --upgrade -q daft
  %pip install -q causalgraphicalmodels
  %pip install -q watermark
  %pip install git+https://github.com/ksachdeva/rethinking-tensorflow-probability.git
except:
  pass
%load_ext watermark
# Core
import numpy as np
import arviz as az
import pandas as pd
import tensorflow as tf
import tensorflow_probability as tfp
from statsmodels.regression import linear_model

# visualization
import matplotlib.pyplot as plt

import daft
from causalgraphicalmodels import CausalGraphicalModel

from rethinking.data import RethinkingDataset
from rethinking.data import dataframe_to_tensors
from rethinking.mcmc import sample_posterior

# aliases
tfd = tfp.distributions
tfb = tfp.bijectors
Root = tfd.JointDistributionCoroutine.Root
%watermark -p numpy,tensorflow,tensorflow_probability,arviz,scipy,pandas,statsmodels,daft,causalgraphicalmodels,rethinking
numpy                 : 1.21.5
tensorflow            : 2.7.0
tensorflow_probability: 0.15.0
arviz                 : 0.11.4
scipy                 : 1.7.3
pandas                : 1.3.5
statsmodels           : 0.13.1
daft                  : 0.1.2
causalgraphicalmodels : 0.0.4
rethinking            : 0.1.0
# config of various plotting libraries
%config InlineBackend.figure_format = 'retina'

Berkson’s paradox - a false observation of a negative correlation between two positive traits.

Members of a population that have one positive trait tend to lack a second one, even though:

  • The traits may be unrelated

  • Or, they may even be positively related.

e.g. restaurants in good locations can have bad food, even though location and food quality are unrelated in the full population of restaurants.

Berkson’s paradox is also called the selection-distortion effect.

The gist of the idea here is that when a sample is selected on a combination of 2 (or more) variables, the relationship between those 2 variables is different after selection than it was before.

Because of the above, we should be cautious about adding predictor variables to a regression, since doing so can introduce statistical selection within the model. This phenomenon has a name: COLLIDER BIAS.

There are actually 3 types of hazards when we add more predictor variables -

  • Multicollinearity

  • Post-treatment bias

  • Collider bias

Collider - the name comes from particle physics, where a collider is a machine in which two beams of particles are made to collide.

Overthinking: Simulated science distortion

Code 6.1

A simulated example to demonstrate the selection-distortion effect. Imagine a review panel that scores research proposals on both newsworthiness and trustworthiness, and funds only the proposals with the highest combined score.

The code below shows how the two scores become correlated with each other after selection.

# A simulated example to demonstrate selection-distortion.
# Newsworthiness and trustworthiness are simulated as uncorrelated;
# only the top 10% by combined score are then selected.

_SEED = 1914

N = 200
p = 0.1  # proportion to select

# uncorrelated newsworthiness & trustworthiness
seed = tfp.util.SeedStream(_SEED, salt="")
nw = tfd.Normal(loc=0.0, scale=1.0).sample(N, seed=seed()).numpy()
seed = tfp.util.SeedStream(_SEED, salt="")
tw = tfd.Normal(loc=0.0, scale=1.0).sample(N, seed=seed()).numpy()

# select top 10% of combined scores
s = nw + tw  # total score
q = np.quantile(s, 1 - p)  # top 10% threshold

selected = np.where(s >= q, True, False)

np.corrcoef(tw[selected], nw[selected])[0, 1]
-0.828225254885134

This shows that, within the selected sample, nw (newsworthiness) & tw (trustworthiness) are strongly negatively correlated.
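As a quick numerical check (an addition, not part of the book's code), we can contrast the correlation in the full simulated sample with the correlation among the selected proposals:

# correlation before selection (full sample) vs. after selection (top 10%)
print("full sample :", np.corrcoef(tw, nw)[0, 1])
print("selected    :", np.corrcoef(tw[selected], nw[selected])[0, 1])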

6.1 Multicollinearity

Multicollinearity happens when the predictor variables are strongly correlated.

When this happens the posterior distribution will seem to suggest that none of the variables is reliably associated with the outcome.

The model will still make valid predictions, but it becomes difficult to interpret the individual coefficients.

6.1.1 Multicollinear legs

Here we create an artificial dataset relating a person's height to the lengths of their legs, which serve as the predictor variables.

The two legs are more or less the same length, so they are strongly correlated with each other. Including both legs as predictors results in multicollinearity.

Code 6.2

_SEED = 909

N = 100


def generate_height_leg_data():
    seed = tfp.util.SeedStream(_SEED, salt="leg_exp")
    height = tfd.Normal(loc=10.0, scale=2.0).sample(N, seed=seed()).numpy()
    leg_prop = tfd.Uniform(low=0.4, high=0.5).sample(N, seed=seed()).numpy()

    # left & right leg as proportion + error
    leg_left = (
        leg_prop * height + tfd.Normal(loc=0, scale=0.02).sample(N, seed=seed()).numpy()
    )

    leg_right = (
        leg_prop * height + tfd.Normal(loc=0, scale=0.02).sample(N, seed=seed()).numpy()
    )

    # build a dataframe using above
    d = pd.DataFrame({"height": height, "leg_left": leg_left, "leg_right": leg_right})

    return d


d = generate_height_leg_data()

d.describe()
height leg_left leg_right
count 100.000000 100.000000 100.000000
mean 10.244188 4.586804 4.589090
std 1.971723 0.856999 0.857432
min 6.569585 2.711091 2.734310
25% 8.705090 3.959075 3.970327
50% 10.478082 4.628943 4.644671
75% 11.372977 5.143220 5.147640
max 15.219750 7.189007 7.178730
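To see the collinearity directly (a quick check added here, not part of the book's code), look at the correlation matrix of the simulated columns; the two legs should be correlated at well above 0.99.

# correlation matrix of the simulated outcome and predictors
d[["height", "leg_left", "leg_right"]].corr()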

Code 6.3

Very vague (bad) priors are used in this model so that we can be sure the priors are not responsible for what we are about to observe.

tdf = dataframe_to_tensors("SimulatedHeight", d, ["leg_left", "leg_right", "height"])


def model_6_1(leg_left_data, leg_right_data):
    def _generator():
        alpha = yield Root(
            tfd.Sample(tfd.Normal(loc=10.0, scale=100.0, name="alpha"), sample_shape=1)
        )
        betaL = yield Root(
            tfd.Sample(tfd.Normal(loc=2.0, scale=10.0, name="betaL"), sample_shape=1)
        )
        betaR = yield Root(
            tfd.Sample(tfd.Normal(loc=2.0, scale=10.0, name="betaR"), sample_shape=1)
        )
        sigma = yield Root(
            tfd.Sample(tfd.Exponential(rate=1.0, name="sigma"), sample_shape=1)
        )

        mu = tf.squeeze(
            alpha[..., tf.newaxis]
            + betaL[..., tf.newaxis] * leg_left_data
            + betaR[..., tf.newaxis] * leg_right_data
        )

        scale = sigma[..., tf.newaxis]

        height = yield tfd.Independent(
            tfd.Normal(loc=mu, scale=scale, name="height"), reinterpreted_batch_ndims=1
        )

    return tfd.JointDistributionCoroutine(_generator, validate_args=False)


jdc_6_1 = model_6_1(tdf.leg_left, tdf.leg_right)
NUM_CHAINS_FOR_6_1 = 2

# Note -
# I got somewhat OK-looking results with these initial values
init_state = [
    tf.ones([NUM_CHAINS_FOR_6_1]),
    tf.ones([NUM_CHAINS_FOR_6_1]),
    tf.ones([NUM_CHAINS_FOR_6_1]),
    tf.ones([NUM_CHAINS_FOR_6_1]),
]

bijectors = [tfb.Identity(), tfb.Identity(), tfb.Identity(), tfb.Exp()]

observed_data = (tdf.height,)

posterior_6_1, trace_6_1 = sample_posterior(
    jdc_6_1,
    observed_data=observed_data,
    num_chains=NUM_CHAINS_FOR_6_1,
    init_state=init_state,
    bijectors=bijectors,
    num_samples=4000,
    burnin=500,
    params=["alpha", "betaL", "betaR", "sigma"],
)

az.summary(trace_6_1, hdi_prob=0.89)
mean sd hdi_5.5% hdi_94.5% mcse_mean mcse_sd ess_bulk ess_tail r_hat
alpha 0.340 0.380 -0.182 0.978 0.059 0.042 47.0 60.0 1.05
betaL 0.386 1.346 -1.506 2.602 0.807 0.644 3.0 15.0 2.01
betaR 1.772 1.331 -0.416 3.749 0.793 0.682 3.0 15.0 1.99
sigma 0.663 0.047 0.587 0.735 0.001 0.001 1322.0 1987.0 1.00

Notice the betaL and betaR values. How can they be so different? After all, we know that the legs are more or less the same length, so their effects should be roughly equal. This is confusing indeed!

az.plot_forest(trace_6_1, hdi_prob=0.89, combined=True)
array([<AxesSubplot:title={'center':'89.0% HDI'}>], dtype=object)
../_images/06_the_haunted_dag_and_the_causal_terror_20_1.png

Code 6.5

Looking at the bivariate posterior distribution for betaR & betaL.

Since both predictors contain almost exactly the same information, the posterior is a narrow ridge of negatively correlated values.

fig, ax = plt.subplots(1, 1, figsize=[7, 3])

plt.scatter(posterior_6_1["betaR"], posterior_6_1["betaL"], alpha=0.05, s=20)
ax.set_xlabel("br")
ax.set_ylabel("bl")
Text(0, 0.5, 'bl')
../_images/06_the_haunted_dag_and_the_causal_terror_22_1.png
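The ridge can also be summarized numerically (a small addition): the posterior samples of betaR and betaL are almost perfectly negatively correlated.

# correlation between the posterior samples of betaR and betaL
np.corrcoef(
    np.reshape(posterior_6_1["betaR"].numpy(), -1),
    np.reshape(posterior_6_1["betaL"].numpy(), -1),
)[0, 1]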

Code 6.6

The plot below shows that the posterior distribution of the sum of the two parameters is centered on the proper association of either leg with height.

sum_blbr = posterior_6_1["betaL"] + posterior_6_1["betaR"]
az.plot_kde(sum_blbr.numpy(), label="sum of bl and br");
../_images/06_the_haunted_dag_and_the_causal_terror_24_0.png

Code 6.7

We now run the regression using only one leg; the posterior mean will be approximately the same as the sum above.

Building the model using only left leg.

def model_6_2(leg_left_data):
    def _generator():
        alpha = yield Root(
            tfd.Sample(tfd.Normal(loc=10.0, scale=100.0, name="alpha"), sample_shape=1)
        )
        betaL = yield Root(
            tfd.Sample(tfd.Normal(loc=2.0, scale=10.0, name="betaL"), sample_shape=1)
        )
        sigma = yield Root(
            tfd.Sample(tfd.Exponential(rate=1.0, name="sigma"), sample_shape=1)
        )

        mu = tf.squeeze(alpha[..., tf.newaxis] + betaL[..., tf.newaxis] * leg_left_data)

        scale = sigma[..., tf.newaxis]

        height = yield tfd.Independent(
            tfd.Normal(loc=mu, scale=scale, name="height"), reinterpreted_batch_ndims=1
        )

    return tfd.JointDistributionCoroutine(_generator, validate_args=False)


jdc_6_2 = model_6_2(tdf.leg_left)
NUM_CHAINS_FOR_6_2 = 2

# Note -
# I got somewhat OK-looking results with these initial values
init_state = [
    tf.ones([NUM_CHAINS_FOR_6_2]),
    tf.ones([NUM_CHAINS_FOR_6_2]),
    tf.ones([NUM_CHAINS_FOR_6_2]),
]

bijectors = [tfb.Identity(), tfb.Identity(), tfb.Exp()]

observed_data = (tdf.height,)

posterior_6_2, trace_6_2 = sample_posterior(
    jdc_6_2,
    observed_data=observed_data,
    num_chains=NUM_CHAINS_FOR_6_2,
    init_state=init_state,
    bijectors=bijectors,
    num_samples=4000,
    burnin=500,
    params=["alpha", "betaL", "sigma"],
)

az.summary(trace_6_2, hdi_prob=0.89)
mean sd hdi_5.5% hdi_94.5% mcse_mean mcse_sd ess_bulk ess_tail r_hat
alpha 0.250 0.325 -0.283 0.759 0.035 0.025 87.0 130.0 1.02
betaL 2.179 0.070 2.071 2.294 0.008 0.005 89.0 132.0 1.02
sigma 0.662 0.047 0.587 0.736 0.001 0.001 1496.0 2735.0 1.00

The value of betaL is now almost identical to the sum of betaL & betaR from the previous model (see the chart from Code 6.6).
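As a quick numerical check (an addition), compare the posterior mean of betaL in this one-leg model with the posterior mean of the sum betaL + betaR from the two-leg model; the two should be close.

# posterior mean of betaL (one-leg model) vs. posterior mean of betaL + betaR (two-leg model)
print("betaL (one-leg model)        :", posterior_6_2["betaL"].numpy().mean())
print("betaL + betaR (two-leg model):", sum_blbr.numpy().mean())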

6.1.2 Multicollinear milk

In real datasets, it may be difficult to identify highly correlated predictors, which can lead to the mistaken conclusion that none of the predictors is important.

Here the primate milk dataset is used as another example of the multicollinearity challenge.

Code 6.8

Loading the dataset and creating new standardized (centered and scaled) columns

d = RethinkingDataset.Milk.get_dataset()

d["K"] = d["kcal.per.g"].pipe(lambda x: (x - x.mean()) / x.std())
d["F"] = d["perc.fat"].pipe(lambda x: (x - x.mean()) / x.std())
d["L"] = d["perc.lactose"].pipe(lambda x: (x - x.mean()) / x.std())
tdf = dataframe_to_tensors("Milk", d, ["K", "F", "L"])

Code 6.9

We are building 2 models here.

  • KCal regressed on perc.fat

  • KCal regressed on perc.lactose

# KCal is regressed on perc.fat


def model_6_3(per_fat):
    def _generator():
        alpha = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=0.2, name="alpha"), sample_shape=1)
        )
        betaF = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=0.5, name="betaF"), sample_shape=1)
        )
        sigma = yield Root(
            tfd.Sample(tfd.Exponential(rate=1.0, name="sigma"), sample_shape=1)
        )

        mu = alpha[..., tf.newaxis] + betaF[..., tf.newaxis] * tf.transpose(
            per_fat[..., tf.newaxis]
        )
        scale = sigma[..., tf.newaxis]

        K = yield tfd.Independent(
            tfd.Normal(loc=mu, scale=scale, name="K"), reinterpreted_batch_ndims=1
        )

    return tfd.JointDistributionCoroutine(_generator, validate_args=False)


jdc_6_3 = model_6_3(tdf.F)
NUM_CHAINS_FOR_6_3 = 2

init_state = [
    tf.zeros([NUM_CHAINS_FOR_6_3]),
    tf.zeros([NUM_CHAINS_FOR_6_3]),
    tf.ones([NUM_CHAINS_FOR_6_3]),
]

bijectors = [tfb.Identity(), tfb.Identity(), tfb.Exp()]

posterior_6_3, trace_6_3 = sample_posterior(
    jdc_6_3,
    observed_data=(tdf.K,),
    init_state=init_state,
    bijectors=bijectors,
    params=["alpha", "betaF", "sigma"],
)
# KCal is regressed on perc.lactose


def model_6_4(per_lac):
    def _generator():
        alpha = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=0.2, name="alpha"), sample_shape=1)
        )
        betaL = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=0.5, name="betaL"), sample_shape=1)
        )
        sigma = yield Root(
            tfd.Sample(tfd.Exponential(rate=1.0, name="sigma"), sample_shape=1)
        )

        mu = alpha[..., tf.newaxis] + betaL[..., tf.newaxis] * tf.transpose(
            per_lac[..., tf.newaxis]
        )
        scale = sigma[..., tf.newaxis]

        K = yield tfd.Independent(
            tfd.Normal(loc=mu, scale=scale, name="K"), reinterpreted_batch_ndims=1
        )

    return tfd.JointDistributionCoroutine(_generator, validate_args=False)


jdc_6_4 = model_6_4(tdf.L)
NUM_CHAINS_FOR_6_4 = 2

init_state = [
    tf.zeros([NUM_CHAINS_FOR_6_4]),
    tf.zeros([NUM_CHAINS_FOR_6_4]),
    tf.ones([NUM_CHAINS_FOR_6_4]),
]

bijectors = [tfb.Identity(), tfb.Identity(), tfb.Exp()]

posterior_6_4, trace_6_4 = sample_posterior(
    jdc_6_4,
    observed_data=(tdf.K,),
    init_state=init_state,
    bijectors=bijectors,
    params=["alpha", "betaL", "sigma"],
)
az.summary(trace_6_3, hdi_prob=0.89)
mean sd hdi_5.5% hdi_94.5% mcse_mean mcse_sd ess_bulk ess_tail r_hat
alpha 0.015 0.088 -0.128 0.149 0.008 0.006 114.0 146.0 1.00
betaF 0.869 0.090 0.715 0.996 0.008 0.005 136.0 200.0 1.02
sigma 0.487 0.067 0.379 0.580 0.004 0.003 259.0 89.0 1.04
az.summary(trace_6_4, hdi_prob=0.89)
mean sd hdi_5.5% hdi_94.5% mcse_mean mcse_sd ess_bulk ess_tail r_hat
alpha -0.001 0.077 -0.110 0.136 0.007 0.005 115.0 199.0 1.02
betaL -0.898 0.080 -1.023 -0.777 0.005 0.004 264.0 411.0 1.01
sigma 0.414 0.059 0.319 0.494 0.002 0.001 1121.0 401.0 1.00

Posterior for betaF & betaL are mirror images of each other.

This seems to imply that both predictors have strong association with the outcome. Notice the sd of both the predictors; they are quite low.

In next section we will see what happens when we combine both of them in the regression model
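Before doing so, it is worth checking (a quick addition, not in the book's code) how strongly the two standardized predictors are correlated with each other; perc.fat and perc.lactose are strongly negatively correlated in these data.

# correlation between the standardized fat and lactose columns
d[["F", "L"]].corr()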

Code 6.10

We build a model that includes both predictors (per_fat & per_lac).

def model_6_5(per_fat, per_lac):
    def _generator():
        alpha = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=0.2, name="alpha"), sample_shape=1)
        )
        betaF = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=0.5, name="betaF"), sample_shape=1)
        )
        betaL = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=0.5, name="betaL"), sample_shape=1)
        )
        sigma = yield Root(
            tfd.Sample(tfd.Exponential(rate=1.0, name="sigma"), sample_shape=1)
        )

        mu = (
            alpha[..., tf.newaxis]
            + betaF[..., tf.newaxis] * tf.transpose(per_fat[..., tf.newaxis])
            + betaL[..., tf.newaxis] * tf.transpose(per_lac[..., tf.newaxis])
        )

        scale = sigma[..., tf.newaxis]

        K = yield tfd.Independent(
            tfd.Normal(loc=mu, scale=scale, name="K"), reinterpreted_batch_ndims=1
        )

    return tfd.JointDistributionCoroutine(_generator, validate_args=False)


jdc_6_5 = model_6_5(tdf.F, tdf.L)
NUM_CHAINS_FOR_6_5 = 2

init_state = [
    tf.zeros([NUM_CHAINS_FOR_6_5]),
    tf.zeros([NUM_CHAINS_FOR_6_5]),
    tf.zeros([NUM_CHAINS_FOR_6_5]),
    tf.ones([NUM_CHAINS_FOR_6_5]),
]

bijectors = [tfb.Identity(), tfb.Identity(), tfb.Identity(), tfb.Exp()]

posterior_6_5, trace_6_5 = sample_posterior(
    jdc_6_5,
    observed_data=(tdf.K,),
    num_chains=NUM_CHAINS_FOR_6_5,
    init_state=init_state,
    bijectors=bijectors,
    params=["alpha", "betaF", "betaL", "sigma"],
)

az.summary(trace_6_5, hdi_prob=0.89)
mean sd hdi_5.5% hdi_94.5% mcse_mean mcse_sd ess_bulk ess_tail r_hat
alpha -0.006 0.068 -0.112 0.100 0.004 0.003 250.0 457.0 1.01
betaF 0.260 0.197 -0.022 0.564 0.015 0.010 170.0 200.0 1.01
betaL -0.659 0.204 -0.964 -0.347 0.016 0.012 149.0 129.0 1.01
sigma 0.413 0.063 0.321 0.511 0.003 0.002 361.0 196.0 1.02

Notice how the standard deviations of the posteriors for betaF & betaL have jumped up significantly.

per_fat & per_lac both contain much of the same information and can be substitued for each other. Therefore, the model has not choice but to describe a long ridge of combination of betaF & betaL that are equally plaussible.

Code 6.11

We plot again like in the legs example to visualize the relationship

az.plot_pair(d[["kcal.per.g", "perc.fat", "perc.lactose"]].to_dict("list"));
../_images/06_the_haunted_dag_and_the_causal_terror_46_0.png
  • perc.fat is positively related to the outcome

  • perc.lactose is negatively related to the outcome

  • perc.fat & perc.lactose are negatively correlated with each other

Either of them helps in predicting KCal but neither helps much once we already know the other.

Overthinking : Simulating collinearity

Simulating collinearity using the Milk dataset

Code 6.12

d = RethinkingDataset.Milk.get_dataset()


def simcoll(r=0.9):

    mean = r * d["perc.fat"].values
    sd = np.sqrt((1 - r ** 2) * np.var(d["perc.fat"].values))

    x = tfd.Normal(loc=mean, scale=sd).sample()

    X = np.column_stack((d["perc.fat"], x))
    m = linear_model.OLS(d["kcal.per.g"], X).fit()
    cov = m.cov_params()
    return (np.diag(cov)[1]) ** 0.5


def repsimcoll(r=0.9, N=100):
    stddev = [simcoll(r) for _ in range(N)]
    return np.mean(stddev)


lista = []
for i in np.arange(start=0, stop=0.99, step=0.01):
    lista.append(repsimcoll(r=i, N=100))

plt.plot(np.arange(start=0, stop=0.99, step=0.01), lista)
plt.xlabel("correlation", fontsize=14)
plt.ylabel("stddev", fontsize=14);
../_images/06_the_haunted_dag_and_the_causal_terror_50_0.png

6.2 Post-treatment bias

Omitted variable bias - problems that arise from not including relevant predictor variables.

Post-treatment bias - problems that arise from including variables that are consequences of the treatment.

Code 6.13

Simulate data to see what goes wrong when we include a post-treatment variable

_SEED = 71

# Note - even after providing a SeedStream and generated seeds,
# the sampler still does not fully respect them
def simulate():
    seed = tfp.util.SeedStream(_SEED, salt="sim_heights")

    # number of plants
    N = 100

    # simulate initial heights
    h0 = tfd.Normal(loc=10.0, scale=2.0).sample(N, seed=seed())

    # assign treatments and simulate fungus and growth
    treatment = tf.repeat([0.0, 1.0], repeats=N // 2)

    fungus = tfd.Binomial(total_count=1.0, probs=(0.5 - treatment * 0.4)).sample(
        seed=seed()
    )

    h1 = h0 + tfd.Normal(loc=5.0 - 3.0 * fungus, scale=1.0).sample(seed=seed())

    # compose a clean data frame
    d = {"h0": h0, "h1": h1, "treatment": treatment, "fungus": fungus}

    return d


d_dict = simulate()

az.summary(d_dict, kind="stats", hdi_prob=0.89)
mean sd hdi_5.5% hdi_94.5%
h0 9.894 2.143 6.421 12.93
h1 13.609 2.696 8.782 17.34
treatment 0.500 0.503 0.000 1.00
fungus 0.350 0.479 0.000 1.00
d = pd.DataFrame.from_dict(d_dict)
tdf = dataframe_to_tensors("SimulatedTreatment", d, ["h0", "h1", "treatment", "fungus"])

6.2.1 A prior is born

We will use a prior (see Code 6.14) that expects anything from 40% shrinkage up to 50% growth

Code 6.14

sim_p = tfd.LogNormal(loc=0.0, scale=0.25).sample((int(1e4),))
az.summary({"sim_p": sim_p}, kind="stats", hdi_prob=0.89)
mean sd hdi_5.5% hdi_94.5%
sim_p 1.031 0.261 0.648 1.438

Code 6.15

In this model, only the initial height (h0) is included as a predictor.

def model_6_6(h0):
    def _generator():
        p = yield Root(
            tfd.Sample(tfd.LogNormal(loc=0.0, scale=0.25, name="p"), sample_shape=1)
        )
        sigma = yield Root(
            tfd.Sample(tfd.Exponential(rate=1.0, name="sigma"), sample_shape=1)
        )

        mu = tf.squeeze(h0[tf.newaxis, ...] * p[..., tf.newaxis])
        scale = sigma[..., tf.newaxis]

        h1 = yield tfd.Independent(
            tfd.Normal(loc=mu, scale=scale, name="h1"), reinterpreted_batch_ndims=1
        )

    return tfd.JointDistributionCoroutine(_generator, validate_args=False)


jdc_6_6 = model_6_6(tdf.h0)
NUM_CHAINS_FOR_6_6 = 2

init_state = [tf.ones([NUM_CHAINS_FOR_6_6]), tf.ones([NUM_CHAINS_FOR_6_6])]

bijectors = [tfb.Exp(), tfb.Exp()]

posterior_6_6, trace_6_6 = sample_posterior(
    jdc_6_6,
    observed_data=(tdf.h1,),
    num_chains=NUM_CHAINS_FOR_6_6,
    init_state=init_state,
    bijectors=bijectors,
    num_samples=2000,
    burnin=1000,
    params=["p", "sigma"],
)

az.summary(trace_6_6, hdi_prob=0.89)
mean sd hdi_5.5% hdi_94.5% mcse_mean mcse_sd ess_bulk ess_tail r_hat
p 1.357 0.020 1.324 1.387 0.000 0.000 1981.0 1291.0 1.00
sigma 1.907 0.135 1.691 2.113 0.004 0.003 1040.0 1010.0 1.01

The result shows that, with initial height as the only predictor in the model, there is about 40% growth on average.
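As a quick sanity check (an addition), compare the posterior mean of p with the average empirical growth ratio in the simulated data frame d from Code 6.13:

# model-based vs. empirical average growth ratio
print("posterior mean of p  :", posterior_6_6["p"].numpy().mean())
print("empirical mean h1/h0 :", (d["h1"] / d["h0"]).mean())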

Code 6.16

In this model we include treatment & fungus variables as well

def model_6_7(h0, treatment, fungus):
    def _generator():
        a = yield Root(
            tfd.Sample(tfd.LogNormal(loc=0.0, scale=0.2, name="a"), sample_shape=1)
        )
        bt = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=0.5, name="bt"), sample_shape=1)
        )
        bf = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=0.5, name="bf"), sample_shape=1)
        )
        sigma = yield Root(
            tfd.Sample(tfd.Exponential(rate=1.0, name="sigma"), sample_shape=1)
        )

        p = (
            a[..., tf.newaxis]
            + bt[..., tf.newaxis] * treatment
            + bf[..., tf.newaxis] * fungus
        )

        mu = h0 * p
        scale = sigma[..., tf.newaxis]

        h1 = yield tfd.Independent(
            tfd.Normal(loc=mu, scale=scale, name="h1"), reinterpreted_batch_ndims=1
        )

    return tfd.JointDistributionCoroutine(_generator, validate_args=False)


jdc_6_7 = model_6_7(tdf.h0, tdf.treatment, tdf.fungus)
NUM_CHAINS_FOR_6_7 = 2

init_state = [
    tf.ones([NUM_CHAINS_FOR_6_7]),
    tf.zeros([NUM_CHAINS_FOR_6_7]),
    tf.zeros([NUM_CHAINS_FOR_6_7]),
    tf.ones([NUM_CHAINS_FOR_6_7]),
]

bijectors = [tfb.Exp(), tfb.Identity(), tfb.Identity(), tfb.Exp()]

posterior_6_7, trace_6_7 = sample_posterior(
    jdc_6_7,
    observed_data=(tdf.h1,),
    num_chains=NUM_CHAINS_FOR_6_7,
    init_state=init_state,
    bijectors=bijectors,
    num_samples=2000,
    burnin=1000,
    params=["a", "bt", "bf", "sigma"],
)

az.summary(trace_6_7, hdi_prob=0.89)
mean sd hdi_5.5% hdi_94.5% mcse_mean mcse_sd ess_bulk ess_tail r_hat
a 1.455 0.029 1.409 1.501 0.001 0.001 1039.0 1136.0 1.00
bt -0.011 0.034 -0.070 0.037 0.001 0.001 1254.0 1324.0 1.00
bf -0.277 0.036 -0.328 -0.214 0.001 0.001 1213.0 1114.0 1.00
sigma 1.388 0.099 1.240 1.551 0.005 0.003 410.0 657.0 1.01

Note that a plays the same role as p in the previous model, i.e. about 40% growth on average.

The marginal posterior for bt (the effect of treatment) is about 0, with a tight interval. This implies the treatment is not associated with growth.

Fungus (bf) seems to actually hurt the growth.

Clearly this does not make us happy, because we know that the treatment does matter. The next section explains why.

6.2.2 Blocked by consequence

Fungus is a consequence of treatment i.e. fungus is a post-treatment variable.

When we include the fungus variable, the model is implicitly answering a different question: once we already know whether a plant developed fungus, does the treatment matter? However, what we really want to know is the impact of the treatment on growth.

This can be done by omitting the post-treatment variable fungus

Code 6.17

In this model, fungus predictor is not included

def model_6_8(h0, treatment):
    def _generator():
        a = yield Root(
            tfd.Sample(tfd.LogNormal(loc=0.0, scale=0.2, name="a"), sample_shape=1)
        )
        bt = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=0.5, name="bt"), sample_shape=1)
        )
        sigma = yield Root(
            tfd.Sample(tfd.Exponential(rate=1.0, name="sigma"), sample_shape=1)
        )

        p = a[..., tf.newaxis] + bt[..., tf.newaxis] * treatment

        mu = h0 * p
        scale = sigma[..., tf.newaxis]

        h1 = yield tfd.Independent(
            tfd.Normal(loc=mu, scale=scale, name="h1"), reinterpreted_batch_ndims=1
        )

    return tfd.JointDistributionCoroutine(_generator, validate_args=False)


jdc_6_8 = model_6_8(tdf.h0, tdf.treatment)
NUM_CHAINS_FOR_6_8 = 2

init_state = [
    tf.ones([NUM_CHAINS_FOR_6_8]),
    tf.zeros([NUM_CHAINS_FOR_6_8]),
    tf.ones([NUM_CHAINS_FOR_6_8]),
]

bijectors = [tfb.Exp(), tfb.Identity(), tfb.Exp()]

posterior_6_8, trace_6_8 = sample_posterior(
    jdc_6_8,
    observed_data=(tdf.h1,),
    num_chains=NUM_CHAINS_FOR_6_8,
    init_state=init_state,
    bijectors=bijectors,
    num_samples=2000,
    burnin=1000,
    params=["a", "bt", "sigma"],
)

az.summary(trace_6_8, hdi_prob=0.89)
mean sd hdi_5.5% hdi_94.5% mcse_mean mcse_sd ess_bulk ess_tail r_hat
a 1.287 0.025 1.247 1.326 0.001 0.000 2245.0 1777.0 1.0
bt 0.134 0.036 0.081 0.194 0.001 0.000 3878.0 1516.0 1.0
sigma 1.798 0.128 1.587 1.990 0.004 0.003 1086.0 1481.0 1.0

We can now see that bt (the effect of treatment) is clearly positive, as it should be.

6.2.3 Fungus and d-separation

Above we built models by including and excluding various potential predictors. However, it helps to represent these relationships graphically with a DAG.

Code 6.18

Here we are using the causalgraphicalmodels package

plant_dag = CausalGraphicalModel(
    nodes=["H0", "H1", "F", "T"], edges=[("H0", "H1"), ("F", "H1"), ("T", "F")]
)
pgm = daft.PGM()
coordinates = {"H0": (0, 0), "T": (4, 0), "F": (3, 0), "H1": (2, 0)}
for node in plant_dag.dag.nodes:
    pgm.add_node(node, node, *coordinates[node])
for edge in plant_dag.dag.edges:
    pgm.add_edge(*edge)
pgm.render()
plt.gca().invert_yaxis()
../_images/06_the_haunted_dag_and_the_causal_terror_74_0.png

The graph clearly shows how treatment’s (T) impact on height (H1) is blocked by fungus (F)

Another way to say the above is: conditioning on F induces D-SEPARATION. The “d” stands for directional.

H1 is conditionally independent of T once we condition on F (the blocker).

Code 6.19

all_independencies = plant_dag.get_all_independence_relationships()
for s in all_independencies:
    if all(
        t[0] != s[0] or t[1] != s[1] or not t[2].issubset(s[2])
        for t in all_independencies
        if t != s
    ):
        print(s)
('F', 'H0', set())
('T', 'H1', {'F'})
('T', 'H0', set())

Here is how to read the above result:

  • H0 is independent of T (note the empty conditioning set). In other words, original height H0 should not be associated with T.

  • H0 is independent of F (note the empty conditioning set). In other words, original height H0 should not be associated with F.

  • H1 is conditionally independent of T once we condition on F.
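The two unconditional independencies can be checked directly in the simulated data (a small addition): neither the treatment nor the fungus should be correlated with the initial height h0.

# both correlations should be close to zero
print("corr(h0, treatment):", np.corrcoef(d["h0"], d["treatment"])[0, 1])
print("corr(h0, fungus)   :", np.corrcoef(d["h0"], d["fungus"])[0, 1])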

Code 6.20

We now modify the plant growth simulation so that fungus has no influence on growth, but moisture (M) influences both H1 and F.

_SEED = 71

# Note - even after providing a SeedStream and generated seeds,
# the sampler still does not fully respect them
def simulate():
    seed = tfp.util.SeedStream(_SEED, salt="sim_heights")

    # number of plants
    N = 1000

    # simulate initial heights
    h0 = tfd.Normal(loc=10.0, scale=2.0).sample(N, seed=seed())

    # assign treatments and simulate fungus and growth
    treatment = tf.repeat([0.0, 1.0], repeats=N // 2)

    M = tf.cast(tfd.Bernoulli(probs=0.5).sample(N, seed=seed()), dtype=tf.float32)

    fungus = tfd.Binomial(
        total_count=1.0, probs=(0.5 - treatment * 0.4 + M * 0.4)
    ).sample(seed=seed())

    h1 = h0 + tfd.Normal(loc=5.0 + 3.0 * M, scale=1.0).sample(seed=seed())

    # compose a clean data frame
    d = {"h0": h0, "h1": h1, "treatment": treatment, "fungus": fungus}

    return d


d = simulate()

az.summary(d, kind="stats", hdi_prob=0.89)
mean sd hdi_5.5% hdi_94.5%
h0 9.994 2.06 6.494 12.935
h1 16.544 2.80 12.268 20.809
treatment 0.500 0.50 0.000 1.000
fungus 0.491 0.50 0.000 1.000
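The book then re-fits the earlier models on this new simulation. Below is a sketch of how to do that here (it reuses model_6_7 and sample_posterior from above; the names dm, tdf_m, jdc_6_7m and trace_6_7m are introduced only for this sketch). Because fungus is now a common consequence of treatment and moisture, conditioning on it can once again mislead us about the treatment, even though the treatment has no effect on growth in this simulation.

# re-fit the model that conditions on fungus, using the moisture simulation
dm = pd.DataFrame.from_dict(d)
tdf_m = dataframe_to_tensors(
    "SimulatedTreatmentM", dm, ["h0", "h1", "treatment", "fungus"]
)

jdc_6_7m = model_6_7(tdf_m.h0, tdf_m.treatment, tdf_m.fungus)

init_state_m = [
    tf.ones([NUM_CHAINS_FOR_6_7]),
    tf.zeros([NUM_CHAINS_FOR_6_7]),
    tf.zeros([NUM_CHAINS_FOR_6_7]),
    tf.ones([NUM_CHAINS_FOR_6_7]),
]

bijectors_m = [tfb.Exp(), tfb.Identity(), tfb.Identity(), tfb.Exp()]

_, trace_6_7m = sample_posterior(
    jdc_6_7m,
    observed_data=(tdf_m.h1,),
    num_chains=NUM_CHAINS_FOR_6_7,
    init_state=init_state_m,
    bijectors=bijectors_m,
    num_samples=2000,
    burnin=1000,
    params=["a", "bt", "bf", "sigma"],
)

az.summary(trace_6_7m, hdi_prob=0.89)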

6.3 Collider bias

6.3.1 Collider of false sorrow

The question we consider in this section is: does aging influence happiness?

We also know that happiness (H) & age (A) both cause marriage (M). This makes marriage (M) a collider

H -> M <- A
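For completeness (there is no numbered code section for this drawing in the book), we can sketch this collider DAG with the same tools used elsewhere in this chapter:

dag_happiness = CausalGraphicalModel(
    nodes=["H", "M", "A"], edges=[("H", "M"), ("A", "M")]
)

pgm = daft.PGM()
coordinates = {"H": (0, 0), "M": (1, 0), "A": (2, 0)}
for node in dag_happiness.dag.nodes:
    pgm.add_node(node, node, *coordinates[node])
for edge in dag_happiness.dag.edges:
    pgm.add_edge(*edge)
pgm.render();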

Code 6.21

Simulating (synthetic data generation) happiness and its relationship with age & marriage

  • Each year, 20 people are born with uniformly distributed happiness values

  • Each year, each person ages one year. Happiness does not change.

  • At age 18, individuals can be married. The odds of marriage each year are proportional to an individual’s happiness

  • Once married, an individual remains married

  • After age 65, individuals leave the sample. (They move to Spain)

# Port of R code from here https://github.com/rmcelreath/rethinking/blob/master/R/sim_happiness.R
def sim_happiness(seed=1977, N_years=1000, max_age=65, N_births=20, aom=18):
    # age existing individuals & newborns
    A = np.repeat(np.arange(1, N_years + 1), N_births)
    # sim happiness trait - never changes
    H = np.repeat(np.linspace(-2, 2, N_births)[None, :], N_years, 0).reshape(-1)
    # not yet married
    M = np.zeros(N_years * N_births, dtype=np.uint8)

    def update_M(i, M):
        # for each person over 17, chance get married
        married = tfd.Bernoulli(logits=(H - 4.0)).sample(seed=seed + i).numpy()
        return np.where((A >= i) & (M == 0.0), married, M)

    def fori_loop(lower, upper, body_fun, init_val):
        val = init_val
        for i in range(lower, upper):
            val = body_fun(i, val)
        return val

    M = fori_loop(aom, max_age + 1, update_M, M)

    # mortality
    deaths = A > max_age
    A = A[~deaths]
    H = H[~deaths]
    M = M[~deaths]

    d = pd.DataFrame({"age": A, "married": M, "happiness": H})
    return d


d = sim_happiness(seed=1977, N_years=1000)
d.describe()
age married happiness
count 1300.000000 1300.000000 1.300000e+03
mean 33.000000 0.292308 -8.335213e-17
std 18.768883 0.454998 1.214421e+00
min 1.000000 0.000000 -2.000000e+00
25% 17.000000 0.000000 -1.000000e+00
50% 33.000000 0.000000 -1.110223e-16
75% 49.000000 1.000000 1.000000e+00
max 65.000000 1.000000 2.000000e+00

Code 6.22

Rescaling age so that the range from 18 to 65 is one unit

We are adding a new column called “A” that ranges from 0 to 1 where 0 is age 18 and 1 is age 65

Happiness is on an arbitrary scale from -2 to +2

d2 = d[d.age > 17].copy()  # only adults
d2["A"] = (d2.age - 18) / (65 - 18)

Code 6.23

Here we are doing multiple regression. The linear model is

\(\mu_i = \alpha_{MID[i]} + \beta_A A_i\)

d2["mid"] = d2.married

tdf = dataframe_to_tensors(
    "SimulatedHappiness",
    d2,
    {"mid": tf.int32, "A": tf.float32, "happiness": tf.float32},
)


def model_6_9(mid, A):
    def _generator():
        a = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=1.0, name="a"), sample_shape=2)
        )
        bA = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=2.0, name="bA"), sample_shape=1)
        )
        sigma = yield Root(
            tfd.Sample(tfd.Exponential(rate=1.0, name="sigma"), sample_shape=1)
        )

        mu = tf.squeeze(tf.gather(a, mid, axis=-1)) + bA[..., tf.newaxis] * A
        scale = sigma[..., tf.newaxis]

        h = yield tfd.Independent(
            tfd.Normal(loc=mu, scale=scale, name="h"), reinterpreted_batch_ndims=1
        )

    return tfd.JointDistributionCoroutine(_generator, validate_args=False)


jdc_6_9 = model_6_9(mid=tdf.mid, A=tdf.A)
NUM_CHAINS_FOR_6_9 = 2

init_state = [
    tf.ones([NUM_CHAINS_FOR_6_9, 2]),
    tf.zeros([NUM_CHAINS_FOR_6_9]),
    tf.ones([NUM_CHAINS_FOR_6_9]),
]

bijectors = [tfb.Identity(), tfb.Identity(), tfb.Exp()]

observed_value = (tdf.happiness,)

posterior_6_9, trace_6_9 = sample_posterior(
    jdc_6_9,
    observed_data=observed_value,
    num_chains=NUM_CHAINS_FOR_6_9,
    init_state=init_state,
    bijectors=bijectors,
    num_samples=2000,
    burnin=1000,
    params=["a", "bA", "sigma"],
)

az.summary(trace_6_9, hdi_prob=0.89)
mean sd hdi_5.5% hdi_94.5% mcse_mean mcse_sd ess_bulk ess_tail r_hat
a[0] -0.192 0.063 -0.296 -0.096 0.002 0.001 1067.0 1691.0 1.0
a[1] 1.333 0.087 1.207 1.486 0.003 0.002 892.0 1366.0 1.0
bA -0.827 0.114 -1.019 -0.654 0.004 0.003 760.0 983.0 1.0
sigma 0.990 0.023 0.954 1.023 0.001 0.000 1146.0 1893.0 1.0

The model here is quite sure that age is negatively associated with happiness.

Code 6.24

What would happen if we omit the marriage status?

Note - \(a\) is no longer a vector; it is just a plain intercept.

def model_6_10(A):
    def _generator():
        a = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=1.0, name="a"), sample_shape=1)
        )
        bA = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=2.0, name="bA"), sample_shape=1)
        )
        sigma = yield Root(
            tfd.Sample(tfd.Exponential(rate=1.0, name="sigma"), sample_shape=1)
        )

        mu = a[..., tf.newaxis] + bA[..., tf.newaxis] * A
        scale = sigma[..., tf.newaxis]

        h = yield tfd.Independent(
            tfd.Normal(loc=mu, scale=scale, name="h"), reinterpreted_batch_ndims=1
        )

    return tfd.JointDistributionCoroutine(_generator, validate_args=False)


jdc_6_10 = model_6_10(A=tdf.A)
NUM_CHAINS_FOR_6_10 = 2

init_state = [
    tf.ones([NUM_CHAINS_FOR_6_10]),
    tf.zeros([NUM_CHAINS_FOR_6_10]),
    tf.ones([NUM_CHAINS_FOR_6_10]),
]

bijectors = [tfb.Identity(), tfb.Identity(), tfb.Exp()]

observed_value = (tdf.happiness,)

posterior_6_10, trace_6_10 = sample_posterior(
    jdc_6_10,
    observed_data=observed_value,
    num_chains=NUM_CHAINS_FOR_6_10,
    init_state=init_state,
    bijectors=bijectors,
    num_samples=2000,
    burnin=1000,
    params=["a", "bA", "sigma"],
)

az.summary(trace_6_10, hdi_prob=0.89)
mean sd hdi_5.5% hdi_94.5% mcse_mean mcse_sd ess_bulk ess_tail r_hat
a 0.002 0.077 -0.122 0.122 0.002 0.002 1017.0 1369.0 1.0
bA -0.005 0.133 -0.216 0.196 0.004 0.003 927.0 1288.0 1.0
sigma 1.216 0.028 1.172 1.262 0.000 0.000 4132.0 1532.0 1.0

The value of \(bA\) is now close to zero. This implies that the model finds no association between age and happiness.

The pattern above is exactly what we should expect when we condition on a collider. Here the collider is marriage status: it is a common consequence of age and happiness, so conditioning on it induces a spurious association between the two causes.
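A quick numerical illustration (an addition): in the full adult sample, age and happiness are essentially uncorrelated, but within each marriage-status group they are negatively correlated.

# correlation of age and happiness: overall and within each marriage-status group
print("all adults :", np.corrcoef(d2["A"], d2["happiness"])[0, 1])
for m in (0, 1):
    sub = d2[d2.mid == m]
    print("mid =", m, ":", np.corrcoef(sub["A"], sub["happiness"])[0, 1])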

6.3.2 The haunted DAG

Colliders are not always easy to avoid, because they can be created by unmeasured causes. “Unmeasured” => our DAG is haunted.

Code 6.25

Here we simulate another dataset, with which we will try to infer the direct influence of both parents (P) and grandparents (G) on the educational achievement of children (C).

It should be obvious that G -> P, P -> C and G -> C. There is also an unobserved variable U that influences both P and C.

N = 200  # number of grandparent-parent-child triads
b_GP = 1.0  # direct effect of G on P
b_GC = 0.0  # direct effect of G on C
b_PC = 1.0  # direct effect of P on C
b_U = 2.0  # direct effect of U on P and C

Code 6.26

Here are the functional relationships (a DAG sketch follows the list):

  • P is some function of G & U

  • C is some function of G, P & U

  • G & U are not functions of any other known variables

def grand_parent_sim():
    seed = tfp.util.SeedStream(1, salt="sim_grand_parent")
    U = 2 * tfd.Bernoulli(probs=0.5).sample(seed=seed(), sample_shape=N) - 1
    G = tfd.Normal(loc=0.0, scale=1.0).sample(seed=seed(), sample_shape=N)
    temp = b_GP * G + b_U * tf.cast(U, dtype=tf.float32)
    P = tfd.Normal(loc=temp, scale=1.0).sample(seed=seed())
    temp2 = b_PC * P + b_GC * G + b_U * tf.cast(U, dtype=tf.float32)
    C = tfd.Normal(loc=temp2, scale=1.0).sample(seed=seed())

    d = pd.DataFrame({"C": C.numpy(), "P": P.numpy(), "G": G.numpy(), "U": U.numpy()})

    return d


d = grand_parent_sim()

d
C P G U
0 3.512862 1.650976 -1.042326 1
1 -4.590997 -1.555232 0.232263 -1
2 4.886199 3.082321 1.986989 1
3 1.636396 1.243976 -1.034003 1
4 -8.797684 -5.349887 -3.269576 -1
... ... ... ... ...
195 3.745021 0.391891 0.254253 1
196 -4.513893 -2.212740 -1.100699 -1
197 -3.206512 -0.455612 0.602985 -1
198 1.082468 -0.115803 0.106212 1
199 -2.545109 0.838045 2.178708 -1

200 rows × 4 columns

tdf = dataframe_to_tensors("SimulatedEducation", d, ["P", "G", "C", "U"])

Code 6.27

Simple regression of C on P & G

def model_6_11(P, G):
    def _generator():
        a = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=1.0, name="a"), sample_shape=1)
        )
        b_PC = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=1.0, name="b_PC"), sample_shape=1)
        )
        b_GC = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=1.0, name="b_GC"), sample_shape=1)
        )
        sigma = yield Root(
            tfd.Sample(tfd.Exponential(rate=1.0, name="sigma"), sample_shape=1)
        )

        mu = a[..., tf.newaxis] + b_PC[..., tf.newaxis] * P + b_GC[..., tf.newaxis] * G
        scale = sigma[..., tf.newaxis]

        h = yield tfd.Independent(
            tfd.Normal(loc=mu, scale=scale, name="h"), reinterpreted_batch_ndims=1
        )

    return tfd.JointDistributionCoroutine(_generator, validate_args=False)


jdc_6_11 = model_6_11(P=tdf.P, G=tdf.G)
NUM_CHAINS_FOR_6_11 = 2

init_state = [
    tf.zeros([NUM_CHAINS_FOR_6_11]),
    tf.ones([NUM_CHAINS_FOR_6_11]),
    tf.ones([NUM_CHAINS_FOR_6_11]),
    tf.ones([NUM_CHAINS_FOR_6_11]),
]

bijectors = [tfb.Identity(), tfb.Identity(), tfb.Identity(), tfb.Exp()]

observed_value = (tdf.C,)

posterior_6_11, trace_6_11 = sample_posterior(
    jdc_6_11,
    observed_data=observed_value,
    num_chains=NUM_CHAINS_FOR_6_11,
    init_state=init_state,
    bijectors=bijectors,
    num_samples=4000,
    burnin=1000,
    params=["a", "bPC", "bGC", "sigma"],
)

az.summary(trace_6_11, hdi_prob=0.89)
mean sd hdi_5.5% hdi_94.5% mcse_mean mcse_sd ess_bulk ess_tail r_hat
a 0.056 0.096 -0.084 0.224 0.001 0.002 8955.0 1666.0 1.0
bPC 1.818 0.045 1.746 1.889 0.001 0.000 4429.0 3198.0 1.0
bGC -0.897 0.101 -1.065 -0.750 0.001 0.001 7395.0 1906.0 1.0
sigma 1.382 0.072 1.263 1.491 0.003 0.002 791.0 1431.0 1.0

The inferred effect of parents looks too big, almost twice as large as it should be (about 1.8 vs. the true value of 1).

Some correlation between P & C is due to U and that’s a simple confound.

Interestingly, the model is confident (bGC) that the direct effect of grandparents is to hurt their grandkids.

Code 6.28

Regression that conditions on U as well

def model_6_12(P, G, U):
    def _generator():
        a = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=1.0, name="a"), sample_shape=1)
        )
        b_PC = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=1.0, name="b_PC"), sample_shape=1)
        )
        b_GC = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=1.0, name="b_GC"), sample_shape=1)
        )
        b_U = yield Root(
            tfd.Sample(tfd.Normal(loc=0.0, scale=1.0, name="b_U"), sample_shape=1)
        )
        sigma = yield Root(
            tfd.Sample(tfd.Exponential(rate=1.0, name="sigma"), sample_shape=1)
        )

        mu = (
            a[..., tf.newaxis]
            + b_PC[..., tf.newaxis] * P
            + b_GC[..., tf.newaxis] * G
            + b_U[..., tf.newaxis] * U
        )

        scale = sigma[..., tf.newaxis]

        h = yield tfd.Independent(
            tfd.Normal(loc=mu, scale=scale, name="h"), reinterpreted_batch_ndims=1
        )

    return tfd.JointDistributionCoroutine(_generator, validate_args=False)


jdc_6_12 = model_6_12(P=tdf.P, G=tdf.G, U=tdf.U)
NUM_CHAINS_FOR_6_12 = 2

init_state = [
    tf.zeros([NUM_CHAINS_FOR_6_12]),
    tf.zeros([NUM_CHAINS_FOR_6_12]),
    tf.zeros([NUM_CHAINS_FOR_6_12]),
    tf.ones([NUM_CHAINS_FOR_6_12]),  # <-- note 1 instead of zero here. Necessary !!
    tf.ones([NUM_CHAINS_FOR_6_12]),
]

bijectors = [tfb.Identity(), tfb.Identity(), tfb.Identity(), tfb.Identity(), tfb.Exp()]

observed_value = (tdf.C,)

posterior_6_12, trace_6_12 = sample_posterior(
    jdc_6_12,
    observed_data=observed_value,
    num_chains=NUM_CHAINS_FOR_6_12,
    init_state=init_state,
    bijectors=bijectors,
    num_samples=4000,
    burnin=1000,
    params=["a", "bPC", "bGC", "bU", "sigma"],
)

az.summary(trace_6_12, hdi_prob=0.89)
mean sd hdi_5.5% hdi_94.5% mcse_mean mcse_sd ess_bulk ess_tail r_hat
a -0.060 0.074 -0.187 0.050 0.001 0.001 8809.0 2651.0 1.0
bPC 0.930 0.081 0.799 1.056 0.002 0.002 1056.0 1438.0 1.0
bGC 0.079 0.110 -0.101 0.251 0.003 0.002 1375.0 1774.0 1.0
bU 2.141 0.176 1.860 2.422 0.005 0.004 1029.0 1376.0 1.0
sigma 1.032 0.053 0.952 1.120 0.001 0.000 10612.0 1964.0 1.0

These are similar to the slopes we used to simulate (compare to the constants in Code 6.25)

This is an example of SIMPSON’S PARADOX. Usually Simpson’s paradox is presented in cases where adding the new predictor helps us. But in this case it misleads us.

6.4 Confronting confounding

We have now seen that, in multiple regression, controlling for the wrong variables can ruin the inference.

6.4.1 Shutting the backdoor

Blocking all confounding paths between some predictor X & some outcome Y is known as shutting the BACKDOOR

In simple words, we do not want any spurious correlation sneaking in through a non-causal path.

We will create the graphs for 4 elemental confounds.

Note that there is no corresponding code section in the book for these drawings, and hence these sections are not marked with a Code 6.X heading.

The Fork

dag_the_fork = CausalGraphicalModel(
    nodes=["X", "Y", "Z"], edges=[("Z", "X"), ("Z", "Y")]
)

pgm = daft.PGM()
coordinates = {"X": (0, 0), "Y": (2, 0), "Z": (1, 1)}
for node in dag_the_fork.dag.nodes:
    pgm.add_node(node, node, *coordinates[node])
for edge in dag_the_fork.dag.edges:
    pgm.add_edge(*edge)
pgm.render();
../_images/06_the_haunted_dag_and_the_causal_terror_114_0.png

The fork is a classic confounder. Z is a common cause of X & Y, so X and Y are correlated. Once we condition on Z, however, learning X tells us nothing more about Y.

The Pipe

dag_the_pipe = CausalGraphicalModel(
    nodes=["X", "Y", "Z"], edges=[("X", "Z"), ("Z", "Y")]
)

pgm = daft.PGM()
coordinates = {"X": (0, 0), "Y": (2, 0), "Z": (1, 0)}
for node in dag_the_pipe.dag.nodes:
    pgm.add_node(node, node, *coordinates[node])
for edge in dag_the_pipe.dag.edges:
    pgm.add_edge(*edge)
pgm.render();
../_images/06_the_haunted_dag_and_the_causal_terror_117_0.png

If we condition on Z we block the path from X to Y. Recall the treatment & fungus experiment

The Collider

dag_the_collider = CausalGraphicalModel(
    nodes=["X", "Y", "Z"], edges=[("X", "Z"), ("Y", "Z")]
)

pgm = daft.PGM()
coordinates = {"X": (0, 0), "Y": (2, 0), "Z": (1, 0)}
for node in dag_the_collider.dag.nodes:
    pgm.add_node(node, node, *coordinates[node])
for edge in dag_the_collider.dag.edges:
    pgm.add_edge(*edge)
pgm.render();
../_images/06_the_haunted_dag_and_the_causal_terror_120_0.png

X and Y are independent until we condition on the collider Z; conditioning on Z opens the path, i.e. information flows between X and Y.

The Descendant

dag_the_descendant = CausalGraphicalModel(
    nodes=["X", "Y", "Z", "D"], edges=[("X", "Z"), ("Y", "Z"), ("Z", "D")]
)

pgm = daft.PGM()
coordinates = {"X": (0, 1), "Y": (2, 1), "Z": (1, 0), "D": (1, -1)}
for node in dag_the_descendant.dag.nodes:
    pgm.add_node(node, node, *coordinates[node])
for edge in dag_the_descendant.dag.edges:
    pgm.add_edge(*edge)
pgm.render();
../_images/06_the_haunted_dag_and_the_causal_terror_123_0.png

Here D is the descendant variable. Conditioning on D partially conditions on its parent Z; since Z is a collider in this DAG, conditioning on D partially opens the path between X and Y.

6.4.2 Two roads

The DAG shown below (Code 6.29) includes U, an unobserved variable.

We are interested in the effect of X on Y, but which observed covariates (A, B, C) do we need to add to the model?

Code 6.29

dag_6_1 = CausalGraphicalModel(
    nodes=["X", "Y", "C", "U", "B", "A"],
    edges=[
        ("X", "Y"),
        ("U", "X"),
        ("A", "U"),
        ("A", "C"),
        ("C", "Y"),
        ("U", "B"),
        ("C", "B"),
    ],
)


# let's render the DAG as well (note - no code section for it in the book but is displayed)
# so instead of copying the picture we are generating it
pgm = daft.PGM()
coordinates = {
    "X": (0, 3),
    "Y": (2, 3),
    "U": (0, 1),
    "C": (2, 1),
    "A": (1, 0),
    "B": (1, 2),
}
for node in dag_6_1.dag.nodes:
    pgm.add_node(node, node, *coordinates[node])
for edge in dag_6_1.dag.edges:
    pgm.add_edge(*edge)
pgm.render();
../_images/06_the_haunted_dag_and_the_causal_terror_127_0.png
all_adjustment_sets = dag_6_1.get_all_backdoor_adjustment_sets("X", "Y")
for s in all_adjustment_sets:
    if all(not t.issubset(s) for t in all_adjustment_sets if t != s):
        if s != {"U"}:
            print(s)
frozenset({'A'})
frozenset({'C'})

Read the above result as: conditioning on either A or C suffices to shut the backdoor.

6.4.3 Backdoor waffles

Let’s consider the Waffle House & divorce rate correlation; with the help of a graph we can find the covariates we should include in our model.

Code 6.30

  • S => Whether or not state is in southern USA

  • A => Median Age

  • M => Marriage Rate

  • W => Number of Waffle Houses

  • D => Divorce Rate

dag_6_2 = CausalGraphicalModel(
    nodes=["S", "A", "D", "M", "W"],
    edges=[
        ("S", "A"),
        ("A", "D"),
        ("S", "M"),
        ("M", "D"),
        ("S", "W"),
        ("W", "D"),
        ("A", "M"),
    ],
)

# let's render the DAG as well (note - no code section for it in the book but is displayed)
# so instead of copying the picture we are generating it

pgm = daft.PGM()
coordinates = {
    "S": (0, 0),
    "W": (2, 0),
    "A": (0, 2),
    "D": (2, 2),
    "M": (1, 1),
}
for node in dag_6_2.dag.nodes:
    pgm.add_node(node, node, *coordinates[node])
for edge in dag_6_2.dag.edges:
    pgm.add_edge(*edge)
pgm.render();
../_images/06_the_haunted_dag_and_the_causal_terror_132_0.png
all_adjustment_sets = dag_6_2.get_all_backdoor_adjustment_sets("W", "D")
for s in all_adjustment_sets:
    if all(not t.issubset(s) for t in all_adjustment_sets if t != s):
        print(s)
frozenset({'A', 'M'})
frozenset({'S'})

This shows that we could control either for S alone, or for both A and M.

Code 6.31

The above DAG may not be satisfactory, as it assumes that there are no unobserved confounds.

We can further test the graph by finding the conditional independencies it implies.

all_independencies = dag_6_2.get_all_independence_relationships()
for s in all_independencies:
    if all(
        t[0] != s[0] or t[1] != s[1] or not t[2].issubset(s[2])
        for t in all_independencies
        if t != s
    ):
        print(s)
('A', 'W', {'S'})
('S', 'D', {'W', 'A', 'M'})
('M', 'W', {'S'})

Read the above result as:

  • Marriage rate & Waffle Houses should be independent conditional on S

  • Median age & Waffle Houses should be independent conditional on S

  • Being in the south & divorce rate should be independent conditional on W, M and A
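As a small addition, the same package can verify an individual implied independence directly via d-separation (this assumes the is_d_separated method is available in this version of causalgraphicalmodels):

# M and W should be d-separated once we condition on S
dag_6_2.is_d_separated("M", "W", {"S"})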