Coarsened Exact Matching

💡 Model

CEM is a nonparametric data preprocessing algorithm for causal inference that is broadly applicable to observational data. With CEM, you can turn observational data into ‘quasi’-experimental data, reducing baseline differences so that the treated and control groups are comparable.

When conducting CEM, each sample is represented by its confounder values coarsened to discrete values using a coarsening (binning) strategy. Each sample thus receives a “BIN Signature”, and samples with exactly the same signature are matched into the same group (stratum) \(s \in S\). We denote the treated units in group \(s\) by \(T^s\) and their number by \(m_{T}^s\); similarly, \(C^s\) and \(m_{C}^s\) denote the control units in group \(s\) and their number. \(m_{T}\) and \(m_{C}\) are the total numbers of matched treated and control units, respectively. To each matched unit \(i\) in stratum \(s\), CEM assigns the following weight, which is then used when estimating the average treatment effect (Iacus, King, & Porro, 2012); unmatched units receive weight 0.

\[\begin{split}w_{i} = \begin{cases} 1, i \in T^s \\ \frac{m_{C}}{m_{T}} \frac{m_{T}^s}{m_{C}^s} , i \in C^s \end{cases}\end{split}\]
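
As a minimal sketch of the weighting rule above (not the package's implementation), the following pure-Python example builds BIN signatures for a toy dataset, drops strata that lack either treated or control units, and computes \(w_i\). The covariates, cut points, and variable names are made up for illustration.

```python
from bisect import bisect_right
from collections import defaultdict

# Toy data: (age, income, treated). All values are illustrative.
units = [
    (23, 31_000, 1), (27, 35_000, 0), (29, 33_000, 0),
    (41, 52_000, 1), (44, 58_000, 1), (47, 55_000, 0),
    (63, 90_000, 0),  # no treated unit shares this signature
]

# Hypothetical cut points; each value is replaced by the index of its bin.
cutpoints = {"age": [30, 50], "income": [40_000, 70_000]}

def signature(age, income):
    """Return the BIN signature: one bin index per coarsened covariate."""
    return (bisect_right(cutpoints["age"], age),
            bisect_right(cutpoints["income"], income))

# Group unit indices by signature, separately for treated (T) and control (C).
strata = defaultdict(lambda: {"T": [], "C": []})
for i, (age, income, t) in enumerate(units):
    strata[signature(age, income)]["T" if t else "C"].append(i)

# A stratum is matched only if it holds both treated and control units.
matched = {s: g for s, g in strata.items() if g["T"] and g["C"]}
m_T = sum(len(g["T"]) for g in matched.values())
m_C = sum(len(g["C"]) for g in matched.values())

weights = {}
for g in matched.values():
    for i in g["T"]:
        weights[i] = 1.0
    for i in g["C"]:
        weights[i] = (m_C / m_T) * (len(g["T"]) / len(g["C"]))

print(weights)  # {0: 1.0, 1: 0.5, 2: 0.5, 3: 1.0, 4: 1.0, 5: 2.0}
```

Unit 6 is left out because its stratum contains no treated unit. The control weights sum to \(m_{C}\), so the weighted control group has the same distribution over strata as the treated group.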

When the matched data are exactly balanced, there is no need to control for \(X\), because it is unrelated to the treatment variable. The causal effect can then be estimated simply as the (weighted) difference in means on the matched data. However, if the matched data are only approximately balanced, it is necessary to control for \(X\) with a model (Ho, Imai, King, & Stuart, 2007). More details can be found in the inference part, and the next section shows how to measure the balance of your data.
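
A minimal sketch of the difference-in-means estimate on matched data, assuming each unit already carries a CEM weight \(w_i\) as defined by the formula above. The function name and the toy outcomes are hypothetical; the package's own inference routine may differ.

```python
def weighted_diff_in_means(y, t, w):
    """Weighted mean outcome of treated units minus that of control units."""
    treated = [(yi, wi) for yi, ti, wi in zip(y, t, w) if ti == 1]
    control = [(yi, wi) for yi, ti, wi in zip(y, t, w) if ti == 0]
    mean_t = sum(yi * wi for yi, wi in treated) / sum(wi for _, wi in treated)
    mean_c = sum(yi * wi for yi, wi in control) / sum(wi for _, wi in control)
    return mean_t - mean_c

# Toy matched sample: outcomes, treatment indicators, and CEM weights.
y = [5.0, 2.0, 3.0, 9.0, 8.0, 6.0]
t = [1, 0, 0, 1, 1, 0]
w = [1.0, 0.5, 0.5, 1.0, 1.0, 2.0]

att_hat = weighted_diff_in_means(y, t, w)
print(round(att_hat, 4))  # 2.5
```

Because treated units keep weight 1 and controls are reweighted to the treated distribution, this quantity is best read as an estimate of the effect on the treated.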

Assumptions

  • Conditional on \(X\), the treatment variable is independent of the potential outcomes.

\[T_{i} \perp\!\!\!\perp {(Y_i(0), Y_i(1))} \mid X\]

Advantages

  • Easy to understand and highly interpretable.

  • Makes no assumptions about the data-generating process.

  • Mitigates model dependence, bias, and inefficiency in treatment effect estimation (Ho, Imai, King, & Stuart, 2007).

Limitations

  • There might be omitted confounders, which can bias the estimated treatment effect and even lead to a contradictory conclusion. The impact of such confounders can be assessed with a sensitivity analysis.

  • Choosing an appropriate coarsening is the primary issue to consider when running CEM (Iacus, King, & Porro, 2012). You can set the coarsening manually based on your understanding of the covariates, or use the schema-tuning function in this package.

  • CEM belongs to the class of monotonic imbalance bounding methods and is one of the simplest of them. Other matching methods can still be applied within each CEM group to refine the match for a specific application (Iacus, King, & Porro, 2012). In this package, you can conduct 1-k matching based on CEM. Details can be found in the following tutorial.

⌨️ Example

Import and Data Preparation

First, you should import the class cem and the function data_generation from our package CEM_LinearInf.

from CEM_LinearInf.cem import cem
from CEM_LinearInf.data_generation import data_generation

You can generate a dataset directly with the data_generation function, which lets you set the sample size, treatment probability, average treatment effect, and parameters for covariates and confounders. Note that here the result variable \(Y\) depends linearly on the control variables \(X\) and the treatment variable \(T\).

df = data_generation(n=10000, # sample size
                 p=0.2,   # P(T=1)
                 att=3,   # true average treatment effect on the treated
                 x_cont=[0,1,6], # generate 6 continuous variables X following the normal distribution N(0, 1)
                 x_cate=[2, 4, 4], # generate 3 categorical variables X with 2, 4, 4 categories respectively
                 con_x=[(0, 3), (1, -2), (2, 1), (6, 2.5), (8, 1.5)] # X1, X2, X3, X7, X9 are confounders and
                 )                                                   # their effects on T are 3, -2, 1, 2.5, 1.5 respectively

df.head()
df

          X1         X2         X3         X4         X5         X6   X7   X8   X9  T          Y
0  -1.127227  -1.581226  -0.655666  -0.190080   0.931387  -0.588746  0.0  0.0  3.0  1  -3.788854
1  -0.350291  -0.696851   1.826706  -0.755002   1.492192   0.298964  0.0  3.0  3.0  0  10.432177
2  -1.107880  -0.004073  -1.816644  -0.256589   0.564622  -0.383192  0.0  2.0  2.0  0  13.527538
3  -1.439544   0.115199  -0.124043   1.287961   0.178038   0.965040  0.0  0.0  1.0  0   2.413546
4   0.397636  -0.012467   0.641735  -1.049706  -0.581218  -1.432816  0.0  0.0  1.0  1  -4.718393
5  -1.373454   1.413849   1.480585  -0.198621   0.431254   0.379409  0.0  0.0  1.0  0   7.057910

Fit CEM Model

Next, create your own cem object, giving it your dataframe, the column names of the confounders, the continuous confounders, the result variable Y and the treatment variable T.

confounder_cols = ['X1','X2','X3','X7', 'X9']
cont_confounder_cols = ['X1','X2','X3']
my_cem = cem(df = df, # dataframe to be matched
            confounder_cols = confounder_cols, # list of confounders' column names
            cont_confounder_cols = cont_confounder_cols, # list of continuous confounders' column names
            col_y = 'Y', # column name of result variable
            col_t = 'T' # column name of treatment variable
            )

The cem object can print a summary of your dataset.

my_cem.summary()
Descriptive Statistics of the dataframe:

                X1           X2           X3           X4           X5  \
count  10,000.0000  10,000.0000  10,000.0000  10,000.0000  10,000.0000
mean       -0.0160       0.0213      -0.0013       0.0001      -0.0144
std         0.9963       1.0065       0.9986       0.9955       0.9841
min        -3.5670      -4.2668      -4.7132      -4.0806      -3.4952
25%        -0.6865      -0.6595      -0.6691      -0.6720      -0.6779
50%        -0.0231       0.0113      -0.0048      -0.0038      -0.0129
75%         0.6568       0.7026       0.6661       0.6730       0.6519
max         3.6061       3.7632       4.1706       3.9502       3.7476

                X6           X7           X8           X9            T  \
count  10,000.0000  10,000.0000  10,000.0000  10,000.0000  10,000.0000
mean       -0.0016       0.4954       1.5062       1.4953       0.1533
std         1.0056       0.5000       1.1230       1.1172       0.3603
min        -3.8068       0.0000       0.0000       0.0000       0.0000
25%        -0.6853       0.0000       1.0000       0.0000       0.0000
50%        -0.0034       0.0000       1.0000       2.0000       0.0000
75%         0.6783       1.0000       3.0000       2.0000       0.0000
max         4.5216       1.0000       3.0000       3.0000       1.0000

                Y
count  10,000.0000
mean        6.4324
std         9.5579
min       -29.9051
25%        -0.0168
50%         6.3347
75%        12.9615
max        43.8527

Control group vs. Experimental group

n_samples    mean_Y
0       8467  6.277839
1       1533  7.286350

T-test of Experimental group Y and Control group Y

att estimate (p-value): 1.0085(0.0001)
The difference between Experimental group Y and Control group Y is significant, and the difference is 1.0085.
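
The unadjusted estimate in this summary can be reproduced from the two group means. As a quick arithmetic check (values copied from the output above), it falls well short of the att=3 used to generate the data, which is presumably the effect of the confounders that matching is meant to remove.

```python
mean_y_control = 6.277839  # mean_Y for T = 0, from the summary output
mean_y_treated = 7.286350  # mean_Y for T = 1, from the summary output
true_att = 3               # value passed to data_generation in this example

naive_diff = mean_y_treated - mean_y_control
print(round(naive_diff, 4))             # 1.0085
print(round(true_att - naive_diff, 4))  # 1.9915
```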

Next, we can try matching the dataset using the match function with default parameters.

my_cem.match()

After the default coarsened exact matching, 82.84% treated samples are matched.

Matching result

    all  matched  propotion
0  8467     3338     0.3942
1  1533     1270     0.8284

Moreover, we can customize the coarsening schema to improve the matching result. With a suitable coarsening schema, the matched result has a smaller L1 imbalance score and more matched samples.
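
For intuition about the L1 imbalance score, here is a minimal sketch following the definition in Iacus, King, & Porro (2012): half the sum of absolute differences between the treated and control relative frequencies over the cells of a multivariate histogram. The function name is hypothetical, and for simplicity the cells are the BIN signatures themselves; the package may use a different binning when it computes the score.

```python
from collections import Counter

def l1_imbalance(signatures, treatment, weights=None):
    """0.5 * sum over cells of |treated share - control share| (0 = balanced)."""
    if weights is None:
        weights = [1.0] * len(signatures)
    f, g = Counter(), Counter()  # weighted cell counts: treated, control
    for s, t, w in zip(signatures, treatment, weights):
        (f if t == 1 else g)[s] += w
    n_t, n_c = sum(f.values()), sum(g.values())
    cells = set(f) | set(g)
    return 0.5 * sum(abs(f[c] / n_t - g[c] / n_c) for c in cells)

# Toy data: one signature and one treatment indicator per unit.
sig = [(0, 0), (0, 0), (0, 0), (1, 1), (1, 1), (1, 1), (2, 2)]
t = [1, 0, 0, 1, 1, 0, 0]

before = l1_imbalance(sig, t)

# After CEM: the unmatched last unit is dropped, controls carry CEM weights.
cem_w = [1.0, 0.5, 0.5, 1.0, 1.0, 2.0]
after = l1_imbalance(sig[:6], t[:6], cem_w)

print(round(before, 4), round(after, 4))  # 0.4167 0.0
```

On the cells used for matching, the weighted score is zero by construction. A finer schema tends to lower imbalance on the original scale but leaves fewer matched samples, which is the trade-off tuned in the methods below.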

  • Method 1:

    You can pass a schema dictionary indicating how to coarsen each continuous confounder X if you have a thorough understanding of your dataset.

    The following cutting methods can be chosen:

    • cut: Bin values into discrete intervals with the same length.

    • qcut: Discretize variable into equal-sized buckets based on rank or based on sample quantiles.

    • struges: Bin values into \(k\) discrete intervals with the same length according to Sturges' rule, where \(n\) is the sample size.

\[k = \lceil \log_2 n + 1 \rceil\]

my_cem.match(schema = {'cut': 4})

  • Method 2:

    You can also use the tunning_schema function to tune the coarsening schema automatically.

l1, schema = my_cem.tunning_schema(step = 4)
my_cem.match(schema = schema)

Matching result

    all  matched  propotion
0  8467     5763     0.6806
1  1533     1431     0.9335
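
The cutting methods listed under Method 1 can be illustrated without the package. The sketch below, using only the Python standard library, computes the number of bins given by Sturges' rule for this example's sample size, equal-width bin edges (the idea behind cut), and quantile cut points (the idea behind qcut). The helper names are hypothetical, and the package's own binning may handle edges differently.

```python
import math
import statistics

def sturges_bins(n):
    """Number of bins k = ceil(log2(n) + 1) for a sample of size n."""
    return math.ceil(math.log2(n) + 1)

def equal_width_edges(values, k):
    """k intervals of the same length spanning [min, max]: k + 1 edges."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(k + 1)]

def quantile_cutpoints(values, k):
    """k - 1 interior cut points splitting the sample into k equal-sized groups."""
    return statistics.quantiles(values, n=k)

k = sturges_bins(10_000)
print(k)                                  # 15
print(equal_width_edges([0.0, 10.0], 4))  # [0.0, 2.5, 5.0, 7.5, 10.0]

sample = [0.1, 0.4, 0.5, 0.9, 1.3, 2.0, 3.5, 8.0]
print(len(quantile_cutpoints(sample, 4)))  # 3
```

Equal-width bins are easy to interpret but can leave sparse bins in the tails of a skewed covariate; quantile bins keep bin counts even at the cost of uneven interval lengths.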

CEM combined with other Matching methods

As noted above, applying established matching techniques within each CEM group can further improve the in-group balance. Inspired by the K Nearest Neighbor algorithm, within each stratum a treated sample is matched with the \(k\) control samples that are nearest to it in distance or propensity score.

my_cem_k2k = cem(df, confounder_cols, cont_confounder_cols)
my_cem_k2k.match(k2k_ratio = 1, dist = 'psm')
# my_cem_k2k.match(k2k_ratio = 1, dist = 'euclidean')
# my_cem_k2k.match(k2k_ratio = 1, dist = 'mahalanobis')

Matching result

    all  matched  propotion
0  8467     1270     0.1500
1  1533     1270     0.8284
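
To make the within-stratum step concrete, here is a minimal sketch of 1-to-k nearest-neighbor matching inside a single stratum using Euclidean distance. It assumes greedy matching without replacement; the function name and toy covariates are hypothetical, and the package's procedure (for example its tie-breaking or propensity-score distance) may differ.

```python
import math

def nearest_controls(treated, controls, k=1):
    """Greedily assign each treated unit its k nearest unused controls.

    treated, controls: dict mapping unit id -> covariate tuple,
    all taken from the same CEM stratum.
    """
    available = dict(controls)
    pairs = {}
    for t_id, t_x in treated.items():
        # Rank the remaining controls by Euclidean distance to this unit.
        ranked = sorted(available, key=lambda c: math.dist(t_x, available[c]))
        chosen = ranked[:k]
        pairs[t_id] = chosen
        for c in chosen:
            del available[c]  # without replacement
    return pairs

treated = {"t1": (0.1, 0.2), "t2": (0.9, 0.8)}
controls = {"c1": (0.0, 0.0), "c2": (1.0, 1.0), "c3": (0.5, 0.5)}

print(nearest_controls(treated, controls, k=1))
# {'t1': ['c1'], 't2': ['c2']}
```

With k = 1 every matched treated unit keeps exactly one control, which is consistent with the equal treated and control counts in the matching result above.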

⭐️ Reference

  • Ho, D., Imai, K., King, G., & Stuart, E. (2007). Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference. Political Analysis, 15, 199–236. Retrieved from https://tinyurl.com/y4xtv32s

  • Ho, D., Imai, K., King, G., & Stuart, E. (2011). MatchIt: Nonparametric Preprocessing for Parametric Causal Inference. Journal of Statistical Software, 42(8), 1–28. https://doi.org/10.18637/jss.v042.i08

  • Iacus, S. M., King, G., & Porro, G. (2012). Causal Inference Without Balance Checking: Coarsened Exact Matching. Political Analysis, 20(1), 1–24. Retrieved from https://tinyurl.com/yydq5enf