Encoding design matrices in Patsy

Some of us have seen the connection between ANOVA and linear regression (see here for a more detailed explanation). To draw the equivalence between the two, we need a design matrix.

For instance, suppose we have a series of observations, each labeled A, B, or C, as follows

\[ \{A, B, C, A, A, B, C\}\]

If we wanted to reformulate this as an ANOVA-style test, we could compare A vs B and A vs C. We can encode that design matrix as follows

\[ \begin{bmatrix} 0 & 1 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 1 \\ \end{bmatrix} \]

In the first row of the matrix, only the entries with B are labeled (since we are doing a comparison between A and B). In the second row, only the entries with C are labeled. Since we have set A to implicitly be the reference, there is no row corresponding to A.
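The two indicator rows described above can also be built by hand, which makes the encoding rule explicit. This is only a minimal sketch using NumPy; the names `observations`, `reference`, `levels`, and `indicators` are chosen here for illustration and do not come from Patsy.

```python
import numpy as np

# The observed categories, in the same order as the series above.
observations = np.array(["A", "B", "C", "A", "A", "B", "C"])

# A is the implicit reference, so it gets no row of its own.
reference = "A"
levels = [lvl for lvl in sorted(set(observations.tolist())) if lvl != reference]

# One row per non-reference level: 1 where the observation has that level, else 0.
indicators = np.array([(observations == lvl).astype(int) for lvl in levels])

print(levels)
print(indicators)
```

The first row marks the positions of B (the second and sixth observations) and the second row marks the positions of C (the third and seventh).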

If we want to explicitly derive this design matrix in patsy, we can do it as follows

import pandas as pd
from patsy import dmatrix
formula = "category"
covariates = pd.DataFrame(
    {'category': ['A', 'B', 'C', 'A', 'A', 'B', 'C']})
design_matrix = dmatrix(formula, covariates, 
    return_type='dataframe')

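This will give the following design matrix. Patsy lays it out with one row per observation (the transpose of the matrix above) and adds an intercept column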

>>> design_matrix
   Intercept  category[T.B]  category[T.C]
0        1.0            0.0            0.0
1        1.0            1.0            0.0
2        1.0            0.0            1.0
3        1.0            0.0            0.0
4        1.0            0.0            0.0
5        1.0            1.0            0.0
6        1.0            0.0            1.0
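To see why this matrix ties ANOVA to linear regression, it can help to fit ordinary least squares against it. The sketch below assumes a made-up response vector `y` (these values are hypothetical and are not from the original post) and uses `numpy.linalg.lstsq` for the fit. With treatment coding, the intercept should equal the mean of the reference group A, and the other two coefficients should equal the differences in group means, B minus A and C minus A.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

covariates = pd.DataFrame(
    {'category': ['A', 'B', 'C', 'A', 'A', 'B', 'C']})
design_matrix = dmatrix("category", covariates, return_type='dataframe')

# Hypothetical responses, one per observation (illustrative values only).
y = np.array([1.0, 3.0, 6.0, 2.0, 3.0, 5.0, 8.0])

# Ordinary least squares fit of y on the design matrix.
coef, _, _, _ = np.linalg.lstsq(design_matrix.to_numpy(), y, rcond=None)

# Group means computed directly, for comparison with the coefficients.
group_means = pd.Series(y).groupby(covariates['category']).mean()

print(coef)
print(group_means['A'],
      group_means['B'] - group_means['A'],
      group_means['C'] - group_means['A'])
```

Both printed lines should agree up to floating point error, which is the A vs B and A vs C comparison expressed as regression coefficients.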

What if we want to use B as a reference instead? We can modify the formula as follows

formula = "C(category, Treatment('B'))"

where C() just denotes a categorical variable and Treatment('B') indicates that the new reference will be set to B. This will generate the following design matrix

   Intercept  C(category, Treatment('B'))[T.A]  C(category, Treatment('B'))[T.C]
0        1.0                               1.0                               0.0
1        1.0                               0.0                               0.0
2        1.0                               0.0                               1.0
3        1.0                               1.0                               0.0
4        1.0                               1.0                               0.0
5        1.0                               0.0                               0.0
6        1.0                               0.0                               1.0
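For completeness, here is a self-contained sketch that builds the B-referenced matrix end to end. It spells the reference out with the keyword form `Treatment(reference='B')`, which should be equivalent to the positional `Treatment('B')` used in the formula above; the variable name `design_b` is just an illustrative choice.

```python
import pandas as pd
from patsy import dmatrix

covariates = pd.DataFrame(
    {'category': ['A', 'B', 'C', 'A', 'A', 'B', 'C']})

# Treatment coding with B as the reference level instead of the default A.
formula = "C(category, Treatment(reference='B'))"
design_b = dmatrix(formula, covariates, return_type='dataframe')

# The B rows are now all zeros outside the intercept column.
is_b = (covariates['category'] == 'B').to_numpy()
print(design_b.columns.tolist())
print(design_b[is_b])
```

Only the labeling of the columns changes: a model fitted on either matrix spans the same space, but the coefficients are now read as A vs B and C vs B.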

As you can see, everything is now encoded with B as the reference. Linear regression is one of the most widely used tools in statistics and machine learning, and specifying the right design matrix can greatly increase your expressive power as a data analyst.
