
Encoding design matrices in Patsy


Some of us have seen the connections between ANOVA and linear regression (see here for a more detailed explanation).  To draw the equivalence between the two, we need a design matrix.

For instance, suppose we have a series of observations taking the values A, B, and C, as follows
\[ \{A, B, C, A, A, B, C\}\]

If we wanted to reformulate this into an ANOVA-style test, we could compare A vs B and A vs C.  We can encode that design matrix as follows

\[ \begin{bmatrix} 0 & 1 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 1 \\ \end{bmatrix} \]

In the first row of the matrix, only the entries with B are marked with a 1 (since we are comparing A and B). In the second row, only the entries with C are marked.  Since we have implicitly set A as the reference, there is no row corresponding to A.
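The encoding described above can be reproduced by hand before reaching for patsy. The following is a minimal sketch using plain Python and NumPy; the variable names are made up for illustration, and the layout (one row per comparison, one column per observation) follows the matrix in the text rather than patsy's convention.

```python
import numpy as np

observations = ["A", "B", "C", "A", "A", "B", "C"]
reference = "A"

# Every level except the reference gets its own indicator row.
levels = sorted(set(observations) - {reference})

# Row i holds a 1 wherever the observation equals levels[i], else 0.
indicators = np.array(
    [[int(obs == level) for obs in observations] for level in levels])

print(levels)      # ['B', 'C']
print(indicators)
# [[0 1 0 0 0 1 0]
#  [0 0 1 0 0 0 1]]
```

Observations equal to the reference are all zeros in every row, which is why A never needs a row of its own.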

If we want to explicitly derive this design matrix in patsy, we can do it as follows

import pandas as pd
from patsy import dmatrix
formula = "category"
covariates = pd.DataFrame(
    {'category': ['A', 'B', 'C', 'A', 'A', 'B', 'C']})
design_matrix = dmatrix(formula, covariates, 
    return_type='dataframe')
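This will give the following design matrix (note that patsy puts one observation per row, so it is the transpose of the matrix above, plus an intercept column)
>>> design_matrix
   Intercept  category[T.B]  category[T.C]
0        1.0            0.0            0.0
1        1.0            1.0            0.0
2        1.0            0.0            1.0
3        1.0            0.0            0.0
4        1.0            0.0            0.0
5        1.0            1.0            0.0
6        1.0            0.0            1.0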
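To make the ANOVA connection concrete, here is a rough sketch that regresses a response on this design matrix with ordinary least squares. The response values are invented purely for illustration, and `np.linalg.lstsq` is used here as a stand-in for whatever regression routine you prefer. With treatment coding, the intercept should come out as the mean of the reference group A, and the other coefficients as the B vs A and C vs A differences in group means.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

covariates = pd.DataFrame(
    {'category': ['A', 'B', 'C', 'A', 'A', 'B', 'C']})
design_matrix = dmatrix("category", covariates, return_type='dataframe')

# Hypothetical response values, made up for this example only.
response = np.array([1.0, 3.0, 6.0, 2.0, 3.0, 5.0, 8.0])

# Ordinary least squares: find coefficients minimizing ||X b - y||^2.
coefficients, *_ = np.linalg.lstsq(
    design_matrix.to_numpy(), response, rcond=None)
fit = pd.Series(coefficients, index=design_matrix.columns)
print(fit)

# Group means for comparison: A = 2.0, B = 4.0, C = 7.0, so the fit is
# approximately Intercept = 2.0, category[T.B] = 2.0, category[T.C] = 5.0.
group_means = pd.Series(response).groupby(covariates['category']).mean()
print(group_means)
```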
What if we want to use B as a reference instead? We can modify the formula as follows
formula = "C(category, Treatment('B'))"
where C() just denotes a categorical variable and Treatment('B') indicates that the new reference will be set to B. This will generate the following design matrix
   Intercept  C(category, Treatment('B'))[T.A]  C(category, Treatment('B'))[T.C]
0        1.0                               1.0                               0.0
1        1.0                               0.0                               0.0
2        1.0                               0.0                               1.0
3        1.0                               1.0                               0.0
4        1.0                               1.0                               0.0
5        1.0                               0.0                               0.0
6        1.0                               0.0                               1.0
As you can see, everything is now encoded with B as the reference. Linear regression is one of the most widely used tools in statistics and machine learning, and specifying the right design matrix can greatly increase your expressive power as a data analyst.
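If you switch references often, one option is to build the formula string programmatically. The sketch below wraps that in a small helper; `design_with_reference` is a hypothetical name invented for this example and is not part of patsy. It relies only on `dmatrix`, `C()`, and `Treatment(reference=...)`.

```python
import pandas as pd
from patsy import dmatrix


def design_with_reference(data, column, reference):
    """Treatment-code `column` with `reference` as the baseline level.

    Hypothetical helper for illustration; not part of patsy.
    """
    # {reference!r} inserts the level with quotes, e.g. 'B'.
    formula = f"C({column}, Treatment(reference={reference!r}))"
    return dmatrix(formula, data, return_type='dataframe')


covariates = pd.DataFrame(
    {'category': ['A', 'B', 'C', 'A', 'A', 'B', 'C']})

# Print the column names produced for each choice of reference level.
for level in ['A', 'B', 'C']:
    matrix = design_with_reference(covariates, 'category', level)
    print(level, list(matrix.columns))
```

Whichever level is chosen as the reference drops out of the columns, and its observations are zero in every non-intercept column.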
