Some of us have seen the connections between ANOVA and linear regression (see here for a more detailed explanation). To draw the equivalence between ANOVA and linear regression, we need a design matrix.
For instance, suppose we have a series of observations taking the values A, B, and C:
\[ \{A, B, C, A, A, B, C\}\]
If we wanted to reformulate this into an ANOVA-style test, we could compare A vs B and A vs C, and encode those comparisons in a design matrix.
In the first row of the matrix, only the entries with B are labeled (since we are doing a comparison between A and B). In the second row, only the entries with C are labeled. Since we have set A to implicitly be the reference, there is no row corresponding to A.
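Written out for the seven observations above, and assuming the columns follow the order of the observations (a layout assumption), the encoding described here would look like this, with the first row marking B and the second row marking C:

\[
\begin{pmatrix}
0 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1
\end{pmatrix}
\]

Note that patsy lays the same information out the other way around: one row per observation and one column per comparison, plus an intercept column.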
If we want to explicitly derive this design matrix in patsy, we can do it as follows
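    import pandas as pd
    from patsy import dmatrix

    # A bare column name is enough: patsy treats string columns as categorical.
    formula = "category"
    covariates = pd.DataFrame(
        {'category': ['A', 'B', 'C', 'A', 'A', 'B', 'C']})
    # return_type='dataframe' yields a labeled pandas DataFrame.
    design_matrix = dmatrix(formula, covariates, return_type='dataframe')

This will give the following design matrix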
    >>> design_matrix
       Intercept  category[T.B]  category[T.C]
    0        1.0            0.0            0.0
    1        1.0            1.0            0.0
    2        1.0            0.0            1.0
    3        1.0            0.0            0.0
    4        1.0            0.0            0.0
    5        1.0            1.0            0.0
    6        1.0            0.0            1.0

What if we want to use B as a reference instead? We can modify the formula as follows
    formula = "C(category, Treatment('B'))"

where C() marks the variable as categorical and Treatment('B') sets the reference level to B. This will generate the following design matrix
       Intercept  C(category, Treatment('B'))[T.A]  C(category, Treatment('B'))[T.C]
    0        1.0                               1.0                               0.0
    1        1.0                               0.0                               0.0
    2        1.0                               0.0                               1.0
    3        1.0                               1.0                               0.0
    4        1.0                               1.0                               0.0
    5        1.0                               0.0                               0.0
    6        1.0                               0.0                               1.0

As you can see, everything is now encoded with B as the reference. Linear regression is one of the most widely used tools in statistics and machine learning, and specifying the right design matrix can greatly increase your expressive power as a data analyst.
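To make the link back to ANOVA concrete, here is a minimal sketch that builds the same treatment coding by hand with NumPy (rather than patsy, so the encoding is fully visible) and fits ordinary least squares. The helper `treatment_code` and the response values `y` are hypothetical, made up purely for illustration. Under those assumptions, the intercept should come out as the mean of the reference group and each remaining coefficient as the difference between a group's mean and the reference mean, which is exactly the comparison a one-way ANOVA is built on.

```python
import numpy as np


def treatment_code(labels, reference):
    """Hand-rolled treatment coding (hypothetical helper, not a patsy API).

    Returns column names and a matrix with an intercept column followed by
    one 0/1 indicator column per non-reference level.
    """
    others = [lv for lv in sorted(set(labels)) if lv != reference]
    names = ["Intercept"] + [f"[T.{lv}]" for lv in others]
    rows = [[1.0] + [1.0 if lab == lv else 0.0 for lv in others]
            for lab in labels]
    return names, np.array(rows)


labels = ["A", "B", "C", "A", "A", "B", "C"]
# Hypothetical responses, invented for this illustration only.
y = np.array([1.0, 3.0, 6.0, 2.0, 3.0, 5.0, 8.0])

# Reference level A: columns are Intercept, [T.B], [T.C].
names_a, X_a = treatment_code(labels, reference="A")
coef_a, *_ = np.linalg.lstsq(X_a, y, rcond=None)

# Reference level B: columns are Intercept, [T.A], [T.C].
names_b, X_b = treatment_code(labels, reference="B")
coef_b, *_ = np.linalg.lstsq(X_b, y, rcond=None)

# Per-group means, for comparison with the fitted coefficients.
labels_arr = np.array(labels)
group_mean = {lv: y[labels_arr == lv].mean() for lv in ["A", "B", "C"]}

print(dict(zip(names_a, coef_a.round(6))))
print(dict(zip(names_b, coef_b.round(6))))
print(group_mean)
```

Changing the reference level changes what each coefficient means, but not the fitted values: both design matrices span the same column space, so the two fits describe the same group means.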