Some of us have seen the connection between ANOVA and linear regression (see here for a more detailed explanation). To draw the equivalence between ANOVA and linear regression, we need a design matrix. For instance, suppose we have a series of observations labeled A, B, and C:
\[ \{A, B, C, A, A, B, C\} \]
To reformulate this as an ANOVA-style test, we can compare A vs. B and A vs. C. We can encode those comparisons in the following design matrix, with one column per observation:
\[ \begin{bmatrix} 0 & 1 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 1 \end{bmatrix} \]
In the first row of the matrix, only the entries corresponding to B are set to 1 (since we are comparing A and B). In the second row, only the entries corresponding to C are set to 1. Since we have implicitly set A as the reference, there is no row corresponding to A. If we want to explicitly derive this design matrix in patsy, we can do it as follows import pa
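To make the encoding concrete, here is a minimal sketch in plain Python of how the indicator rows could be built from the labels. It is not the patsy code; the helper name `treatment_design` and its `reference` argument are made up for illustration, and it assumes the rows are ordered alphabetically by level.

```python
def treatment_design(labels, reference):
    """Build one 0/1 indicator row per non-reference level.

    Hypothetical helper for illustration only: each row marks the
    observations that carry that level; the reference level gets no row.
    """
    levels = sorted(set(labels) - {reference})
    rows = [[1 if label == level else 0 for label in labels] for level in levels]
    return levels, rows


observations = ["A", "B", "C", "A", "A", "B", "C"]
levels, design = treatment_design(observations, reference="A")

# Print each comparison level next to its indicator row.
for level, row in zip(levels, design):
    print(level, row)
```

Each row has a 1 exactly where the corresponding label appears in the observation list, so A is represented only implicitly, as the observations where every row is 0.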
Today, I'll be covering the BIOM file format, a standardized file format for storing sequence counts in samples. This file format is typically used in the biological sciences, most notably with amplicon sequencing technologies such as 16S sequencing. For those of you who aren't as familiar with these technologies: when we conduct survey studies, we like to get a broad overview of the microbes living within a raw sample. But we don't need to sequence a bacterium's entire genome to identify what it is. We can just sequence a housekeeping gene that every bacterium has, the 16S ribosomal RNA gene. It's a similar strategy to the one deployed in court. When DNA evidence is presented in the courtroom, only a tiny, tiny portion of an individual's DNA is actually required to uniquely identify that person. But moving on. The BIOM file format was originally designed to store counts of 16S sequences across samples, but it has grown to become a more generalized file fo