Skip to main content

Behind the scenes with BIOM tables

Today, I'll be covering the BIOM file format, a standardized file format for storing sequence counts in samples.  This file format is typically used in the biological sciences, most notably in amplicon sequencing technologies, such as 16S sequencing.

For those of you that aren't as familiar with these technologies.  When we conduct survey studies, we like to get a broad overview of the microbes that are living within a raw sample.  But we don't need to sequence the entire bacteria's genome to identify what the bacteria is.  We can just a sequence a housekeeping gene that every bacteria as - the 16S ribosome.

Its a similar strategy deployed in court.  When DNA evidence is presented in the court room, only a tiny, tiny portion of an individuals DNA is actually required to uniquely identify that person.

But moving on.

The BIOM file format was originally designed to store counts of 16S sequences across samples, but it has grown to become a more generalized file format that can store any type of counts, whether it be metabolites, genomes, genes, etc.

You might be thinking - why on Earth would you be reinventing a file format, when there are so many existing table formats accomplishing the exact same thing?!?

The reasoning is two-fold

1. These matrices are incredibly sparse.  Bacteria tend to live in very specific environments, so you end up with a lot of samples not sharing a whole lot of bacteria.  Your average biom table for a 16S study has somewhere between 10-20% non-zeros.  The same is true to many other biological datsets.  If we store all of the zeroes explicitly on disk, that's a down right waste of space, and storage would increase 10x fold.  Forget about attaching these tables in emails.
2. The metadata associated with samples and features (i.e. bacteria) are complex.  For samples, you need information like pH, temperature, sample_type, etc.  But for bacteria, there is a whole sleuth of information, such as taxonomy.

So you are effectively dealing with a labeled matrix, with 2 separate dimension for metadata.  BIOM was designed to address this very specific problem of storing all of this information into a single file.

What does this object look like underneath the hood?

On disk, it can be either represented as a tab-delimited file, a json file or a hdf5 file.  It is worth checking out the biom file format conversions here.

In memory, it really just boils down to numpy arrays and scipy sparse matrices.

To get started using this file-format, I'd recommend booting up a conda environment.  See my previous tutorial on how to build a conda environment. In your conda environment, you can pip install the biom format as follows

pip install biom-format

Once you have it installed, you can jump right into Python.  I have a little function below designed to unpack your biom table into two pandas dataframes.  Here, we won't be worrying about sample metadata.  But we will be extracting the bacterial metadata and the sample counts.

def convert_biom_to_pandas(table):
    """ Unpacks biom table into two pandas dataframes.
    
    The first dataframe will contain the count information for 
    features and samples. The second datafram will contain taxonomy 
    information for all of the OTUs.
    
    Parameters
    ----------
    table : biom.Table
    
    Returns
    -------
    pd.DataFrame
        Contingency table of counts where samples correspond 
        to rows and columns correspond to features (i.e. OTUs)
    pd.DataFrame
        A mapping of OTU names to taxonomic ids
    """

    feature_table = pd.DataFrame(np.array(table.matrix_data.todense()).T,
                             index=table.ids(axis='sample'),
                             columns=table.ids(axis='observation'))
    feature_ids = table.ids(axis='observation')
    mapping = {i: table.metadata(id=i, axis='observation')['taxonomy'] for i in feature_ids}
    # modify below as necessary.  
    # There are typically 7 levels of taxonomy.
    taxonomy = pd.DataFrame(mapping, 
                            index=['kingdom', 'phylum']).T

Let's take this function for a spin, and see it in action.

I'm going to use an example in the biom documentation

We'll start by loading the necessary modules.
>>> import numpy as np
>>> import pandas as pd
>>> from biom.table import Table

We'll create a biom object for this example. If you are reading a biom object from a file, check out the load_table function

>>> data = np.arange(40).reshape(10, 4)
>>> sample_ids = ['S%d' % i for i in range(4)]
>>> observ_ids = ['O%d' % i for i in range(10)]
>>> sample_metadata = [{'environment': 'A'}, {'environment': 'B'},
...                    {'environment': 'A'}, {'environment': 'B'}]
>>> observ_metadata = [{'taxonomy': ['Bacteria', 'Firmicutes']},
...                    {'taxonomy': ['Bacteria', 'Firmicutes']},
...                    {'taxonomy': ['Bacteria', 'Proteobacteria']},
...                    {'taxonomy': ['Bacteria', 'Proteobacteria']},
...                    {'taxonomy': ['Bacteria', 'Proteobacteria']},
...                    {'taxonomy': ['Bacteria', 'Bacteroidetes']},
...                    {'taxonomy': ['Bacteria', 'Bacteroidetes']},
...                    {'taxonomy': ['Bacteria', 'Firmicutes']},
...                    {'taxonomy': ['Bacteria', 'Firmicutes']},
...                    {'taxonomy': ['Bacteria', 'Firmicutes']}]
>>> table = Table(data, observ_ids, sample_ids, observ_metadata,
...               sample_metadata, table_id='Example Table')

The biom object by itself is hard to manipulate, but not for long.  We can unpack the important information as follows


>>> count_table, taxonomy = convert_biom_to_pandas(table)
And we now have two separate pandas dataframes. Check it out!
>>> taxonomy
kingdom          phylum
O0  Bacteria      Firmicutes
O1  Bacteria      Firmicutes
O2  Bacteria  Proteobacteria
O3  Bacteria  Proteobacteria
O4  Bacteria  Proteobacteria
O5  Bacteria   Bacteroidetes
O6  Bacteria   Bacteroidetes
O7  Bacteria      Firmicutes
O8  Bacteria      Firmicutes
O9  Bacteria      Firmicutes

>>> count_table
     O0   O1    O2    O3    O4    O5    O6    O7    O8    O9
S0  0.0  4.0   8.0  12.0  16.0  20.0  24.0  28.0  32.0  36.0
S1  1.0  5.0   9.0  13.0  17.0  21.0  25.0  29.0  33.0  37.0
S2  2.0  6.0  10.0  14.0  18.0  22.0  26.0  30.0  34.0  38.0
S3  3.0  7.0  11.0  15.0  19.0  23.0  27.0  31.0  35.0  39.0

So now, you can easily manipulate your data as pandas dataframes.  We just needed to access the internals of the biom object to get to this step.

Comments

Post a Comment

Popular posts from this blog

ANCOM explained

In case you have not heard, ANCOM is another differential abundance test, designed specifically for tweezing out differentially abundance bacteria between groups.  Now, note that there are a ton of differential abundance techniques out there.  And one might ask why are there so many people focused on this seemingly simple problem. It turns out that this problem is actually impossible.   And this is rooted into the issue of relative abundances.  A change of 1 species between samples can be also explained by the change of all of the other species between samples.   Let's take a look at simple, concrete example. Here we have ten species, and 1 species doubles after the first time point.  If we know the original abundances of this species, it's pretty clear that species 1 doubled.  However, if we can only obtain the proportions of species within the environment, the message isn't so clear. Above are the proportions of the species in the exact same environment

Encoding design matrices in Patsy

Some of us have seen the connections between ANOVA and linear regression (see here  for more detailed explanation).  In order to draw the equivalence between ANOVA and linear regression, we need a design matrix. For instance if we have a series of observations A, B, C as follows \[ \{A, B, C, A, A, B, C\}\] If we wanted to reformulate this into ANOVA-style test, we can do a comparison between A vs B, and A vs C.  We can encode that design matrix as follows \begin{bmatrix} 0 & 1 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 1 & 0 & 1 \\ \end{bmatrix} In the first row of the matrix, only the entries with B are labeled (since we are doing a comparison between A and B. In the second row, only the entries with C are labeled.  Since we have set A to implicitly be the reference, there is no row corresponding to A. If we want to explicitly derive this design matrix in patsy, we can do it as follows import pa