Today, I'll be covering the BIOM file format, a standardized file format for storing sequence counts in samples. This file format is typically used in the biological sciences, most notably in amplicon sequencing technologies, such as 16S sequencing.
For those of you that aren't as familiar with these technologies. When we conduct survey studies, we like to get a broad overview of the microbes that are living within a raw sample. But we don't need to sequence the entire bacteria's genome to identify what the bacteria is. We can just a sequence a housekeeping gene that every bacteria as - the 16S ribosome.
Its a similar strategy deployed in court. When DNA evidence is presented in the court room, only a tiny, tiny portion of an individuals DNA is actually required to uniquely identify that person.
But moving on.
The BIOM file format was originally designed to store counts of 16S sequences across samples, but it has grown to become a more generalized file format that can store any type of counts, whether it be metabolites, genomes, genes, etc.
You might be thinking - why on Earth would you be reinventing a file format, when there are so many existing table formats accomplishing the exact same thing?!?
The reasoning is two-fold
1. These matrices are incredibly sparse. Bacteria tend to live in very specific environments, so you end up with a lot of samples not sharing a whole lot of bacteria. Your average biom table for a 16S study has somewhere between 10-20% non-zeros. The same is true to many other biological datsets. If we store all of the zeroes explicitly on disk, that's a down right waste of space, and storage would increase 10x fold. Forget about attaching these tables in emails.
2. The metadata associated with samples and features (i.e. bacteria) are complex. For samples, you need information like pH, temperature, sample_type, etc. But for bacteria, there is a whole sleuth of information, such as taxonomy.
So you are effectively dealing with a labeled matrix, with 2 separate dimension for metadata. BIOM was designed to address this very specific problem of storing all of this information into a single file.
What does this object look like underneath the hood?
On disk, it can be either represented as a tab-delimited file, a json file or a hdf5 file. It is worth checking out the biom file format conversions here.
In memory, it really just boils down to numpy arrays and scipy sparse matrices.
To get started using this file-format, I'd recommend booting up a conda environment. See my previous tutorial on how to build a conda environment. In your conda environment, you can pip install the biom format as follows
Once you have it installed, you can jump right into Python. I have a little function below designed to unpack your biom table into two pandas dataframes. Here, we won't be worrying about sample metadata. But we will be extracting the bacterial metadata and the sample counts.
Let's take this function for a spin, and see it in action.
I'm going to use an example in the biom documentation
We'll start by loading the necessary modules.
We'll create a biom object for this example. If you are reading a biom object from a file, check out the load_table function
The biom object by itself is hard to manipulate, but not for long. We can unpack the important information as follows
So now, you can easily manipulate your data as pandas dataframes. We just needed to access the internals of the biom object to get to this step.
For those of you that aren't as familiar with these technologies. When we conduct survey studies, we like to get a broad overview of the microbes that are living within a raw sample. But we don't need to sequence the entire bacteria's genome to identify what the bacteria is. We can just a sequence a housekeeping gene that every bacteria as - the 16S ribosome.
Its a similar strategy deployed in court. When DNA evidence is presented in the court room, only a tiny, tiny portion of an individuals DNA is actually required to uniquely identify that person.
But moving on.
The BIOM file format was originally designed to store counts of 16S sequences across samples, but it has grown to become a more generalized file format that can store any type of counts, whether it be metabolites, genomes, genes, etc.
You might be thinking - why on Earth would you be reinventing a file format, when there are so many existing table formats accomplishing the exact same thing?!?
The reasoning is two-fold
1. These matrices are incredibly sparse. Bacteria tend to live in very specific environments, so you end up with a lot of samples not sharing a whole lot of bacteria. Your average biom table for a 16S study has somewhere between 10-20% non-zeros. The same is true to many other biological datsets. If we store all of the zeroes explicitly on disk, that's a down right waste of space, and storage would increase 10x fold. Forget about attaching these tables in emails.
2. The metadata associated with samples and features (i.e. bacteria) are complex. For samples, you need information like pH, temperature, sample_type, etc. But for bacteria, there is a whole sleuth of information, such as taxonomy.
So you are effectively dealing with a labeled matrix, with 2 separate dimension for metadata. BIOM was designed to address this very specific problem of storing all of this information into a single file.
What does this object look like underneath the hood?
On disk, it can be either represented as a tab-delimited file, a json file or a hdf5 file. It is worth checking out the biom file format conversions here.
In memory, it really just boils down to numpy arrays and scipy sparse matrices.
To get started using this file-format, I'd recommend booting up a conda environment. See my previous tutorial on how to build a conda environment. In your conda environment, you can pip install the biom format as follows
pip install biom-format
Once you have it installed, you can jump right into Python. I have a little function below designed to unpack your biom table into two pandas dataframes. Here, we won't be worrying about sample metadata. But we will be extracting the bacterial metadata and the sample counts.
def convert_biom_to_pandas(table): """ Unpacks biom table into two pandas dataframes. The first dataframe will contain the count information for features and samples. The second datafram will contain taxonomy information for all of the OTUs. Parameters ---------- table : biom.Table Returns ------- pd.DataFrame Contingency table of counts where samples correspond to rows and columns correspond to features (i.e. OTUs) pd.DataFrame A mapping of OTU names to taxonomic ids """ feature_table = pd.DataFrame(np.array(table.matrix_data.todense()).T, index=table.ids(axis='sample'), columns=table.ids(axis='observation')) feature_ids = table.ids(axis='observation') mapping = {i: table.metadata(id=i, axis='observation')['taxonomy'] for i in feature_ids} # modify below as necessary. # There are typically 7 levels of taxonomy. taxonomy = pd.DataFrame(mapping, index=['kingdom', 'phylum']).T
Let's take this function for a spin, and see it in action.
I'm going to use an example in the biom documentation
We'll start by loading the necessary modules.
>>> import numpy as np >>> import pandas as pd >>> from biom.table import Table
We'll create a biom object for this example. If you are reading a biom object from a file, check out the load_table function
>>> data = np.arange(40).reshape(10, 4) >>> sample_ids = ['S%d' % i for i in range(4)] >>> observ_ids = ['O%d' % i for i in range(10)] >>> sample_metadata = [{'environment': 'A'}, {'environment': 'B'}, ... {'environment': 'A'}, {'environment': 'B'}] >>> observ_metadata = [{'taxonomy': ['Bacteria', 'Firmicutes']}, ... {'taxonomy': ['Bacteria', 'Firmicutes']}, ... {'taxonomy': ['Bacteria', 'Proteobacteria']}, ... {'taxonomy': ['Bacteria', 'Proteobacteria']}, ... {'taxonomy': ['Bacteria', 'Proteobacteria']}, ... {'taxonomy': ['Bacteria', 'Bacteroidetes']}, ... {'taxonomy': ['Bacteria', 'Bacteroidetes']}, ... {'taxonomy': ['Bacteria', 'Firmicutes']}, ... {'taxonomy': ['Bacteria', 'Firmicutes']}, ... {'taxonomy': ['Bacteria', 'Firmicutes']}] >>> table = Table(data, observ_ids, sample_ids, observ_metadata, ... sample_metadata, table_id='Example Table')
The biom object by itself is hard to manipulate, but not for long. We can unpack the important information as follows
>>> count_table, taxonomy = convert_biom_to_pandas(table)And we now have two separate pandas dataframes. Check it out!
>>> taxonomy kingdom phylum O0 Bacteria Firmicutes O1 Bacteria Firmicutes O2 Bacteria Proteobacteria O3 Bacteria Proteobacteria O4 Bacteria Proteobacteria O5 Bacteria Bacteroidetes O6 Bacteria Bacteroidetes O7 Bacteria Firmicutes O8 Bacteria Firmicutes O9 Bacteria Firmicutes
>>> count_table O0 O1 O2 O3 O4 O5 O6 O7 O8 O9 S0 0.0 4.0 8.0 12.0 16.0 20.0 24.0 28.0 32.0 36.0 S1 1.0 5.0 9.0 13.0 17.0 21.0 25.0 29.0 33.0 37.0 S2 2.0 6.0 10.0 14.0 18.0 22.0 26.0 30.0 34.0 38.0 S3 3.0 7.0 11.0 15.0 19.0 23.0 27.0 31.0 35.0 39.0
So now, you can easily manipulate your data as pandas dataframes. We just needed to access the internals of the biom object to get to this step.
Fabulous post admin, it was too good and helpful. Waiting for more updates.
ReplyDeleteccna course in Chennai
Python Training in Chennai
R Programming Training in Chennai
AWS Training in Chennai
DevOps Training in Chennai
Angularjs Training in Chennai
RPA Training in Chennai
Blue Prism Training in Chennai
Data Science Course in Chennai
Wow it is really wonderful and awesome thus it is very much useful for me to understand many concepts and helped me a lot. it is really explainable very well and i got more information from your blog.
ReplyDeleteCyber Security Training Course in Chennai | Certification | Cyber Security Online Training Course | Ethical Hacking Training Course in Chennai | Certification | Ethical Hacking Online Training Course |
CCNA Training Course in Chennai | Certification | CCNA Online Training Course | RPA Robotic Process Automation Training Course in Chennai | Certification | RPA Training Course Chennai | SEO Training in Chennai | Certification | SEO Online Training Course