Biological data is stored in a variety of specialized databases, each designed to handle specific types of information. These databases are critical for bioinformatics research and enable scientists to access, analyze, and share biological information effectively.
Here's a breakdown of the main database categories, data formats, and storage technologies:
Sequence Databases
These databases focus on storing nucleotide (DNA, RNA) and amino acid (protein) sequences.
- GenBank: A widely used, publicly accessible database maintained by the National Center for Biotechnology Information (NCBI) in the USA. It stores nucleotide sequences and associated information.
- European Nucleotide Archive (ENA): A comprehensive nucleotide sequence archive maintained by EMBL-EBI. It stores raw sequencing data, assembled sequences, and functional annotation, and exchanges data with GenBank through the International Nucleotide Sequence Database Collaboration (INSDC).
- UniProt: A database of protein sequences and their functional annotation, maintained by the UniProt Consortium.
Structure Databases
These complement the sequence databases by storing the 3D structures of biomolecules rather than their sequences.
- Protein Data Bank (PDB): The primary repository for the 3D structural data of large biological molecules, such as proteins and nucleic acids. This data is typically obtained through X-ray crystallography, NMR spectroscopy, or electron microscopy.
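Records in these archives can also be retrieved programmatically. As a minimal sketch, the helpers below build download URLs for NCBI's E-utilities `efetch` endpoint (GenBank nucleotide records) and for the RCSB PDB file server. The function names and the example identifiers are illustrative choices, not part of either service, and nothing is actually fetched here.

```python
from urllib.parse import urlencode

# Base endpoints: NCBI E-utilities "efetch" and the RCSB PDB file server.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
RCSB_DOWNLOAD_URL = "https://files.rcsb.org/download"


def genbank_record_url(accession: str, rettype: str = "gb") -> str:
    """Build an efetch URL for a nucleotide record.

    rettype "gb" requests the GenBank flat file, "fasta" the FASTA sequence.
    """
    params = {
        "db": "nuccore",      # NCBI nucleotide database
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def pdb_entry_url(pdb_id: str, file_format: str = "cif") -> str:
    """Build a download URL for a PDB entry ("cif" = mmCIF, "pdb" = legacy format)."""
    if len(pdb_id) != 4:
        raise ValueError("this sketch assumes classic four-character PDB IDs")
    return f"{RCSB_DOWNLOAD_URL}/{pdb_id.upper()}.{file_format}"


if __name__ == "__main__":
    # Example identifiers, chosen only for illustration.
    print(genbank_record_url("NM_000546", rettype="fasta"))
    print(pdb_entry_url("1crn"))
```

The resulting URLs can be passed to any HTTP client. For heavy use, NCBI asks clients to respect its request-rate guidelines, so a real script would add throttling and error handling.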
Genome Databases
These databases contain complete or partial genome sequences of organisms.
- Examples range from organism-specific resources (e.g., an E. coli genome database) to multi-species resources such as Ensembl and the UCSC Genome Browser.
Expression Databases
These databases store information about gene expression levels, often obtained from microarray or RNA-Seq experiments.
- GEO (Gene Expression Omnibus): A public repository at NCBI for gene expression data.
- ArrayExpress: A comparable functional genomics data archive maintained by the European Bioinformatics Institute (EMBL-EBI).
Pathway Databases
These databases contain information about biological pathways and networks, such as metabolic pathways or signaling pathways.
- KEGG (Kyoto Encyclopedia of Genes and Genomes): A comprehensive database that integrates genomic, chemical, and systems information.
- Reactome: An open-source, curated, and peer-reviewed pathway database.
Data Formats
Biological data is stored in various formats, depending on the type of data and the database being used. Common formats include:
- FASTA: A text-based format for representing nucleotide or amino acid sequences.
- FASTQ: A text-based format for storing nucleotide sequences (typically sequencing reads) together with a per-base quality score for each position.
- GenBank format: A comprehensive format for storing sequence information, annotations, and metadata.
- XML (Extensible Markup Language): A flexible format for storing structured data.
- JSON (JavaScript Object Notation): A lightweight format for data exchange.
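To make the two sequence formats concrete, here is a minimal parsing sketch. It assumes well-formed input, four-line FASTQ records, and the Phred+33 quality encoding (quality = ASCII code minus 33); the function names and sample records are made up for illustration. In practice, an established library such as Biopython is usually the better choice.

```python
from typing import Dict, Iterator, List, Optional, Tuple


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its sequence, joining wrapped lines."""
    records: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {header: "".join(parts) for header, parts in records.items()}


def parse_fastq(text: str) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) from four-line FASTQ records."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Phred+33 assumption: each quality character encodes (score + 33).
        yield header[1:], seq, [ord(c) - 33 for c in qual]


# Made-up sample data for illustration only.
FASTA_TEXT = ">seq1 example\nACGT\nACGT\n>seq2\nMKV\n"
FASTQ_TEXT = "@read1\nACGT\n+\nII5!\n"

if __name__ == "__main__":
    print(parse_fasta(FASTA_TEXT))
    print(list(parse_fastq(FASTQ_TEXT)))
```

The FASTA parser concatenates wrapped sequence lines under each header, while the FASTQ parser converts each quality character into an integer score, which is what distinguishes the two formats.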
Storage Technologies
The underlying storage technologies vary, but often involve relational databases (e.g., MySQL, PostgreSQL) or NoSQL databases, along with file systems for storing large datasets like raw sequencing reads. Cloud-based storage solutions are also increasingly common.
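As a rough illustration of the relational approach, the sketch below stores a few sequence records in a table and queries them with SQL. It uses Python's built-in `sqlite3` module as a stand-in for a server database such as MySQL or PostgreSQL; the schema, accessions, and sequences are hypothetical and do not reflect how any particular public database is organized.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical 'sequences' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE sequences (
            accession TEXT PRIMARY KEY,
            organism  TEXT NOT NULL,
            molecule  TEXT NOT NULL CHECK (molecule IN ('DNA', 'RNA', 'protein')),
            sequence  TEXT NOT NULL
        )
        """
    )
    # Made-up records; the accessions are not real database identifiers.
    conn.executemany(
        "INSERT INTO sequences VALUES (?, ?, ?, ?)",
        [
            ("EX0001", "Escherichia coli", "DNA", "ATGACGTA"),
            ("EX0002", "Homo sapiens", "protein", "MKVLA"),
            ("EX0003", "Escherichia coli", "RNA", "AUGC"),
        ],
    )
    conn.commit()
    return conn


def sequence_lengths(conn: sqlite3.Connection, organism: str) -> List[Tuple[str, int]]:
    """Return (accession, sequence length) pairs for one organism, sorted by accession."""
    cursor = conn.execute(
        "SELECT accession, LENGTH(sequence) FROM sequences "
        "WHERE organism = ? ORDER BY accession",
        (organism,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    db = build_demo_db()
    print(sequence_lengths(db, "Escherichia coli"))
    db.close()
```

Keeping metadata in indexed columns makes lookups by accession or organism cheap. Very large payloads, such as raw sequencing reads, would more typically live in files, with the database holding only their paths and metadata.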
In summary, biological data is stored in specialized databases that are tailored to the type of information being managed. These databases use standardized formats and storage technologies to ensure data accessibility, integrity, and interoperability.