SO
Introduction to the classification
The Sequence Ontology (SO, http://www.sequenceontology.org/) is a structured controlled vocabulary that aims to facilitate the exchange, analysis and organization of genomic data by humans and machines alike. For its development, the ontology rallied contributions from different communities such as the Generic Model Organism Database (GMOD), the Sanger Institute and the European Bioinformatics Institute (EBI) group as well as from widely used model organism databases (e.g., WormBase, FlyBase, Mouse Genome Informatics group).
The SO is designed as a tool that unifies the way in which sequence annotations are described. To achieve this goal, it relies on a set of terms that describe parts of the sequence annotations, as well as the relationships between these terms. Using a common controlled vocabulary during the annotation process enables comparison between annotations from different projects and eases their downstream analysis.
Information for use in data science
Each term provided by the Sequence Ontology (SO) is linked to an accession (or concept identifier), a human-readable definition and its source, as seen in Figure 1A. These terms describe not only common genomic annotations (e.g., exon, intron, binding_site), but also experimental features (e.g., microarray_oligo, smFISH_probe), which makes it possible to connect the sequence and its biology to the results of an experiment.
In the SO, the terms are linked by a set of relationships that allow logical inferences to be made about the annotated data. In addition, clearly defined relationships between terms restrict the way in which annotation can be performed.
For example, in SO an ‘Exon’ is part of a ‘Transcript’ whereas an ‘Intron’ is part of a ‘Primary transcript’ which is a ‘Transcript’, as seen in Figure 1B. It is however incorrect to state that an ‘Intron’ is part of a ‘Transcript’. These clear relationships ultimately help to maintain consistency across the annotations of different projects.
Figure 1. Terms and their relationships in SO. A) Each term is identified by its accession number, a description, synonyms and any cross references, as well as all direct parent and child terms linked to it. B) The relationships between terms (arrows) are labelled as follows: (i) indicates an “is_a” relationship. (P) indicates a “part_of” relationship. (d) indicates a “derived_from” relationship (source: Eilbeck, K. et al. Genome Biol 6, R44 (2005)).
The ontology relies on three types of relationships, namely:
is_a: represents hierarchies (e.g., mRNA is_a Processed transcript)
derived_from: indicates that one term is derived from another (e.g., polypeptide derived_from mRNA)
part_of: represents part-whole relationships between terms (e.g., exon part_of transcript)
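The three relationship types can be illustrated with a small sketch. It stores the examples quoted in this section as (subject, relation, object) triples and offers two lookups: one that follows is_a links upward, and one that checks whether a relationship is explicitly stated. The triple list is a toy subset, and the names `TRIPLES`, `build_index`, `is_a_ancestors` and `is_asserted` are hypothetical helpers introduced here for illustration; they are not part of SO or of any SO tooling.

```python
from collections import defaultdict

# Relationships quoted in this section, as (subject, relation, object) triples.
# Toy subset for illustration only, not the full Sequence Ontology.
TRIPLES = [
    ("mRNA", "is_a", "processed_transcript"),
    ("primary_transcript", "is_a", "transcript"),
    ("polypeptide", "derived_from", "mRNA"),
    ("exon", "part_of", "transcript"),
    ("intron", "part_of", "primary_transcript"),
]


def build_index(triples):
    """Index objects by (subject, relation) for quick lookup."""
    index = defaultdict(set)
    for subject, relation, obj in triples:
        index[(subject, relation)].add(obj)
    return index


def is_a_ancestors(term, index):
    """Return every term reachable from `term` by following is_a links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in index.get((current, "is_a"), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_asserted(subject, relation, obj, index):
    """True only if this exact relationship is stated in the triples."""
    return obj in index.get((subject, relation), set())


INDEX = build_index(TRIPLES)
```

With this toy data, `is_asserted("intron", "part_of", "primary_transcript", INDEX)` holds while `is_asserted("intron", "part_of", "transcript", INDEX)` does not, which mirrors the Intron example given above. A real application would load the relationships from the SO RDF file rather than hard-code them.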
Implementation in RDF for SPHN
The SO is made available as-is by the SPHN DCC.
The namespace used is: <http://purl.obolibrary.org/obo/SO_>
Each version of SO in RDF carries a version IRI that identifies the release.
For example, http://purl.obolibrary.org/obo/so/2021-11-22/so.owl indicates that the ontology is from the 2021-11-22 release of SO.
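As a minimal sketch of how the namespace and the version IRI might be handled in practice, the helpers below expand an accession such as SO:0000704 into its full IRI and read the release date out of a version IRI. The function names `accession_to_iri` and `release_date` are hypothetical, and the sketch assumes that accessions consist of `SO:` followed by seven digits and that the version IRI follows the pattern of the example above.

```python
import re
from datetime import date

# Namespace stated in this section.
SO_NAMESPACE = "http://purl.obolibrary.org/obo/SO_"


def accession_to_iri(accession):
    """Turn an accession such as 'SO:0000704' into its full IRI.

    Assumes the form 'SO:' followed by seven digits.
    """
    match = re.fullmatch(r"SO:(\d{7})", accession)
    if match is None:
        raise ValueError(f"not a SO accession: {accession!r}")
    return SO_NAMESPACE + match.group(1)


def release_date(version_iri):
    """Extract the release date from a versioned SO IRI.

    Assumes the pattern '.../so/YYYY-MM-DD/...' seen in the example IRI.
    """
    match = re.search(r"/so/(\d{4})-(\d{2})-(\d{2})/", version_iri)
    if match is None:
        raise ValueError(f"no release date found in: {version_iri!r}")
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)
```

Both helpers raise `ValueError` on input that does not match the assumed pattern, so a malformed identifier is reported rather than silently turned into a wrong IRI.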
In SO, a concept is defined with the following structure:
SO:0000704 a owl:Class ;
    rdfs:label "gene"^^xsd:string ;
    IAO:0000115 "A region (or regions) that includes all of the sequence elements necessary to encode a functional transcript. A gene may include regulatory regions, transcribed regions and/or other functional sequence regions."^^xsd:string ;
    oboInOwl:hasDbXref "http://en.wikipedia.org/wiki/Gene"^^xsd:string ;
    oboInOwl:hasExactSynonym "INSDC_feature:gene"^^xsd:string ;
    oboInOwl:hasOBONamespace "sequence"^^xsd:string ;
    oboInOwl:id "SO:0000704"^^xsd:string ;
    oboInOwl:inSubset so:SOFA ;
    rdfs:comment "This term is mapped to MGED. Do not obsolete without consulting MGED ontology. A gene may be considered as a unit of inheritance."^^xsd:string ;
    rdfs:subClassOf [ a owl:Restriction ;
            owl:onProperty so:member_of ;
            owl:someValuesFrom SO:0005855 ],
        SO:0001411 .
Availability and usage rights
The SO RDF file is available via the Terminology Service.
SO is maintained by the Eilbeck Lab, Department of Biomedical Informatics, University of Utah, Salt Lake City.
SO data and data products are licensed under the Creative Commons Attribution 4.0 Unported License (http://www.sequenceontology.org/?page_id=345).