sramongo mappings¶
The database created by sra2mongo
consists of a single document that is
organized hierarchically:
This can be thought of as a giant JSON object or Python dict whose levels can
be accessed by indexing through them (e.g., ncbi.sra.run.run_id). MongoDB has a
capable query system that makes it easy to search through the document.
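As a rough illustration of the dict analogy, the sketch below walks a nested Python dict with a dotted path. The document contents and the `get_path` helper are hypothetical; only the nesting and the example path come from the text above.

```python
from functools import reduce

# Hypothetical, heavily trimmed document; only the nesting mirrors the text.
doc = {
    "ncbi": {
        "sra": {
            "run": {"run_id": "SRR0000001"},
        }
    }
}


def get_path(document, dotted_path):
    """Walk a nested dict using a dotted path such as 'ncbi.sra.run.run_id'."""
    return reduce(lambda level, key: level[key], dotted_path.split("."), document)


run_id = get_path(doc, "ncbi.sra.run.run_id")

# MongoDB query filters accept the same dot notation as a key.
query = {"ncbi.sra.run.run_id": run_id}
```

A filter dict like `query` is what would be handed to a MongoDB driver's find call; the pure-Python lookup is only there to show how the path maps onto the nesting.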
Note
One downside of storing all of this information as a single document is that MongoDB has a maximum document size of 16 MB. This is more than enough for storing metadata and text, but if you start adding data tables you may hit this limit.
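A quick way to see how close a record is to the limit is to estimate its serialized size before inserting it. The sketch below uses the length of the JSON encoding as a stand-in; this is an assumption for illustration, since MongoDB measures the BSON encoding, which differs somewhat in size. The helper names and sample records are hypothetical.

```python
import json

MAX_DOCUMENT_BYTES = 16 * 1024 * 1024  # MongoDB's per-document limit


def approx_size_bytes(document):
    """Rough estimate: byte length of the UTF-8 encoded JSON serialization."""
    return len(json.dumps(document, default=str).encode("utf-8"))


def fits_in_one_document(document, limit=MAX_DOCUMENT_BYTES):
    """True if the estimated size is under the limit."""
    return approx_size_bytes(document) < limit


# Metadata and text stay far below the limit.
metadata_only = {"title": "example study", "abstract": "text " * 1000}

# A numeric table with two million cells does not.
with_table = {
    "title": "example study",
    "table": [[0.123456] * 10 for _ in range(200_000)],
}
```

Because the estimate is approximate, it is safer to treat it as a warning threshold than as an exact check.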
ncbi¶
This is the top-level document. Information from each database is stored under its name. As I add data normalization steps I intend to aggregate data from the different databases and store the results in this top-level document.
sra¶
This stores all information from the SRA. There are also a couple of summary fields that are stored at this level. Each section of the SRA record is represented as a subdocument.
organization¶
-
class sramongo.models.Organization(*args, **kwargs)
Organization embedded document.
An organization contains information about the group that submitted to SRA. For example, all data submitted to GEO are submitted to SRA using the GEO credentials.
study¶
-
class sramongo.models.Study(*args, **kwargs)
The contents of an SRA study.
A study consists of a set of experiments designed with an overall goal in mind. For example, this could include a control experiment and a treatment experiment with the goal being to identify expression differences resulting from the treatment. The SRA study is the top level of the submission hierarchy.
-
accn
¶ The primary identifier for a study. Identifiers begin with SRP/ERP/DRP depending on which database they originate from.
Type: mongoengine.StringField
-
bioproject
¶ The associated BioProject identifier.
Type: mongoengine.StringField
-
geo
¶ The associated GEO identifier.
Type: mongoengine.StringField
-
pubmed
The associated Pubmed identifiers.
Type: mongoengine.StringField
-
title
¶ The title of the study.
Type: mongoengine.StringField
-
abstract
¶ Abstract of the study.
Type: mongoengine.StringField
-
center_name
¶ Name of the submitting center.
Type: mongoengine.StringField
-
center_project_name
¶ Center specific identifier for the study.
Type: mongoengine.StringField
-
description
¶ Additional text describing the study.
Type: mongoengine.StringField
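The accession prefix can be used to check a study identifier and to tell which archive issued it. The sketch below is a minimal illustration: the archive names in the mapping (NCBI SRA, EMBL-EBI ENA, DDBJ) are an assumption added here and are not stated in the field descriptions, and `study_archive` is a hypothetical helper, not part of sramongo.

```python
import re

# Assumption: the first letter of the accession identifies the issuing archive.
ARCHIVE_BY_PREFIX = {"S": "NCBI SRA", "E": "EMBL-EBI ENA", "D": "DDBJ"}

# Study accessions: SRP, ERP, or DRP followed by digits.
STUDY_ACCN = re.compile(r"^([SED])RP\d+$")


def study_archive(accn):
    """Return the issuing archive for a study accession, or None if malformed."""
    match = STUDY_ACCN.match(accn)
    return ARCHIVE_BY_PREFIX[match.group(1)] if match else None


examples = {accn: study_archive(accn) for accn in ("SRP000001", "ERP000001", "DRP000001")}
```

An identifier with a different record type, such as a run accession, does not match the pattern and yields `None`.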
run¶
-
class sramongo.models.Run(*args, **kwargs)
Run Document.
A Run describes a dataset generated from an Experiment. For example, if an Experiment is sequenced on multiple lanes of an Illumina flowcell, then the data from each lane are considered a Run.
-
srr
¶ The primary identifier for a run. Identifiers begin with SRR/ERR/DRR depending on which database they originate from.
Type: mongoengine.StringField
-
nspots
¶ The total number of spots on an Illumina flowcell.
Type: mongoengine.IntField
-
nbases
¶ The number of bases.
Type: mongoengine.IntField
-
nreads
¶ The number of reads.
Type: mongoengine.IntField
-
read_count_r1
¶ Some Runs have additional information on reads. This is the number of reads from single-end data, or the number of first reads (read 1) in paired-end data.
Type: mongoengine.FloatField
-
read_len_r1
¶ The average length of reads from single-end data, or of first reads (read 1) in paired-end data.
Type: mongoengine.FloatField
-
read_count_r2
¶ The number of second reads (read 2) in paired-end data.
Type: mongoengine.FloatField
-
read_len_r2
¶ The average length of second reads (read 2) in paired-end data.
Type: mongoengine.FloatField
-
release_date
¶ Release date of the Run. This information is from the runinfo table and not the XML.
Type: mongoengine.DateTimeField
-
load_date
¶ Date the Run was uploaded. This information is from the runinfo table and not the XML.
Type: mongoengine.DateTimeField
-
size_MB
¶ Size of the Run file. This information is from the runinfo table and not the XML.
Type: mongoengine.IntField
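The per-read fields can hint at the sequencing layout of a run. The sketch below works on plain dicts rather than on the mongoengine model, and it rests on an assumption not stated above: a run with read 1 counts but no read 2 counts is treated as single-end, and a run with neither is reported as unknown. The `read_layout` helper and the sample records are hypothetical.

```python
def read_layout(run):
    """Classify a run record (a plain dict here) from its per-read counts."""
    r1 = run.get("read_count_r1") or 0
    r2 = run.get("read_count_r2") or 0
    if r1 and r2:
        return "paired"
    if r1:
        return "single"
    # Not every Run carries the additional read information.
    return "unknown"


runs = [
    {"srr": "SRR0000001", "read_count_r1": 1000.0, "read_len_r1": 50.0},
    {"srr": "SRR0000002", "read_count_r1": 1000.0, "read_len_r1": 100.0,
     "read_count_r2": 1000.0, "read_len_r2": 100.0},
    {"srr": "SRR0000003", "nspots": 1000},
]

layouts = {run["srr"]: read_layout(run) for run in runs}
```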
sample¶
-
class sramongo.models.Sample(*args, **kwargs)
The contents of an SRA sample.
A sample is the biological unit. An individual sample or a pool of samples can be used in the SRA Experiment. This document contains information describing the sample, ranging from species information to detailed descriptions of what material was collected and how.
-
accn
¶ The primary identifier for a sample. Identifiers begin with SRS/ERS/DRS depending on which database they originate from.
Type: mongoengine.StringField
-
biosample
¶ The associated BioSample identifier.
Type: mongoengine.StringField
-
geo
¶ The associated GEO identifier.
Type: mongoengine.StringField
-
title
¶ The title of the sample.
Type: mongoengine.StringField
-
taxon_id
¶ The NCBI taxon id.
Type: mongoengine.IntField
-
scientific_name
¶ The scientific name.
Type: mongoengine.StringField
-
common_name
¶ The common name.
Type: mongoengine.StringField
-
attributes
¶ A set of key:value pairs describing the sample. For example tissue:ovary or sex:female.
Type: mongoengine.DictField
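Since the sample attributes are stored as a dictionary, each key can be addressed directly. The sketch below filters hypothetical sample records in plain Python and builds the matching MongoDB filter with dot notation. The path prefix in front of `attributes` depends on where the sample subdocument sits in the full document, so the filter shown is an assumption about layout.

```python
# Hypothetical sample records using the attribute examples from the text.
samples = [
    {"accn": "SRS0000001", "attributes": {"tissue": "ovary", "sex": "female"}},
    {"accn": "SRS0000002", "attributes": {"tissue": "testis", "sex": "male"}},
    {"accn": "SRS0000003", "attributes": {"sex": "female"}},
]


def with_attribute(records, key, value):
    """Return accessions of records whose attributes contain key == value."""
    return [
        record["accn"]
        for record in records
        if record.get("attributes", {}).get(key) == value
    ]


ovary_samples = with_attribute(samples, "tissue", "ovary")
female_samples = with_attribute(samples, "sex", "female")

# Dot notation reaches into the dictionary in a MongoDB filter.
mongo_filter = {"attributes.tissue": "ovary"}
```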
biosample¶
Information from the BioSample database is stored here.
-
class sramongo.models.BioSample(*args, **kwargs)
The contents of a BioSample.
BioSample is another database housed at NCBI which records sample metadata. This information should already be present in the Sra.sample information, but to be safe we can pull from BioSample for additional metadata.
-
accn
¶ The primary identifier for a BioSample. This is the accession number, which begins with SAM.
Type: mongoengine.StringField
-
id
¶ The primary identifier for a BioSample. This is the numeric id.
Type: mongoengine.IntField
-
title
¶ A free text description of the sample.
Type: mongoengine.StringField
-
description
¶ A free text description of the sample.
Type: mongoengine.StringField
-
publication_date
¶ Date the sample was published.
Type: mongoengine.StringField
-
last_update
¶ Last time BioSample updated sample information.
Type: mongoengine.StringField
-
submission_date
¶ Date the sample was submitted.
Type: mongoengine.StringField
-
attributes
¶ A list of dictionaries containing key:value pairs describing the experiment. The stored dictionaries are of the form {'name': name, 'value': value}. This was done to make querying easier.
Type: mongoengine.ListField of mongoengine.DictField
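One way to read the querying benefit: with a fixed name/value shape, every attribute lives at the same field paths, so a single filter can match any attribute. The sketch below mimics that match in plain Python and shows the corresponding MongoDB filter using the `$elemMatch` operator. The record contents and helper are hypothetical.

```python
# Hypothetical BioSample record in the list-of-dictionaries form.
biosample = {
    "accn": "SAMN00000001",
    "attributes": [
        {"name": "tissue", "value": "ovary"},
        {"name": "sex", "value": "female"},
    ],
}


def has_attribute(record, name, value):
    """True if a single attribute entry has both the given name and value."""
    return any(
        entry.get("name") == name and entry.get("value") == value
        for entry in record.get("attributes", [])
    )


# $elemMatch requires both conditions to hold within the same list element.
elem_match_filter = {
    "attributes": {"$elemMatch": {"name": "tissue", "value": "ovary"}}
}

# The list form converts back to a plain dictionary when that is more convenient.
as_dict = {entry["name"]: entry["value"] for entry in biosample["attributes"]}
```

Matching within one element matters: a record with tissue "ovary" and sex "female" should not match a search for tissue "female".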
bioproject¶
Information from the BioProject database is stored here.
-
class sramongo.models.BioProject(*args, **kwargs)
The contents of a BioProject.
BioProject is another database housed at NCBI which records project metadata. This information should already be present in the SRA information, but to be safe we can pull from BioProject for additional metadata.
-
accn
¶ The primary identifier for a BioProject. This is the accession number, which begins with PRJ.
Type: mongoengine.StringField
-
id
¶ The primary identifier for a BioProject. This is the numeric id.
Type: mongoengine.IntField
-
name
¶ A brief name of the project.
Type: mongoengine.StringField
-
title
¶ The title of the project.
Type: mongoengine.StringField
-
description
¶ A short description of the project.
Type: mongoengine.StringField
-
last_date
¶ Last date the BioProject was updated.
Type: mongoengine.DateTimeField
-
submission_date
¶ Date the BioProject was submitted.
Type: mongoengine.DateTimeField
pubmed¶
Information from Pubmed is stored here.
-
class sramongo.models.Pubmed(*args, **kwargs)
The contents of a Pubmed document.
This document contains specific information about publications.
-
accn
¶ The primary identifier for Pubmed. This is the accession number, which begins with PMID.
Type: mongoengine.StringField
-
title
¶ Title of the paper.
Type: mongoengine.StringField
-
abstract
¶ Paper abstract.
Type: mongoengine.StringField
List of authors.
Type: mongoengine.ListField
-
citation
¶ Citation information for the paper.
Type: mongoengine.StringField
-
date_created
¶ Date the pubmed entry was created.
Type: mongoengine.DateTimeField
-
date_completed
¶ Date the pubmed entry was completed.
Type: mongoengine.DateTimeField
-
date_revised
¶ Date the pubmed entry was last updated.
Type: mongoengine.DateTimeField