sramongo mappings

The database created by sra2mongo consists of a single document that is organized hierarchically:

This can be thought of as a giant JSON object or Python dict whose levels can be accessed by indexing through them (e.g., ncbi.sra.run.run_id). MongoDB has a flexible query system that makes it easy to search through the document.
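To make the dotted-path idea concrete, here is a minimal sketch that walks a plain Python dict along a path such as ncbi.sra.run.run_id. The helper name `get_path` and the sample values are hypothetical and only for illustration; they are not part of sramongo.

```python
from typing import Any


def get_path(document: dict, dotted_path: str, default: Any = None) -> Any:
    """Walk a nested dict along a dotted path such as 'ncbi.sra.run.run_id'."""
    current = document
    for key in dotted_path.split("."):
        # Stop as soon as a level is missing or is not a dict.
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hypothetical, heavily trimmed document; the values are placeholders.
example = {"ncbi": {"sra": {"run": {"run_id": "SRR0000001"}}}}

print(get_path(example, "ncbi.sra.run.run_id"))
print(get_path(example, "ncbi.sra.study.accn", default="missing"))
```

MongoDB query filters use the same dot notation for nested fields, so a path that works with this helper maps directly onto a filter key.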

Note

One downside of storing all of this information as a single document is that MongoDB has a maximum document size of 16 MB. This is more than enough for storing metadata and text, but if you start adding data tables you may hit this limit.
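A rough way to see how close a document is to that limit is to measure its serialized size before inserting it. The sketch below uses the UTF-8 JSON encoding as an approximation; the real limit applies to the BSON encoding, which can differ somewhat, so the function names and the safety margin here are assumptions rather than anything sramongo provides.

```python
import json

# MongoDB's document limit: 16 MB, i.e. 16 * 1024 * 1024 bytes.
MAX_DOCUMENT_BYTES = 16 * 1024 * 1024


def approx_size_bytes(document: dict) -> int:
    """Rough size estimate from the UTF-8 JSON form (not the exact BSON size)."""
    return len(json.dumps(document, default=str).encode("utf-8"))


def fits_in_one_document(document: dict, safety_margin: float = 0.9) -> bool:
    """Return True if the estimate stays below a fraction of the limit."""
    return approx_size_bytes(document) < MAX_DOCUMENT_BYTES * safety_margin


metadata_only = {"title": "example study", "abstract": "short text"}
with_big_table = {"table": "x" * (17 * 1024 * 1024)}

print(fits_in_one_document(metadata_only))
print(fits_in_one_document(with_big_table))
```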

ncbi

This is the top-level document. Information from each database is stored under that database's name. As I add data normalization steps, I intend to aggregate data from the different databases and store the results in this top-level document.

sra

This stores all information from the SRA. There are also a couple of summary fields stored at this level. Each section of the SRA record is represented as a subdocument.

class sramongo.models.SraDocument(*args, **values)

organization

class sramongo.models.Organization(*args, **kwargs)

Organization embedded document.

An organization contains information about the group that submitted to the SRA. For example, all data submitted to GEO are submitted to the SRA using the GEO credentials.

organization_type

Whether this organization is a center, an individual, or some other kind of group.

Type:str
abbreviation

A short name for the organization.

Type:str
name

Name of the organization.

Type:str
email

Contact email address.

Type:str
first_name

First name of the person who submitted the data.

Type:str
last_name

Last name of the person who submitted the data.

Type:str

study

class sramongo.models.Study(*args, **kwargs)

The contents of an SRA study.

A study consists of a set of experiments designed with an overall goal in mind. For example, this could include a control experiment and a treatment experiment with the goal being to identify expression differences resulting from the treatment. The SRA study is the top level of the submission hierarchy.

accn

The primary identifier for a study. Identifiers begin with SRP/ERP/DRP depending on which database they originate from.

Type:mongoengine.StringField
bioproject

The associated BioProject identifier.

Type:mongoengine.StringField
geo

The associated GEO identifier.

Type:mongoengine.StringField
pubmed

The associated PubMed identifiers.

Type:mongoengine.StringField
title

The title of the study.

Type:mongoengine.StringField
abstract

Abstract of the study.

Type:mongoengine.StringField
center_name

Name of the submitting center.

Type:mongoengine.StringField
center_project_name

Center specific identifier for the study.

Type:mongoengine.StringField
description

Additional text describing the study.

Type:mongoengine.StringField
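The study accession prefix encodes where a record originated. As a small sketch, the function below maps the first letter of the prefix to the archive it is commonly associated with (S for NCBI, E for EBI, D for DDBJ). The names `STUDY_ACCN`, `ARCHIVE_BY_LETTER`, and `study_origin` are hypothetical helpers, and the assumption that the prefix is followed only by digits is mine.

```python
import re

# Archive associated with the first letter of an accession prefix.
ARCHIVE_BY_LETTER = {
    "S": "NCBI SRA",
    "E": "EBI ENA",
    "D": "DDBJ DRA",
}

# Assumed shape of a study accession: SRP/ERP/DRP followed by digits.
STUDY_ACCN = re.compile(r"^([SED])RP\d+$")


def study_origin(accn: str) -> str:
    """Return the originating archive for a study accession."""
    match = STUDY_ACCN.match(accn)
    if match is None:
        raise ValueError(f"not a study accession: {accn!r}")
    return ARCHIVE_BY_LETTER[match.group(1)]


# Placeholder accessions, used only to exercise the function.
for accn in ("SRP000001", "ERP000001", "DRP000001"):
    print(accn, study_origin(accn))
```

The same pattern applies to run (SRR/ERR/DRR) and sample (SRS/ERS/DRS) accessions by swapping the final letter in the regular expression.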

run

class sramongo.models.Run(*args, **kwargs)

Run Document.

A Run describes a dataset generated from an Experiment. For example, if an Experiment is sequenced on multiple lanes of an Illumina flowcell, then the data from each lane are considered a Run.

srr

The primary identifier for a run. Identifiers begin with SRR/ERR/DRR depending on which database they originate from.

Type:mongoengine.StringField
nspots

The total number of spots on an Illumina flowcell.

Type:mongoengine.IntField
nbases

The number of bases.

Type:mongoengine.IntField
nreads

The number of reads.

Type:mongoengine.IntField
read_count_r1

Some Runs have additional information on reads. This is the number of reads in single-end data, or the number of first reads in paired-end data.

Type:mongoengine.FloatField
read_len_r1

This is the average read length in single-end data, or the average length of the first read in paired-end data.

Type:mongoengine.FloatField
read_count_r2

This is the number of second reads in paired-end data.

Type:mongoengine.FloatField
read_len_r2

This is the average length of the second read in paired-end data.

Type:mongoengine.FloatField
release_date

Release date of the Run. This information is from the runinfo table and not the XML.

Type:mongoengine.DateTimeField
load_date

Date the Run was uploaded. This information is from the runinfo table and not the XML.

Type:mongoengine.DateTimeField
size_MB

Size of the Run file. This information is from the runinfo table and not the XML.

Type:mongoengine.IntField
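The per-read fields can be combined to summarize a run. The sketch below works on a plain dict that uses the field names listed above. It assumes, without confirmation from sramongo, that a single-end run has a missing or zero read_count_r2; the helper names are hypothetical.

```python
def is_paired_end(run: dict) -> bool:
    """Treat a run as paired-end when it reports a non-zero second-read count."""
    return bool(run.get("read_count_r2"))


def approx_bases(run: dict) -> float:
    """Estimate total bases as count x average length, summed over both reads."""
    total = 0.0
    for suffix in ("r1", "r2"):
        count = run.get(f"read_count_{suffix}") or 0
        length = run.get(f"read_len_{suffix}") or 0
        total += count * length
    return total


# Placeholder runs; the numbers are made up for illustration.
paired = {
    "srr": "SRR0000001",
    "read_count_r1": 1000.0,
    "read_len_r1": 100.0,
    "read_count_r2": 1000.0,
    "read_len_r2": 75.0,
}
single = {"srr": "SRR0000002", "read_count_r1": 500.0, "read_len_r1": 50.0}

print(is_paired_end(paired), approx_bases(paired))
print(is_paired_end(single), approx_bases(single))
```

Because the lengths are averages, the result is an estimate and may not match nbases exactly.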

sample

class sramongo.models.Sample(*args, **kwargs)

The contents of an SRA sample.

A sample is the biological unit. An individual sample or a pool of samples can be used in the SRA Experiment. This document contains information describing the sample, ranging from species information to detailed descriptions of what material was collected and how.

accn

The primary identifier for a sample. Identifiers begin with SRS/ERS/DRS depending on which database they originate from.

Type:mongoengine.StringField
biosample

The associated BioSample identifier.

Type:mongoengine.StringField
geo

The associated GEO identifier.

Type:mongoengine.StringField
title

The title of the sample.

Type:mongoengine.StringField
taxon_id

The NCBI taxon id.

Type:mongoengine.IntField
scientific_name

The scientific name.

Type:mongoengine.StringField
common_name

The common name.

Type:mongoengine.StringField
attributes

A set of key:value pairs describing the sample. For example tissue:ovary or sex:female.

Type:mongoengine.DictField

biosample

Information from the BioSample database is stored here.

class sramongo.models.BioSample(*args, **kwargs)

The contents of a BioSample.

BioSample is another database housed at NCBI which records sample metadata. This information should already be present in the Sra.sample information, but to be safe we can pull in the BioSample record for additional metadata.

accn

The primary identifier for a BioSample. This is the accession number, which begins with SAM.

Type:mongoengine.StringField
id

The primary identifier for a BioSample. This is the numeric id.

Type:mongoengine.IntField
title

A free text description of the sample.

Type:mongoengine.StringField
description

A free text description of the sample.

Type:mongoengine.StringField
publication_date

Date the sample was published.

Type:mongoengine.StringField
last_update

Last time BioSample updated sample information.

Type:mongoengine.StringField
submission_date

Date the sample was submitted.

Type:mongoengine.StringField
attributes

A list of dictionaries containing key:value pairs describing the experiment. The stored dictionaries are of the form {'name': value, 'value': value}. This was done to make querying easier.

Type:mongoengine.ListField of mongoengine.DictField
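One plausible reason the name/value layout is easier to query is that every attribute has the same shape, so a single `$elemMatch` filter can match any of them. The sketch below converts a plain key:value dict into that layout and builds such a filter as an ordinary Python dict. The helper names and the field path ncbi.biosample.attributes are assumptions for illustration; check the path against your own database.

```python
def to_name_value_list(attributes: dict) -> list:
    """Convert {'tissue': 'ovary'} into [{'name': 'tissue', 'value': 'ovary'}]."""
    return [{"name": name, "value": value} for name, value in attributes.items()]


def attribute_filter(field_path: str, name: str, value) -> dict:
    """Build a MongoDB filter matching one list element with this name and value."""
    return {field_path: {"$elemMatch": {"name": name, "value": value}}}


def matches(attributes: list, name: str, value) -> bool:
    """Pure-Python equivalent of the $elemMatch test, handy for quick checks."""
    return any(a.get("name") == name and a.get("value") == value for a in attributes)


attrs = to_name_value_list({"tissue": "ovary", "sex": "female"})
query = attribute_filter("ncbi.biosample.attributes", "tissue", "ovary")

print(attrs)
print(query)
print(matches(attrs, "tissue", "ovary"))
```

The resulting `query` dict could be passed as the filter argument of a find call in a MongoDB driver; no driver is imported here so that the example runs without a database.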

bioproject

Information from the BioProject database is stored here.

class sramongo.models.BioProject(*args, **kwargs)

The contents of a BioProject.

BioProject is another database housed at NCBI which records project metadata. This information should already be present in the SRA information, but to be safe we can pull in the BioProject record for additional metadata.

accn

The primary identifier for a BioProject. This is the accession number, which begins with PRJ.

Type:mongoengine.StringField
id

The primary identifier for a BioProject. This is the numeric id.

Type:mongoengine.IntField
name

A brief name of the project.

Type:mongoengine.StringField
title

The title of the project.

Type:mongoengine.StringField
description

A short description of the project.

Type:mongoengine.StringField
last_date

Last date the BioProject was updated.

Type:mongoengine.DateTimeField
submission_date

Date the BioProject was submitted.

Type:mongoengine.DateTimeField

pubmed

Information from PubMed is stored here.

class sramongo.models.Pubmed(*args, **kwargs)

The contents of a PubMed document.

This document contains specific information about publications.

accn

The primary identifier for PubMed. This is the accession number, which begins with PMID.

Type:mongoengine.StringField
title

Title of the paper.

Type:mongoengine.StringField
abstract

Paper abstract.

Type:mongoengine.StringField
authors

List of authors.

Type:mongoengine.ListField
citation

Citation information for the paper.

Type:mongoengine.StringField
date_created

Date the PubMed entry was created.

Type:mongoengine.DateTimeField
date_completed

Date the PubMed entry was completed.

Type:mongoengine.DateTimeField
date_revised

Date the PubMed entry was last updated.

Type:mongoengine.DateTimeField