sramongo mappings¶

The database created by sra2mongo consists of a single document that is organized hierarchically:

ncbi
- sra
  - organization
  - submission
  - study
  - run
  - sample
- biosample
- bioproject
- pubmed

This can be thought of as a large JSON object or Python dict whose levels are accessed by indexing through the hierarchy (e.g., ncbi.sra.run.run_id). MongoDB's query language makes it easy to search through the document.
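As a minimal sketch of the dotted-path idea, the snippet below models a toy record as a plain Python dict and resolves a path one level at a time. The helper name `get_path` and all values are hypothetical and not part of sramongo. The keys follow the hierarchy listed above, and `srr` is the run identifier field described in the run section. A single run is shown for simplicity.

```python
# Toy record following the hierarchy above; all values are made up.
record = {
    "ncbi": {
        "sra": {
            "study": {"accn": "SRP000001"},
            "run": {"srr": "SRR000001", "nspots": 1000},
        },
        "biosample": {"accn": "SAMN00000001"},
    }
}


def get_path(doc, dotted_path, default=None):
    """Walk a nested dict using a dotted path such as 'ncbi.sra.run.srr'.

    Hypothetical helper: returns `default` as soon as a level is missing
    or is not a dict.
    """
    node = doc
    for key in dotted_path.split("."):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


run_id = get_path(record, "ncbi.sra.run.srr")
missing = get_path(record, "ncbi.pubmed.title", default="n/a")
```

MongoDB uses the same dot notation in query filters to reach into embedded documents, so a path that works here maps directly onto a query key.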

Note

One downside of storing all of this information in a single document is that MongoDB has a maximum document size of 16 MB. This is more than enough for metadata and text, but if you start adding data tables you may hit this limit.
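A rough way to watch for this limit before inserting is sketched below. It uses the length of the JSON text as an approximation of the stored size. This is an assumption made to keep the example to the standard library; the real limit applies to the BSON encoding, which can differ in size. The function names and the safety margin are hypothetical.

```python
import json

# MongoDB's maximum document size (16 MB), expressed in bytes.
MAX_DOCUMENT_BYTES = 16 * 1024 * 1024


def approx_size_bytes(doc):
    """Approximate a document's size as the UTF-8 length of its JSON text.

    This is only an estimate: the BSON size MongoDB checks can differ.
    Values JSON cannot encode (e.g. datetimes) are converted with str().
    """
    return len(json.dumps(doc, default=str).encode("utf-8"))


def fits_in_one_document(doc, safety_margin=0.9):
    """Return True if the estimate stays under a fraction of the limit."""
    return approx_size_bytes(doc) <= MAX_DOCUMENT_BYTES * safety_margin


metadata_only = {"title": "RNA-seq of ovary", "abstract": "text " * 200}
with_big_table = {"title": "RNA-seq of ovary", "table": "x" * (17 * 1024 * 1024)}
```

Here `metadata_only` passes the check and `with_big_table` does not, which matches the point above: text and metadata are small, while embedded data tables are what push a record toward the limit.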

ncbi ¶

This is the top-level document. Information from each database is stored under its name. As I add data normalization steps, I intend to aggregate data from the different databases and store the results in this top-level document.

sra ¶

This stores all information from the SRA. A couple of summary fields are also stored at this level. Each section of the SRA record is represented as a subdocument.

class sramongo.models.SraDocument(*args, **values)[source]¶

organization ¶

class sramongo.models.Organization(*args, **kwargs)[source]¶

Organization embedded document.

An organization contains information about the group that submitted to the SRA. For example, all data submitted to GEO are submitted to the SRA using the GEO credentials.

organization_type¶

Whether this organization is a center, an individual, or some other kind of group.

Type:	str

abbreviation¶

A short name for the organization.

Type:	str

name¶

Name of the organization.

Type:	str

email¶

Contact email address.

Type:	str

first_name¶

First name of the person who submitted the data.

Type:	str

last_name¶

Last name of the person who submitted the data.

Type:	str

submission ¶

study ¶

class sramongo.models.Study(*args, **kwargs)[source]¶

The contents of an SRA study.

A study consists of a set of experiments designed with an overall goal in mind. For example, this could include a control experiment and a treatment experiment with the goal being to identify expression differences resulting from the treatment. The SRA study is the top level of the submission hierarchy.

accn¶

The primary identifier for a study. Identifiers begin with SRP/ERP/DRP depending on which database they originate from.

Type:	mongoengine.StringField
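The prefix convention can be checked in code. The sketch below reads the first letter as the issuing archive (S for NCBI SRA, E for EBI ENA, D for DDBJ) and the third letter as the record type, covering only the study, run, and sample prefixes named in this document. The function name and the assumed shape of "three letters followed by digits" are illustrative, not part of sramongo.

```python
import re

# First letter of the accession: the archive that issued it.
ARCHIVES = {"S": "NCBI SRA", "E": "EBI ENA", "D": "DDBJ"}

# Third letter: the record type (only those described in this document).
RECORD_TYPES = {"P": "study", "R": "run", "S": "sample"}

# Assumed shape: archive letter, "R", record-type letter, then digits.
ACCESSION_RE = re.compile(r"^([SED])R([PRS])(\d+)$")


def accession_origin(accn):
    """Return (archive, record_type) for an accession, or None if unrecognized."""
    match = ACCESSION_RE.match(accn.strip().upper())
    if match is None:
        return None
    archive_letter, type_letter, _digits = match.groups()
    return ARCHIVES[archive_letter], RECORD_TYPES[type_letter]


study_origin = accession_origin("SRP000001")
run_origin = accession_origin("err123")
```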

bioproject¶

The associated BioProject identifier.

Type:	mongoengine.StringField

geo¶

The associated GEO identifier.

Type:	mongoengine.StringField

pubmed¶

The associated Pubmed identifiers.

Type:	mongoengine.StringField

title¶

The title of the study.

Type:	mongoengine.StringField

abstract¶

Abstract of the study.

Type:	mongoengine.StringField

center_name¶

Name of the submitting center.

Type:	mongoengine.StringField

center_project_name¶

Center specific identifier for the study.

Type:	mongoengine.StringField

description¶

Additional text describing the study.

Type:	mongoengine.StringField

run ¶

class sramongo.models.Run(*args, **kwargs)[source]¶

Run Document.

A Run describes a dataset generated from an Experiment. For example, if an Experiment is sequenced on multiple lanes of an Illumina flowcell, then the data from each lane are considered a Run.

srr¶

The primary identifier for a run. Identifiers begin with SRR/ERR/DRR depending on which database they originate from.

Type:	mongoengine.StringField

nspots¶

The total number of spots on an Illumina flowcell.

Type:	mongoengine.IntField

nbases¶

The number of bases.

Type:	mongoengine.IntField

nreads¶

The number of reads.

Type:	mongoengine.IntField

read_count_r1¶

Some Runs have additional information on reads. This is the number of reads in single-end data, or the number of first-mate reads in paired-end data.

Type:	mongoengine.FloatField

read_len_r1¶

This is the average read length in single-end data, or the average length of first-mate reads in paired-end data.

Type:	mongoengine.FloatField

read_count_r2¶

This is the number of second-mate reads in paired-end data.

Type:	mongoengine.FloatField

read_len_r2¶

This is the average length of second-mate reads in paired-end data.

Type:	mongoengine.FloatField
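The four read fields above can be combined into simple per-run summaries. The sketch below works on a run represented as a plain dict and assumes that fields a Run lacks are either absent or None. Both helper names are hypothetical. Treating a positive read_count_r2 as the sign of paired-end data is a heuristic suggested by the field descriptions, not something sramongo is known to do.

```python
def is_paired_end(run):
    """Heuristic: a run is paired-end if it reports second-mate reads."""
    count_r2 = run.get("read_count_r2")
    return count_r2 is not None and count_r2 > 0


def estimated_bases(run):
    """Estimate total bases as count * average length, summed over mates.

    Mates whose count or length is missing are skipped.
    """
    total = 0.0
    for mate in ("r1", "r2"):
        count = run.get(f"read_count_{mate}")
        length = run.get(f"read_len_{mate}")
        if count is not None and length is not None:
            total += count * length
    return total


# Made-up example runs.
single_end_run = {"srr": "SRR000001", "read_count_r1": 100.0, "read_len_r1": 50.0}
paired_end_run = {
    "srr": "SRR000002",
    "read_count_r1": 100.0,
    "read_len_r1": 50.0,
    "read_count_r2": 100.0,
    "read_len_r2": 48.5,
}
```

Because the lengths are averages, the estimate may not match nbases exactly; it is best used as a sanity check.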

release_date¶

Release date of the Run. This information is from the runinfo table and not the XML.

Type:	mongoengine.DateTimeField

load_date¶

Date the Run was uploaded. This information is from the runinfo table and not the XML.

Type:	mongoengine.DateTimeField

size_MB¶

Size of the Run file. This information is from the runinfo table and not the XML.

Type:	mongoengine.IntField

sample ¶

class sramongo.models.Sample(*args, **kwargs)[source]¶

The contents of an SRA sample.

A sample is the biological unit. An individual sample or a pool of samples can be used in an SRA Experiment. This document contains information describing the sample, ranging from species information to detailed descriptions of what material was collected and how.

accn¶

The primary identifier for a sample. Identifiers begin with SRS/ERS/DRS depending on which database they originate from.

Type:	mongoengine.StringField

biosample¶

The associated BioSample identifier.

Type:	mongoengine.StringField

geo¶

The associated GEO identifier.

Type:	mongoengine.StringField

title¶

The title of the sample.

Type:	mongoengine.StringField

taxon_id¶

The NCBI taxon id.

Type:	mongoengine.IntField

scientific_name¶

The scientific name.

Type:	mongoengine.StringField

common_name¶

The common name.

Type:	mongoengine.StringField

attributes¶

A set of key:value pairs describing the sample. For example tissue:ovary or sex:female.

Type:	mongoengine.DictField

biosample ¶

Information from the BioSample database is stored here.

class sramongo.models.BioSample(*args, **kwargs)[source]¶

The contents of a BioSample.

BioSample is another database housed at NCBI, which records sample metadata. This information should already be present in the SRA sample information, but to be safe we also pull from BioSample for additional metadata.

accn¶

The primary identifier for a BioSample. This is the accession number, which begins with SAM.

Type:	mongoengine.StringField

id¶

The primary numeric identifier for a BioSample. This is the id number.

Type:	mongoengine.IntField

title¶

The title of the sample.

Type:	mongoengine.StringField

description¶

A free text description of the sample.

Type:	mongoengine.StringField

publication_date¶

Date the sample was published.

Type:	mongoengine.StringField

last_update¶

Last time BioSample updated sample information.

Type:	mongoengine.StringField

submission_date¶

Date the sample was submitted.

Type:	mongoengine.StringField

attributes¶

A list of dictionaries containing key:value pairs describing the experiment. The stored dictionaries are of the form {'name': name, 'value': value}. This was done to make querying easier.

Type:	mongoengine.ListField of mongoengine.DictField
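To illustrate why the name/value layout helps querying: with a plain dict every attribute name becomes its own key, whereas a list of name/value entries can be searched with a single MongoDB `$elemMatch` filter. The sketch below converts a dict into that layout and builds such a filter as a plain Python dict. The helper names and the field path `biosample.attributes` are assumptions; adjust the path to wherever the field lives in your collection.

```python
def to_name_value_list(attributes):
    """Convert {'tissue': 'ovary'} into [{'name': 'tissue', 'value': 'ovary'}]."""
    return [{"name": name, "value": value} for name, value in attributes.items()]


def attribute_filter(name, value, field="biosample.attributes"):
    """Build a MongoDB filter matching one list entry with this name and value.

    The default field path is an assumption about the collection layout.
    """
    return {field: {"$elemMatch": {"name": name, "value": value}}}


def has_attribute(attribute_list, name, value):
    """Pure-Python equivalent of the filter, useful for checking locally."""
    return any(
        entry.get("name") == name and entry.get("value") == value
        for entry in attribute_list
    )


# Made-up attributes, using the examples from the sample section.
stored = to_name_value_list({"tissue": "ovary", "sex": "female"})
query = attribute_filter("tissue", "ovary")
```

The resulting `query` dict is in the form accepted by pymongo's `Collection.find`. `$elemMatch` requires the name and value to match within the same list entry, which prevents a false hit where one entry has the right name and another the right value.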

bioproject ¶

Information from the BioProject database is stored here.

class sramongo.models.BioProject(*args, **kwargs)[source]¶

The contents of a BioProject.

BioProject is another database housed at NCBI, which records project metadata. This information should already be present in the SRA information, but to be safe we also pull from BioProject for additional metadata.

accn¶

The primary identifier for a BioProject. This is the accession number, which begins with PRJ.

Type:	mongoengine.StringField

id¶

The primary numeric identifier for a BioProject. This is the id number.

Type:	mongoengine.IntField

name¶

A brief name of the project.

Type:	mongoengine.StringField

title¶

The title of the project.

Type:	mongoengine.StringField

description¶

A short description of the project.

Type:	mongoengine.StringField

last_date¶

Last date the BioProject was updated.

Type:	mongoengine.DateTimeField

submission_date¶

Date the BioProject was submitted.

Type:	mongoengine.DateTimeField

pubmed ¶

Information from the Pubmed database is stored here.

class sramongo.models.Pubmed(*args, **kwargs)[source]¶

The contents of a Pubmed document.

This document contains specific information about publications.

accn¶

The primary identifier for Pubmed. This is the accession number, which begins with PMID.

Type:	mongoengine.StringField

title¶

Title of the paper.

Type:	mongoengine.StringField

abstract¶

Paper abstract.

Type:	mongoengine.StringField

authors¶

List of authors.

Type:	mongoengine.ListField

citation¶

Citation information for the paper.

Type:	mongoengine.StringField

date_created¶

Date the pubmed entry was created.

Type:	mongoengine.DateTimeField

date_completed¶

Date the pubmed entry was completed.

Type:	mongoengine.DateTimeField

date_revised¶

Date the pubmed entry was last updated.

Type:	mongoengine.DateTimeField