A user guide for SCPortalen
This user guide is intended for researchers with an interest in single-cell omics data. It explains, step by step, three exemplary use-cases of SCPortalen. SCPortalen is single-cell centric, i.e. the database integrates datasets at single-cell resolution. Each single cell is assigned a unique cell identification number (cell_id). Through this cell_id the user can search for and retrieve detailed information about each single cell. A group of single cells generated in a single study has a study_id.
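The relationship between cell_id and study_id can be pictured with a small sketch. It is only an illustration of the idea, not the portal's actual schema: the record layout, the study_id values and every cell_id other than 1772-062-248_A01 are made up.

```python
from collections import defaultdict

# Hypothetical records. Only the field names cell_id and study_id come from
# the guide; the study_id values and the last two cell_ids are invented.
cells = [
    {"cell_id": "1772-062-248_A01", "study_id": "STUDY_A"},
    {"cell_id": "EXAMPLE_CELL_2", "study_id": "STUDY_A"},
    {"cell_id": "EXAMPLE_CELL_3", "study_id": "STUDY_B"},
]


def index_cells(records):
    """Return (records keyed by cell_id, cell_ids grouped by study_id)."""
    by_cell = {}
    by_study = defaultdict(list)
    for rec in records:
        by_cell[rec["cell_id"]] = rec
        by_study[rec["study_id"]].append(rec["cell_id"])
    return by_cell, dict(by_study)


by_cell, by_study = index_cells(cells)
print(by_cell["1772-062-248_A01"]["study_id"])
print(by_study["STUDY_A"])
```

Looking up one cell_id returns a single record, while a study_id returns every cell generated in that study.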
From the landing page of SCPortalen (www.single-cell.clst.riken.jp), the user first selects the type of dataset to work with.
Currently SCPortalen distinguishes two types of dataset:
1. Single-cell images dataset.
2. Single-cell Transcriptomics dataset.
The single-cell image dataset features single-cell microscopic images and movies, and also stores experimental metadata, image analysis results and gene expression data at the single-cell level. These data can be used for different purposes. Here we introduce three potential use-cases:
1- Cell images enable quality checks on individual cells: the user starts by querying the metadata (e.g. cell fluorescence, cDNA concentration, read counts) and then selects the cell or subset of cells of interest.
2- Explore cell images and the associated expression profiles.
3- Search for genes of interest, find their associated expression values [TPM], and track a single cell based on a gene expression value.
The system is not limited to the above examples; it provides a wide range of features that the user can explore to work with the available datasets.
The cell with the cell_id 1772-062-248_A01 is displayed in the 3rd row of the first table of the Fucci metadata (see Figure 1). The associated cell images (bright field, green fluorescence and red fluorescence mode) and the sequencing information are linked to this cell. Clicking on the cell images link opens the cell images; clicking on Sequencing redirects the user to the FASTQ files (two files per cell) and to links for the BAM files, expression table and genome coordinate visualizations.
Figure 1: List of 12 cells (one batch) in the Fucci metadata view
Here we explain how cell images can be explored using the query interface. The query filters cells according to specific data values (quantitative and binary attributes). For example, one can filter cells according to the distribution of fluorescence values in the cell. In the Fucci metadata, cells can be filtered by a cut-off value in a fluorescence mode (e.g. the Ch2_corrected column of the metadata table, in which the green fluorescence value is recorded) (Figure 2).
Figure 2: Search for a cell with green fluorescence concentration of more than 120 TPM / Explore cell image and expression profile.
In the Fucci metadata, cells can also be filtered according to the type of error that was assigned by manually checking all cell image files.
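The same kind of filtering can be reproduced offline on an exported metadata table. The following is a minimal sketch, assuming a CSV export with a cell_id column and the Ch2_corrected column; the error_type column name, the cell names and all values are hypothetical and only serve as an illustration.

```python
import csv
import io

# Stand-in for a CSV export of the Fucci metadata. Only cell_id and
# Ch2_corrected are named in the guide; "error_type" and all values are
# invented for this example.
EXAMPLE_CSV = """cell_id,Ch2_corrected,error_type
cell_A,95.5,none
cell_B,130.2,none
cell_C,210.0,debris
"""


def filter_cells(csv_text, column, cutoff, allowed_errors=None):
    """Return the cell_ids whose `column` value is greater than `cutoff`.

    If `allowed_errors` is given, keep only rows whose error_type is in it.
    """
    selected = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if float(row[column]) <= cutoff:
            continue
        if allowed_errors is not None and row["error_type"] not in allowed_errors:
            continue
        selected.append(row["cell_id"])
    return selected


bright = filter_cells(EXAMPLE_CSV, "Ch2_corrected", 120)
clean_bright = filter_cells(EXAMPLE_CSV, "Ch2_corrected", 120, allowed_errors={"none"})
print(bright)
print(clean_bright)
```

The first call applies only the fluorescence cut-off; the second additionally restricts the result to cells without an assigned error.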
This use-case demonstrates how to search for gene(s) of interest, retrieve their expression [TPM], and get all cells expressing a particular gene. The search interface accepts three search parameters: the gene symbol, the Encode gene id, and the gene description (Figure 3).
Figure 3: Searching for gene of interest in all samples. The output is a list of single cells and the associated expression value [TPM] per cell.
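What the gene search returns can be mimicked on a downloaded expression table. This is a sketch under stated assumptions: the nested-dictionary layout, the gene names and the TPM values are hypothetical, and treating a TPM above a threshold as "expressed" is a choice made here for illustration, not a rule stated by the portal.

```python
# Hypothetical expression table: gene symbol -> {cell_id: TPM}.
EXPRESSION_TPM = {
    "GENE_X": {"cell_A": 0.0, "cell_B": 12.5, "cell_C": 3.1},
    "GENE_Y": {"cell_A": 7.0, "cell_B": 0.0, "cell_C": 0.0},
}


def cells_expressing(table, gene, min_tpm=0.0):
    """Return (cell_id, TPM) pairs with TPM above `min_tpm`, highest first."""
    per_cell = table.get(gene, {})
    hits = [(cell, tpm) for cell, tpm in per_cell.items() if tpm > min_tpm]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


all_hits = cells_expressing(EXPRESSION_TPM, "GENE_X")
strong_hits = cells_expressing(EXPRESSION_TPM, "GENE_X", min_tpm=5.0)
print(all_hits)
print(strong_hits)
```

Raising min_tpm narrows the list to cells with higher expression, which is one way to track a single cell based on a gene expression value.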
All database data is provided for bulk download in two formats:
1- MySQL format: SQL files with the data definition and data manipulation statements embedded. All tables are easy to import into any version of the MySQL database.
2- CSV format: a format widely used in biology and bioinformatics and easily opened in text editors.
Selected data can be exported in different formats (e.g. CSV, XML). This flexibility in data format will allow users to obtain data in a way that is most suitable for their individual platform requirements.
Under the Single-cell transcriptomics dataset section of the database, the user will find collected datasets that have been published elsewhere. For each dataset we performed manual curation of the meta information, ontology and functional annotation, and a search for expressed genes in each single-cell sample. We provide an overview of the dataset, options for download, and a link to the original journal publication (Figure 4). Each dataset is searchable by accession number or by any phrase or keyword. Through the interface in Figure 5, the user can download the curated metadata (1) or the metadata provided by the authors (2) in compressed format.
The dataset view in Figure 5 provides principal component analysis (PCA) and the PCA matrix (3), and t-Distributed Stochastic Neighbor Embedding (t-SNE) analysis (4) of the dataset. From the dataset overview the user can access the detailed information about each single cell by clicking on the single-cell sample list (5).
Figure 4: Single-cell Transcriptomics dataset / study layout.
Detailed meta information about each single-cell sample is provided based on manual curation. For each single cell a minimum set of attributes is provided, as illustrated in Figure (5). We used the run and sample accession numbers as provided by the public data repositories (DDBJ, GEO and ENA) as the cell identification number. Other attributes assigned to each single cell are the study accession number, the organism from which the sample was dissected, cell type, sequencing platform, assay type, library protocol and layout, and the ontology term assigned to the single cell. Using any of these attributes, the user can retrieve a single cell or a group of cells that satisfy the search criteria.
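Retrieving cells by any combination of these attributes can be sketched as follows. The attribute keys loosely follow the list above, but the key spellings, the accession-like identifiers and all values are hypothetical; this is an illustration of attribute-based search, not the portal's query implementation.

```python
# Hypothetical curated sample records; all identifiers and values are invented.
SAMPLES = [
    {"cell_id": "RUN_0001", "study": "STUDY_1", "organism": "Homo sapiens",
     "cell_type": "fibroblast", "layout": "paired"},
    {"cell_id": "RUN_0002", "study": "STUDY_1", "organism": "Homo sapiens",
     "cell_type": "neuron", "layout": "paired"},
    {"cell_id": "RUN_0003", "study": "STUDY_2", "organism": "Mus musculus",
     "cell_type": "neuron", "layout": "single"},
]


def search_samples(samples, **criteria):
    """Return the samples matching every criterion (case-insensitive)."""
    def matches(sample):
        return all(
            str(sample.get(key, "")).lower() == str(value).lower()
            for key, value in criteria.items()
        )
    return [sample for sample in samples if matches(sample)]


neurons = search_samples(SAMPLES, cell_type="neuron")
human_neurons = search_samples(SAMPLES, cell_type="neuron", organism="homo sapiens")
print([s["cell_id"] for s in neurons])
print([s["cell_id"] for s in human_neurons])
```

A single criterion returns a group of cells; adding criteria narrows the result, down to one cell if the combination is unique.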
For each single cell we provide detailed information and files for display and download (http://single-cell.clst.riken.jp/non_riken_data/Single_cell_samples1_list.php):
1- Sequence data: for each single cell we provide a link to the raw sequence data in SRA or FASTQ format for download.
2- Mapping QC: we developed a model for calculating genomic contamination in each single-cell sample. The mapping QC information includes the number of input reads, % of uniquely mapped reads, % of reads mapped to multiple loci, % of reads mapped to too many loci, and the assigned read rate.
3- FASTQC report: we used the FASTQC tool to provide quality control checks on the raw sequence data generated from each single cell. For each single cell the full FASTQC report is provided for preview and download.
4- Library preparation information: we manually curated the detailed information about the library preparation protocol, kits and the method used for extracting single cells.
5- BAM files: we used the STAR genome aligner to map the sequence data of each single cell to the reference genome (hg38 / mm10). The resulting BAM files are provided for download for each single cell or for each study.
6- Ontology annotation: we assigned an ontology term to each single cell by selecting one term from [Cell Ontology, Cell Line Ontology], and we used the EBI Ontology Lookup web service to visualize the ontology tree.
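The mapping QC fields listed in item 2 match the summary that STAR writes after alignment. The sketch below assumes the QC values are taken from a summary in the style of STAR's Log.final.out; the guide does not state this, and the numbers are invented. The assigned read rate is left out because the guide does not say how it is computed.

```python
# Invented excerpt in the style of STAR's Log.final.out ("label |<TAB>value").
EXAMPLE_LOG = """\
                          Number of input reads |\t1000000
                        Uniquely mapped reads % |\t85.20%
             % of reads mapped to multiple loci |\t6.30%
             % of reads mapped to too many loci |\t0.10%
"""


def parse_star_log(text):
    """Collect 'label | value' lines into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers and blank lines carry no value
        label, value = line.split("|", 1)
        fields[label.strip()] = value.strip()
    return fields


def mapping_qc(text):
    """Extract the mapping QC values named in the guide as numbers."""
    fields = parse_star_log(text)

    def percent(label):
        return float(fields[label].rstrip("%"))

    return {
        "input_reads": int(fields["Number of input reads"]),
        "uniquely_mapped_pct": percent("Uniquely mapped reads %"),
        "multi_mapped_pct": percent("% of reads mapped to multiple loci"),
        "too_many_loci_pct": percent("% of reads mapped to too many loci"),
    }


qc = mapping_qc(EXAMPLE_LOG)
print(qc)
```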
Based on the generated expression table, the user can explore the gene expression correlation matrix for pairs of single cells (see Figure 5).
Figure 5: Gene expression correlation matrix.
The dots in the right panel are shown only when the user clicks on a cell-to-cell gene expression entry (left panel). Since this is a graphic matrix, it takes a few seconds to display the log2 gene expression for each specific gene. Each dot in the right panel of the figure represents the log2 expression of one gene. When the user hovers over a dot, the Ensembl gene ID is displayed.
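A cell-to-cell correlation matrix of this kind can be recomputed from an expression table. The sketch below makes two assumptions that the guide does not confirm: expression is transformed as log2(TPM + 1), and the correlation is Pearson's. The cell names and TPM values are invented.

```python
import math


def log2_expression(tpm_values, pseudocount=1.0):
    """Transform TPM values to log2(TPM + pseudocount); the pseudocount is an assumption."""
    return [math.log2(value + pseudocount) for value in tpm_values]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(profiles):
    """profiles: cell_id -> list of TPM values, in the same gene order for every cell."""
    cells = sorted(profiles)
    logged = {cell: log2_expression(profiles[cell]) for cell in cells}
    return {a: {b: pearson(logged[a], logged[b]) for b in cells} for a in cells}


# Invented TPM profiles over four genes.
PROFILES = {
    "cell_A": [0.0, 1.0, 3.0, 7.0],
    "cell_B": [0.0, 1.0, 3.0, 7.0],
    "cell_C": [7.0, 3.0, 1.0, 0.0],
}
matrix = correlation_matrix(PROFILES)
print(matrix["cell_A"]["cell_B"], matrix["cell_A"]["cell_C"])
```

Each entry of the matrix corresponds to one pair of cells; the per-gene log2 values of that pair are what the dots in the right panel display.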
DAVID functional annotation
Functional annotation was performed with DAVID; we used the API provided by the DAVID tool. The tool produces two types of document, the annotation report and the annotation chart; see http://single-cell.clst.riken.jp/non_riken_data/DAVID_Functional_Annotations_list.php
Search for the expression of gene / set of genes
As illustrated before, the user can search for the expression of a gene or a set of genes in each dataset and see the details of the sample(s) that express them.
We also provide for bulk download of BAM files, expression tables and FASTQC reports. For example, for the BAM files we provide a list of URLs that can be used with the Linux wget command:
wget -r -nH --cut-dirs=2 --no-parent --reject="index.html*" http://single-cell.clst.riken.jp/hg38_bam/hg38_DRA001287/DRR015133.Aligned.sortedByCoord.out.bam
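When many BAM files are needed, the URL list can be generated rather than typed. The sketch below assumes that every BAM URL follows the same pattern as the single example above (genome, study accession, run accession); that pattern is inferred from one URL and is not documented by the portal, and the helper names are hypothetical.

```python
BASE_URL = "http://single-cell.clst.riken.jp"


def bam_url(genome, study_accession, run_accession):
    """Build a BAM URL following the pattern of the example in this guide (assumed)."""
    return (
        f"{BASE_URL}/{genome}_bam/{genome}_{study_accession}/"
        f"{run_accession}.Aligned.sortedByCoord.out.bam"
    )


def url_list(genome, study_accession, run_accessions):
    """Return one URL per line, suitable for saving to a file for wget -i."""
    return "\n".join(bam_url(genome, study_accession, run) for run in run_accessions)


example = bam_url("hg38", "DRA001287", "DRR015133")
print(example)
print(url_list("hg38", "DRA001287", ["DRR015133"]))
```

Saving the output of url_list to a text file and passing that file to wget with the -i option downloads the listed files in one run.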
Division of Genomic Technologies
RIKEN Center for Life Science Technologies
1-7-22 Suehiro-chō, Tsurumi-ku, Yokohama, Kanagawa, 230-0045 JAPAN
Tel: +81-45-503-9245, Fax: +81-45-503-9216