BayesPrism Gateway

BayesPrism Documentation

1    Login
Log in by clicking the 'Log in' link at the top-right corner of the page. Creating an account is free and easy, and provides a number of benefits.

Figure 1: Login page

2    Create a new experiment
Select the BayesPrism application on the dashboard panel to create a new analysis for your data, as shown in the following screenshot (Figure 2).

Figure 2: BayesPrism dashboard

3    Set experiment name
Rename the experiment in the Experiment Name field, and click "Add a description" to comment on the experimental setup (optional). Choose the project that the experiment belongs to. By default, the "Default Project" is created and used.

Figure 3: Start new BayesPrism experiment

4    Upload count matrix files
BayesPrism needs two types of count matrix files: the bulk RNA-seq count matrix and the reference count matrix. The gateway currently supports importing data from several formats, such as tsv, xls, rds/dataframe, rds/seurat, and h5ad. For details, please check the input tab of this page.
The gateway provides two ways to upload count matrix files. (1) Click "Select files from storage" to choose existing files submitted for previous tasks, or (2) click "Drop files here or browse" to upload new files from the user's storage.
Note:
(1) Each row of a count matrix represents one unique gene ID, so the bulk and reference count matrices should contain the same gene set.
(2) Count matrices should not be normalized.
(3) At least 50 reads for each cell type are suggested.
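
The notes above can be checked locally before uploading. The following is a minimal sketch, assuming both matrices are TSV files with one header line, genes in rows, and the gene ID in the first column. The function names `read_tsv_matrix` and `check_matrices` are illustrative and are not part of BayesPrism or the gateway.

```python
import csv
import io


def read_tsv_matrix(handle):
    """Read a genes-by-columns TSV into {gene_id: [values]}, skipping the header line."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # header line with sample or cell names
    rows = {}
    for fields in reader:
        if fields:  # ignore blank lines
            rows[fields[0]] = [float(x) for x in fields[1:]]
    return rows


def check_matrices(bulk_rows, ref_rows):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    if set(bulk_rows) != set(ref_rows):
        only_bulk = len(set(bulk_rows) - set(ref_rows))
        only_ref = len(set(ref_rows) - set(bulk_rows))
        problems.append(
            f"gene sets differ: {only_bulk} only in bulk, {only_ref} only in reference"
        )
    for name, rows in (("bulk", bulk_rows), ("reference", ref_rows)):
        values = [v for vals in rows.values() for v in vals]
        # Raw counts are non-negative whole numbers; anything else hints at normalization.
        if any(v < 0 or not v.is_integer() for v in values):
            problems.append(
                f"{name} matrix has negative or non-integer values; it may be normalized"
            )
    return problems


# Tiny in-memory example with made-up numbers, for illustration only.
bulk_tsv = "gene\tS1\tS2\nG1\t10\t0\nG2\t3\t7\n"
ref_tsv = "gene\tC1\tC2\nG1\t1\t0\nG2\t0.5\t2\n"
bulk = read_tsv_matrix(io.StringIO(bulk_tsv))
ref = read_tsv_matrix(io.StringIO(ref_tsv))
for problem in check_matrices(bulk, ref):
    print(problem)
```

In this example the reference matrix contains the value 0.5, so the check reports that it may be normalized.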

Figure 4: Upload count matrix files

5    Set computing parameters
(1) Specify the species so that ribosomal, mitochondrial, chrX, and chrY genes can be removed. For other species, users need to remove these genes manually.
(2) Specify the cell type and tumor state for each cell in the reference count matrix using CSV format. Four columns are defined: cell_id, cell_type, cell_subtype, and tumor_state. The tumor state should be 0 (non-tumor) or 1 (tumor).
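
As an illustration of the CSV layout in item (2), the sketch below writes a small metadata file with Python's standard csv module. The four column names come from the description above. The presence of a header row, the cell IDs, the labels, and the helper name `write_cell_metadata` are assumptions made for the example.

```python
import csv
import io

COLUMNS = ["cell_id", "cell_type", "cell_subtype", "tumor_state"]


def write_cell_metadata(handle, records):
    """Write cell metadata rows, rejecting tumor_state values other than 0 or 1."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        if record["tumor_state"] not in (0, 1):
            raise ValueError(f"tumor_state must be 0 or 1 for cell {record['cell_id']}")
        writer.writerow(record)


# Hypothetical cells; replace with the cell IDs used in your reference matrix.
records = [
    {"cell_id": "cell_001", "cell_type": "tumor", "cell_subtype": "tumor_A", "tumor_state": 1},
    {"cell_id": "cell_002", "cell_type": "T_cell", "cell_subtype": "CD8", "tumor_state": 0},
]
buffer = io.StringIO()
write_cell_metadata(buffer, records)
print(buffer.getvalue())
```

To produce a file instead of an in-memory string, pass a handle opened with `open("cells.csv", "w", newline="")`; the file name is only an example.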
(3) Specify the prefix of the output files. This can help distinguish results from multiple experiments.

Figure 5: Set computing parameters

6    Submit the job
Once steps 1-5 are finished, click "save and launch". The input data and parameters will be submitted to the computing node of the ACCESS cluster via the BayesPrism gateway server. Check the box next to "Receive email notification of experiment status" if needed. Upon launching, users are directed to the "Experiments" page, shown in Figure 6. A typical experiment usually finishes within 4 hours.

7    Check the status
Users may view the progress by logging in and clicking the "Experiments" button on the left control panel of the dashboard. All submitted experiments are listed on this page.

Figure 6: Check the experiment status

8    Check the results
Once a job is completed, the user can click the selected BayesPrism experiment, and the website will open the Experiment Summary page. All parameters used to set up the experiment are listed on this page. The user can also access the BayesPrism output files stored in the ARCHIVE: click ARCHIVE to check any single result file. A compressed file containing the input count matrix files, two task log files, and all result files is also provided; click the Download Zip button to download it. A downloaded file with the 'tar.gz' extension can be decompressed with the 'tar' command, and a file with the 'gz' extension with the 'gunzip' command, on Linux.
Downloads in Safari can be problematic because Safari tries to unzip compressed results automatically using an incompatible method. Please disable this feature in Safari.
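
For users who prefer Python to the command line, the same decompression can be sketched with the standard library. This is an illustrative equivalent of the 'tar' and 'gunzip' commands mentioned above, not a gateway feature. The function name `unpack_download` is hypothetical.

```python
import gzip
import shutil
import tarfile
from pathlib import Path


def unpack_download(path, dest_dir):
    """Unpack a .tar.gz archive or a single .gz file into dest_dir; return the created paths."""
    path, dest_dir = Path(path), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    if path.name.endswith(".tar.gz"):
        # Equivalent of: tar -xzf <path> -C <dest_dir>
        with tarfile.open(path, "r:gz") as archive:
            names = archive.getnames()
            archive.extractall(dest_dir)
        return [dest_dir / name for name in names]
    if path.name.endswith(".gz"):
        # Equivalent of: gunzip, writing the result without the .gz suffix
        target = dest_dir / path.name[: -len(".gz")]
        with gzip.open(path, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        return [target]
    raise ValueError(f"unsupported file extension: {path.name}")
```

A call such as `unpack_download("results.tar.gz", "results")` would extract the archive into a `results` directory; both names are placeholders for the file you downloaded and the folder you want.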

Figure 7: BayesPrism Archive

9    Tutorial: Bulk RNA-seq deconvolution using BayesPrism
The tutorial covers loading the BayesPrism package, loading the dataset, QC of cell type and state labels, filtering outlier genes, constructing a prism object, running BayesPrism, extracting results, and downstream analysis.

The input to BayesPrism consists of two matrices: the raw read counts of the bulk samples and the single-cell RNA-seq reference. The reference can be supplied either as a cell-by-gene raw count matrix (Reference Data Type=count matrix) or as a user-generated cell state-by-gene expression profile (Reference Data Type=GEP). The gateway accepts a count matrix or GEP describing the scRNA-seq reference that was exported from other single-cell packages, such as Seurat and CellRanger. The details of the data format for the input matrices are described below.

1   Input Matrices

  • The bulk expression matrix is in the format of genes (rows) by sample IDs (columns) (see the matBulk in Figure 1).
  • If the input of the single cell reference is provided as a raw count matrix, it should be in the format of genes (rows) by cell IDs (columns) (see the matRef in Figure 1). Alternatively, if the input of the single cell reference is provided as a GEP, it should be in the format of genes (rows) by cell states (columns).
  • The gateway accepts the following formats for these matrices: RDS, TSV, XLS, and h5ad.
    Figure 1: Data format of input variables in R

    In addition, the users should note that:

     (1) The bulk matrix and the reference matrix should use the same gene annotation. BayesPrism will perform deconvolution over the genes shared between these two matrices.

     (2) Raw read counts are always preferred, as BayesPrism models the counts directly. If raw counts are not available, BayesPrism is also robust to linearly transformed data, such as CPM, RPKM, and TPM. Log-transformed data should be avoided.

     (3) We recommend representing each cell state using at least 20 to 50 cells (depending on the library size of the data).

     (4) GEP can be generated by summing raw counts for each cell state.
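
Note (4) can be illustrated with a short sketch: the raw counts of all cells that share a cell state are added up gene by gene. The dictionary-based layout, the function name `build_gep`, and the example values are assumptions made for this illustration and are not the gateway's own implementation.

```python
from collections import defaultdict


def build_gep(counts, cell_state):
    """Sum raw counts over the cells of each state.

    counts: {gene_id: {cell_id: raw_count}} (genes by cells)
    cell_state: {cell_id: state_label}
    Returns {gene_id: {state_label: summed_count}} (genes by cell states).
    """
    gep = {}
    for gene, per_cell in counts.items():
        totals = defaultdict(int)
        for cell, count in per_cell.items():
            totals[cell_state[cell]] += count
        gep[gene] = dict(totals)
    return gep


# Made-up counts for three cells in two hypothetical states.
counts = {
    "G1": {"c1": 2, "c2": 3, "c3": 0},
    "G2": {"c1": 0, "c2": 1, "c3": 5},
}
cell_state = {"c1": "stateA", "c2": "stateA", "c3": "stateB"}
print(build_gep(counts, cell_state))
# {'G1': {'stateA': 5, 'stateB': 0}, 'G2': {'stateA': 1, 'stateB': 5}}
```

The result keeps genes in rows and cell states in columns, matching the GEP orientation described above.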


    Data Format | Used For | Description
    TSV | Bulk, Reference | A tab-separated values file containing read counts for each gene (row) in each bulk sample or single cell (column). BayesPrism takes the TSV header as column names and the first column as row names.
    XLS | Bulk, Reference | An Excel file containing read counts for each gene (row) in each bulk sample or single cell (column). BayesPrism takes the first row as column names and the first column as row names.
    RDS/dataframe | Bulk, Reference | An RDS file of an R data frame containing read counts for each gene (row) in each bulk sample or single cell (column). BayesPrism requires the data frame to have rownames and colnames.
    RDS/sce | Reference | An RDS file of a SingleCellExperiment object representing read counts for each gene (row) in each single cell (column).
    RDS/seurat | Reference | An RDS file of a Seurat object representing single-cell expression data. Each Seurat object revolves around a set of cells and consists of one or more Assay objects.
    h5ad | Reference | Hierarchical Data Format version 5 (HDF5) is used to store both the expression values and the associated annotations on the genes and cells in Python. The H5AD format can be read into R as a SingleCellExperiment.

    * GEP only supports TSV, XLS, and RDS/dataframe.


    2   Cell Metadata

    If the reference matrix does not contain the cell type and tumor state for each cell, users must provide a CSV file that specifies them. The CSV illustrated in Figure 1 (cellprofile) has four columns: cell ID, cell type, cell subtype, and tumor state (0 for normal or 1 for tumor).
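
Before uploading, it can help to confirm that every cell in the reference matrix has a row in the metadata CSV. The sketch below assumes the CSV has a header row whose first column is named `cell_id`. The helper name `missing_metadata` and the example cell IDs are hypothetical.

```python
import csv
import io


def missing_metadata(reference_cell_ids, metadata_handle):
    """Return the reference cell IDs that have no row in the metadata CSV."""
    reader = csv.DictReader(metadata_handle)
    annotated = {row["cell_id"] for row in reader}
    return [cell for cell in reference_cell_ids if cell not in annotated]


# Made-up metadata and cell IDs, for illustration only.
metadata_csv = (
    "cell_id,cell_type,cell_subtype,tumor_state\n"
    "cell_001,tumor,tumor_A,1\n"
    "cell_002,T_cell,CD8,0\n"
)
reference_cells = ["cell_001", "cell_002", "cell_003"]
print(missing_metadata(reference_cells, io.StringIO(metadata_csv)))  # ['cell_003']
```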

    3   Species

    BayesPrism removes ribosomal, mitochondrial, chrX, and chrY genes before deconvolution. For deconvolution using bulk and reference data from unmatched sexes, we recommend that users exclude genes from chrX and chrY.

    If the gene annotation is not human or mouse, users need to remove these genes manually.
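
For other species, the manual removal can be sketched as a simple row filter over a TSV count matrix. The set of gene IDs to remove has to come from the user's own annotation. The function name `drop_genes` and the gene IDs in the example are placeholders, not real annotations.

```python
import csv
import io


def drop_genes(tsv_in, tsv_out, genes_to_remove):
    """Copy a genes-by-columns TSV, skipping rows whose gene ID is in genes_to_remove.

    Returns the number of rows removed.
    """
    reader = csv.reader(tsv_in, delimiter="\t")
    writer = csv.writer(tsv_out, delimiter="\t", lineterminator="\n")
    writer.writerow(next(reader))  # the header line is copied unchanged
    removed = 0
    for fields in reader:
        if fields and fields[0] in genes_to_remove:
            removed += 1
        else:
            writer.writerow(fields)
    return removed


# Made-up matrix; "geneB" stands in for a gene the user wants to exclude.
matrix = "gene\tS1\tS2\ngeneA\t4\t9\ngeneB\t0\t2\ngeneC\t7\t1\n"
filtered = io.StringIO()
n_removed = drop_genes(io.StringIO(matrix), filtered, {"geneB"})
print(n_removed)  # 1
print(filtered.getvalue())
```

The same exclusion set should be applied to both the bulk and the reference matrix so that they keep a shared gene set.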

    1   BayesPrism output files

    BayesPrism generates an RDATA file ($PREFIX.rdata) for R users and a compressed file ($PREFIX.tar.gz) for Python users.

    R users can open the RDATA file with the "load" command. Python users need to extract the RDS files (see the following table) using the decompression command "tar -xvzf" on Linux.

    Note: All files below are stored in the "ARCHIVE" directory.

    File name | Description
    $PREFIX.rdata | This Rdata file contains the 'bp.res' object, which can be explored with the 'str' command. The following table shows all its contents.
    $PREFIX.tar.gz | The compressed file contains multiple RDS files, which represent the items of the 'bp.res' object.
    $plot.tar.gz | The compressed file contains all the plots that you may need. All plots are shown in the third part.
    $out.bk.vs.sc.pdf | The plot indicates the concordance of gene expression for different types of genes. Note that this only works for human data. For other species, you are advised to make the plots yourself.

    2   Contents in $PREFIX.rdata

    Name | Description
    bp@prism | The input prism.
    bp@prism@phi_cellState@phi | The expression matrix in the format of cell states (rows) by genes (columns).
    bp@prism@phi_cellType@phi | The expression matrix in the format of cell types (rows) by genes (columns).
    bp@prism@map | The information on all the cell types and cell states.
    bp@prism@mixture | The mean count of gene expression in each bulk sample.
    bp@posterior.initial.cellState | The results of step 2.
    bp@posterior.initial.cellState@Z | The estimated mean of the posterior read count for each cell state in each bulk sample.
    bp@posterior.initial.cellState@theta | The initial estimate of the fraction of each cell state in each bulk sample.
    bp@posterior.initial.cellState@theta.cv | The coefficient of variation (CV) of the cell state fraction.
    bp@posterior.initial.cellType | The results of step 3.
    bp@posterior.initial.cellType@Z | The estimated mean of the posterior read count for each cell type in each bulk sample.
    bp@posterior.initial.cellType@theta | The initial estimate of the fraction of each cell type in each bulk sample.
    bp@posterior.initial.cellType@theta.cv | The coefficient of variation (CV) of the cell type fraction.
    bp@reference.update | The updated reference ψ.
    bp@reference.update@psi_mal | The gene expression profile of each tumor sample.
    bp@reference.update@psi_env | The gene expression profile of each non-tumor sample.
    bp@posterior.theta_f | The results of step 4.
    bp@posterior.theta_f@theta | The final estimate of the fraction of each cell type.
    bp@posterior.theta_f@theta.cv | The coefficient of variation (CV) of the cell type fraction.
    bp@control_param | The parameters used to run BayesPrism.

    3   Read RDS results in Python.

    Python users can use 'pyreadr' to read RDS files (https://stackoverflow.com/questions/40996175/loading-a-rds-file-in-pandas).

    Here we briefly show how to read it in Python.

    import pyreadr

    result = pyreadr.read_r('bp.posterior.initial.cellType.theta.rds')

    # Extract the pandas data frame. An RDS file holds a single object, stored under the key None.
    df = result[None]

    4   Tutorial and downstream analysis, please click this link.