Simulate new single-cell RNA-Seq expression profiles using the ZINB-WaVE model parameters

Simulate single-cell expression profiles by randomly sampling from a negative binomial distribution and inserting dropouts by sampling from a binomial distribution using the ZINB-WaVE parameters estimated by the estimateZinbwaveParams function.

Usage

simSCProfiles(
  object,
  cell.ID.column,
  cell.type.column,
  n.cells,
  suffix.names = "_Simul",
  cell.types = NULL,
  file.backend = NULL,
  name.dataset.backend = NULL,
  compression.level = NULL,
  block.processing = FALSE,
  block.size = 1000,
  chunk.dims = NULL,
  verbose = TRUE
)

Arguments

object: SpatialDDLS object with single.cell.real and zinb.params slots.
cell.ID.column: Name or column number corresponding to the cell names of expression matrix in cells metadata.
cell.type.column: Name or column number corresponding to the cell type of each cell in cells metadata.
n.cells: Number of simulated cells generated per cell type (i.e. if you have 10 different cell types in your dataset, if n.cells = 100, then 1000 cell profiles will be simulated).
suffix.names: Suffix used on simulated cells. This suffix must be unique in the simulated cells, so make sure that this suffix does not appear in the real cell names.
cell.types: Vector indicating the cell types to simulate. If NULL (by default), n.cells single-cell profiles for all cell types will be simulated.
file.backend: Valid file path to store the simulated single-cell expression profiles as an HDF5 file (NULL by default). If provided, the data are stored in HDF5 files used as back-end by using the DelayedArray, HDF5Array and rhdf5 packages instead of loading all data into RAM memory. This is suitable for situations where you have large amounts of data that cannot be loaded into memory. Note that operations on this data will be performed in blocks (i.e subsets of determined size) which may result in longer execution times.
name.dataset.backend: Name of the dataset in HDF5 file to be used. Note that it cannot exist. If NULL (by default), a random dataset name will be used.
compression.level: The compression level used if file.backend is provided. It is an integer value between 0 (no compression) and 9 (highest and slowest compression). See ?getHDF5DumpCompressionLevel from the HDF5Array package for more information.
block.processing: Boolean indicating whether the data should be simulated in blocks (only if file.backend is used, FALSE by default). This functionality is suitable for cases where is not possible to load all data into memory and it leads to larger execution times.
block.size: Only if block.processing = TRUE. Number of single-cell expression profiles that will be simulated in each iteration during the process. Larger numbers result in higher memory usage but shorter execution times. Set according to available computational resources (1000 by default). Note that it cannot be greater than the total number of simulated cells.
chunk.dims: Specifies the dimensions that HDF5 chunk will have. If NULL, the default value is a vector of two items: the number of genes considered by the ZINB-WaVE model during the simulation and a single sample in order to reduce read times in the following steps. A larger number of columns written in each chunk can lead to longer read times in subsequent steps. Note that it cannot be greater than the dimensions of the simulated matrix.
verbose: Show informative messages during the execution (TRUE by default).

Value

A SpatialDDLS object with single.cell.simul slot containing a SingleCellExperiment object with the simulated single-cell expression profiles.

Details

Before this step, see ?estimateZinbwaveParams. As described in Torroja and Sanchez-Cabo, 2019, this function simulates a given number of transcriptional profiles for each cell type provided by randomly sampling from a negative binomial distribution with \(\mu\) and \(\theta\) estimated parameters and inserting dropouts by sampling from a binomial distribution with probability pi. All parameters are estimated from single-cell real data using the estimateZinbwaveParams function. It uses the ZINB-WaVE model (Risso et al., 2018). For more details about the model, see ?estimateZinbwaveParams and Risso et al., 2018.

The file.backend argument allows to create a HDF5 file with simulated single-cell profiles to be used as back-end to work with data stored on disk instead of loaded into RAM. If the file.backend argument is used with block.processing = FALSE, all the single-cell profiles will be simulated in one step and, therefore, loaded into in RAM memory. Then, data will be written in HDF5 file. To avoid to collapse RAM memory if too many single-cell profiles are goin to be simulated, single-cell profiles can be simulated and written to HDF5 files in blocks of block.size size by setting block.processing = TRUE.

References

Risso, D., Perraudeau, F., Gribkova, S. et al. (2018). A general and flexible method for signal extraction from single-cell RNA-seq data. Nat Commun 9, 284. doi: doi:10.1038/s41467-017-02554-5 .

Torroja, C. and Sánchez-Cabo, F. (2019). digitalDLSorter: A Deep Learning algorithm to quantify immune cell populations based on scRNA-Seq data. Frontiers in Genetics 10, 978. doi: doi:10.3389/fgene.2019.00978 .

Examples

set.seed(123) # reproducibility
sce <- SingleCellExperiment::SingleCellExperiment(
  assays = list(
    counts = matrix(
      rpois(30, lambda = 5), nrow = 15, ncol = 10,
      dimnames = list(paste0("Gene", seq(15)), paste0("RHC", seq(10)))
    )
  ),
  colData = data.frame(
    Cell_ID = paste0("RHC", seq(10)),
    Cell_Type = sample(x = paste0("CellType", seq(2)), size = 10,
                       replace = TRUE)
  ),
  rowData = data.frame(
    Gene_ID = paste0("Gene", seq(15))
  )
)
SDDLS <- createSpatialDDLSobject(
  sc.data = sce,
  sc.cell.ID.column = "Cell_ID",
  sc.gene.ID.column = "Gene_ID",
  sc.filt.genes.cluster = FALSE,
  project = "Simul_example"
)
#> === Spatial transcriptomics data not provided
#> === Processing single-cell data
#>       - Filtering features:
#>          - Selected features: 15
#>          - Discarded features: 0
#> 
#> === No mitochondrial genes were found by using ^mt- as regrex
#> 
#> === Final number of dimensions for further analyses: 15
SDDLS <- estimateZinbwaveParams(
  object = SDDLS,
  cell.type.column = "Cell_Type",
  cell.ID.column = "Cell_ID",
  gene.ID.column = "Gene_ID",
  subset.cells = 2,
  verbose = TRUE
)
#> === Setting parallel environment to 1 thread(s)
#> === Estimating parameters for all cell types in the experiment
#> === Creating cell model matrix based on Cell_Type columns:
#> 	~Cell_Type
#> === Number of cells for each cell type:
#>     - CellType1: 1
#>     - CellType2: 1
#> === Creating gene model matrix without gene covariates
#> === Running estimation process (Start time 16:03:52) 
#> === Removing genes without expression in any cell
#> >>> Fitting ZINB-WaVE model
#> Create model:
#> ok
#> Initialize parameters:
#> ok
#> Optimize parameters:
#> Iteration 1
#> penalized log-likelihood = -83.120329137823
#> After dispersion optimization = -56.133558863131
#>    user  system elapsed 
#>   0.027   0.000   0.027 
#> After right optimization = -54.6057641481247
#> After orthogonalization = -54.6057641481247
#>    user  system elapsed 
#>   0.012   0.000   0.012 
#> After left optimization = -54.6055360468243
#> After orthogonalization = -54.6055360468243
#> Iteration 2
#> penalized log-likelihood = -54.6055360468243
#> After dispersion optimization = -54.6055360468243
#>    user  system elapsed 
#>    0.02    0.00    0.02 
#> After right optimization = -54.605527995469
#> After orthogonalization = -54.605527995469
#>    user  system elapsed 
#>   0.008   0.000   0.008 
#> After left optimization = -54.6055197876494
#> After orthogonalization = -54.6055197876494
#> Iteration 3
#> penalized log-likelihood = -54.6055197876494
#> ok
#> 
#> DONE
#> 
#> Invested time: 0.15
SDDLS <- simSCProfiles(
  object = SDDLS,
  cell.ID.column = "Cell_ID",
  cell.type.column = "Cell_Type",
  n.cells = 2,
  verbose = TRUE
)
#> === Getting parameters from model:
#>     - mu: 2, 15
#>     - pi: 2, 15
#>     - Theta: 15
#> === Selected cell type(s) from ZINB-WaVE model (2 cell type(s)):
#>     - CellType2
#>     - CellType1
#> === Simulated matrix dimensions:
#>     - n (cells): 4
#>     - J (genes): 15
#>     - i (# entries): 60
#> 
#> DONE