Simulate new single-cell RNA-Seq expression profiles using the ZINB-WaVE model parameters
Source:R/simSingleCell.R
simSCProfiles.Rd
Simulate single-cell expression profiles by randomly sampling from a negative
binomial distribution and inserting dropouts by sampling from a binomial
distribution using the ZINB-WaVE parameters estimated by the
estimateZinbwaveParams
function.
Usage
simSCProfiles(
object,
cell.ID.column,
cell.type.column,
n.cells,
suffix.names = "_Simul",
cell.types = NULL,
file.backend = NULL,
name.dataset.backend = NULL,
compression.level = NULL,
block.processing = FALSE,
block.size = 1000,
chunk.dims = NULL,
verbose = TRUE
)
Arguments
- object
SpatialDDLS
object withsingle.cell.real
andzinb.params
slots.- cell.ID.column
Name or column number corresponding to the cell names of expression matrix in cells metadata.
- cell.type.column
Name or column number corresponding to the cell type of each cell in cells metadata.
- n.cells
Number of simulated cells generated per cell type (i.e. if you have 10 different cell types in your dataset, if
n.cells = 100
, then 1000 cell profiles will be simulated).- suffix.names
Suffix used on simulated cells. This suffix must be unique in the simulated cells, so make sure that this suffix does not appear in the real cell names.
- cell.types
Vector indicating the cell types to simulate. If
NULL
(by default),n.cells
single-cell profiles for all cell types will be simulated.- file.backend
Valid file path to store the simulated single-cell expression profiles as an HDF5 file (
NULL
by default). If provided, the data are stored in HDF5 files used as back-end by using the DelayedArray, HDF5Array and rhdf5 packages instead of loading all data into RAM memory. This is suitable for situations where you have large amounts of data that cannot be loaded into memory. Note that operations on this data will be performed in blocks (i.e subsets of determined size) which may result in longer execution times.- name.dataset.backend
Name of the dataset in HDF5 file to be used. Note that it cannot exist. If
NULL
(by default), a random dataset name will be used.- compression.level
The compression level used if
file.backend
is provided. It is an integer value between 0 (no compression) and 9 (highest and slowest compression). See?getHDF5DumpCompressionLevel
from the HDF5Array package for more information.- block.processing
Boolean indicating whether the data should be simulated in blocks (only if
file.backend
is used,FALSE
by default). This functionality is suitable for cases where is not possible to load all data into memory and it leads to larger execution times.- block.size
Only if
block.processing = TRUE
. Number of single-cell expression profiles that will be simulated in each iteration during the process. Larger numbers result in higher memory usage but shorter execution times. Set according to available computational resources (1000 by default). Note that it cannot be greater than the total number of simulated cells.- chunk.dims
Specifies the dimensions that HDF5 chunk will have. If
NULL
, the default value is a vector of two items: the number of genes considered by the ZINB-WaVE model during the simulation and a single sample in order to reduce read times in the following steps. A larger number of columns written in each chunk can lead to longer read times in subsequent steps. Note that it cannot be greater than the dimensions of the simulated matrix.- verbose
Show informative messages during the execution (
TRUE
by default).
Value
A SpatialDDLS
object with
single.cell.simul
slot containing a
SingleCellExperiment
object with the simulated
single-cell expression profiles.
Details
Before this step, see ?estimateZinbwaveParams
. As described in
Torroja and Sanchez-Cabo, 2019, this function simulates a given number of
transcriptional profiles for each cell type provided by randomly sampling
from a negative binomial distribution with \(\mu\) and \(\theta\)
estimated parameters and inserting dropouts by sampling from a binomial
distribution with probability pi. All parameters are estimated from
single-cell real data using the estimateZinbwaveParams
function. It uses the ZINB-WaVE model (Risso et al., 2018). For more details
about the model, see ?estimateZinbwaveParams
and Risso et al.,
2018.
The file.backend
argument allows to create a HDF5 file with simulated
single-cell profiles to be used as back-end to work with data stored on disk
instead of loaded into RAM. If the file.backend
argument is used with
block.processing = FALSE
, all the single-cell profiles will be
simulated in one step and, therefore, loaded into in RAM memory. Then, data
will be written in HDF5 file. To avoid to collapse RAM memory if too many
single-cell profiles are goin to be simulated, single-cell profiles can be
simulated and written to HDF5 files in blocks of block.size
size by
setting block.processing = TRUE
.
References
Risso, D., Perraudeau, F., Gribkova, S. et al. (2018). A general and flexible method for signal extraction from single-cell RNA-seq data. Nat Commun 9, 284. doi: doi:10.1038/s41467-017-02554-5 .
Torroja, C. and Sánchez-Cabo, F. (2019). digitalDLSorter: A Deep Learning algorithm to quantify immune cell populations based on scRNA-Seq data. Frontiers in Genetics 10, 978. doi: doi:10.3389/fgene.2019.00978 .
Examples
set.seed(123) # reproducibility
sce <- SingleCellExperiment::SingleCellExperiment(
assays = list(
counts = matrix(
rpois(30, lambda = 5), nrow = 15, ncol = 10,
dimnames = list(paste0("Gene", seq(15)), paste0("RHC", seq(10)))
)
),
colData = data.frame(
Cell_ID = paste0("RHC", seq(10)),
Cell_Type = sample(x = paste0("CellType", seq(2)), size = 10,
replace = TRUE)
),
rowData = data.frame(
Gene_ID = paste0("Gene", seq(15))
)
)
SDDLS <- createSpatialDDLSobject(
sc.data = sce,
sc.cell.ID.column = "Cell_ID",
sc.gene.ID.column = "Gene_ID",
sc.filt.genes.cluster = FALSE,
project = "Simul_example"
)
#> === Spatial transcriptomics data not provided
#> === Processing single-cell data
#> - Filtering features:
#> - Selected features: 15
#> - Discarded features: 0
#>
#> === No mitochondrial genes were found by using ^mt- as regrex
#>
#> === Final number of dimensions for further analyses: 15
SDDLS <- estimateZinbwaveParams(
object = SDDLS,
cell.type.column = "Cell_Type",
cell.ID.column = "Cell_ID",
gene.ID.column = "Gene_ID",
subset.cells = 2,
verbose = TRUE
)
#> === Setting parallel environment to 1 thread(s)
#> === Estimating parameters for all cell types in the experiment
#> === Creating cell model matrix based on Cell_Type columns:
#> ~Cell_Type
#> === Number of cells for each cell type:
#> - CellType1: 1
#> - CellType2: 1
#> === Creating gene model matrix without gene covariates
#> === Running estimation process (Start time 16:03:52)
#> === Removing genes without expression in any cell
#> >>> Fitting ZINB-WaVE model
#> Create model:
#> ok
#> Initialize parameters:
#> ok
#> Optimize parameters:
#> Iteration 1
#> penalized log-likelihood = -83.120329137823
#> After dispersion optimization = -56.133558863131
#> user system elapsed
#> 0.027 0.000 0.027
#> After right optimization = -54.6057641481247
#> After orthogonalization = -54.6057641481247
#> user system elapsed
#> 0.012 0.000 0.012
#> After left optimization = -54.6055360468243
#> After orthogonalization = -54.6055360468243
#> Iteration 2
#> penalized log-likelihood = -54.6055360468243
#> After dispersion optimization = -54.6055360468243
#> user system elapsed
#> 0.02 0.00 0.02
#> After right optimization = -54.605527995469
#> After orthogonalization = -54.605527995469
#> user system elapsed
#> 0.008 0.000 0.008
#> After left optimization = -54.6055197876494
#> After orthogonalization = -54.6055197876494
#> Iteration 3
#> penalized log-likelihood = -54.6055197876494
#> ok
#>
#> DONE
#>
#> Invested time: 0.15
SDDLS <- simSCProfiles(
object = SDDLS,
cell.ID.column = "Cell_ID",
cell.type.column = "Cell_Type",
n.cells = 2,
verbose = TRUE
)
#> === Getting parameters from model:
#> - mu: 2, 15
#> - pi: 2, 15
#> - Theta: 15
#> === Selected cell type(s) from ZINB-WaVE model (2 cell type(s)):
#> - CellType2
#> - CellType1
#> === Simulated matrix dimensions:
#> - n (cells): 4
#> - J (genes): 15
#> - i (# entries): 60
#>
#> DONE