R/simSingleCell.R
simSCProfiles.Rd
Simulate single-cell expression profiles by randomly sampling from a negative
binomial distribution and inserting dropouts by sampling from a binomial
distribution using the ZINB-WaVE parameters estimated by the
estimateZinbwaveParams function.
simSCProfiles(
  object,
  cell.ID.column,
  cell.type.column,
  n.cells,
  suffix.names = "_Simul",
  cell.types = NULL,
  file.backend = NULL,
  name.dataset.backend = NULL,
  compression.level = NULL,
  block.processing = FALSE,
  block.size = 1000,
  chunk.dims = NULL,
  verbose = TRUE
)
object
DigitalDLSorter object with single.cell.real and zinb.params slots.
cell.ID.column
Name or number of the column in the cells metadata that contains the cell names of the expression matrix.
cell.type.column
Name or number of the column in the cells metadata that contains the cell type of each cell.
n.cells
Number of simulated cells generated per cell type (e.g., if your dataset
has 10 different cell types and n.cells = 100, then 1000 cell profiles
will be simulated).
suffix.names
Suffix added to the names of the simulated cells. It must appear only in simulated cell names, so make sure that it does not occur in the real cell names.
cell.types
Vector indicating the cell types to simulate. If NULL (by default),
n.cells single-cell profiles will be simulated for every cell type.
file.backend
Valid file path to store the simulated single-cell expression profiles as an
HDF5 file (NULL by default). If provided, the data is stored in an HDF5 file
used as back-end through the DelayedArray, HDF5Array and rhdf5 packages
instead of being loaded into RAM. This is suitable for situations where you
have large amounts of data that cannot be loaded into memory. Note that
operations on this data will be performed in blocks (i.e., subsets of
determined size), which may result in longer execution times.
name.dataset.backend
Name of the dataset in the HDF5 file to be used. Note that a dataset with
this name must not already exist in the file. If NULL (by default), a random
dataset name will be used.
compression.level
The compression level used if file.backend is provided. It is an integer
value between 0 (no compression) and 9 (highest and slowest compression).
See ?getHDF5DumpCompressionLevel from the HDF5Array package for more
information.
block.processing
Boolean indicating whether the data should be simulated in blocks (only if
file.backend is used, FALSE by default). This functionality is suitable for
cases where it is not possible to load all data into memory, at the cost of
longer execution times.
block.size
Only if block.processing = TRUE. Number of single-cell expression profiles
that will be simulated in each iteration during the process. Larger numbers
result in higher memory usage but shorter execution times. Set according to
available computational resources (1000 by default). Note that it cannot be
greater than the total number of simulated cells.
chunk.dims
Specifies the dimensions of the HDF5 chunks. If NULL, the default value is a
vector of two items: the number of genes considered by the ZINB-WaVE model
during the simulation and a single sample, in order to reduce read times in
the following steps. A larger number of columns written in each chunk can
lead to longer read times in subsequent steps. Note that it cannot be greater
than the dimensions of the simulated matrix.
verbose
Show informative messages during the execution (TRUE by default).
A DigitalDLSorter object with the single.cell.simul slot containing a
SingleCellExperiment object with the simulated single-cell expression
profiles.
Before this step, see ?estimateZinbwaveParams. As described in Torroja and
Sánchez-Cabo, 2019, this function simulates a given number of transcriptional
profiles for each cell type provided by randomly sampling from a negative
binomial distribution with estimated parameters \(\mu\) and \(\theta\) and
inserting dropouts by sampling from a binomial distribution with probability
\(\pi\). All parameters are estimated from real single-cell data using the
estimateZinbwaveParams function, which fits the ZINB-WaVE model (Risso et
al., 2018). For more details about the model, see ?estimateZinbwaveParams and
Risso et al., 2018.
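The sampling scheme described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the parameter values are made up, the object names (mu, pi.drop, theta, sim.counts) are hypothetical, and using \(\theta\) as the size argument of rnbinom is an assumption about the parameterization. The shapes (cells by genes for \(\mu\) and \(\pi\), one \(\theta\) per gene) follow the dimensions printed in the example output of this page.

```r
# Sketch of zero-inflated negative binomial sampling with base R only.
# All parameter values are invented for illustration; in the package they
# come from the ZINB-WaVE model fitted by estimateZinbwaveParams.
set.seed(123)
n.cells <- 4
n.genes <- 15

# Hypothetical parameters: mu and pi.drop are cells x genes, theta is per gene
mu <- matrix(rgamma(n.cells * n.genes, shape = 2, rate = 0.4), nrow = n.cells)
pi.drop <- matrix(runif(n.cells * n.genes, min = 0, max = 0.5), nrow = n.cells)
theta <- rgamma(n.genes, shape = 2, rate = 1)

# Negative binomial counts with mean mu and size theta (assumed mapping).
# Matrices are column-major, so each gene's theta is repeated once per cell.
counts <- matrix(
  rnbinom(n.cells * n.genes, size = rep(theta, each = n.cells),
          mu = as.vector(mu)),
  nrow = n.cells
)

# Dropout indicators drawn from a binomial: 1 forces the entry to zero
dropouts <- matrix(
  rbinom(n.cells * n.genes, size = 1, prob = as.vector(pi.drop)),
  nrow = n.cells
)

sim.counts <- counts * (1 - dropouts)
dim(sim.counts)
```

Every entry flagged as a dropout is zero in sim.counts, while the remaining entries keep their negative binomial draw.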
The file.backend argument allows creating an HDF5 file with the simulated
single-cell profiles, to be used as back-end to work with data stored on disk
instead of loaded into RAM. If the file.backend argument is used with
block.processing = FALSE, all the single-cell profiles will be simulated in
one step and, therefore, loaded into RAM. Then, the data will be written to
the HDF5 file. To avoid exhausting RAM when too many single-cell profiles are
simulated, the profiles can be simulated and written to the HDF5 file in
blocks of block.size cells by setting block.processing = TRUE.
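To make the block-wise strategy concrete, the following base R sketch splits a total number of cells into blocks and processes one block at a time. It is only an illustration under stated assumptions: the variable names (total.cells, starts, ends, cells.written) are hypothetical, the Poisson draw is a placeholder for the actual simulation, and the write to the HDF5 file is left as a comment rather than a real call.

```r
# Sketch of block-wise simulation; names and values are illustrative only.
set.seed(123)
n.genes <- 15
total.cells <- 2500
block.size <- 1000

# First and last cell index of each block; the last block may be smaller
starts <- seq(1, total.cells, by = block.size)
ends <- pmin(starts + block.size - 1, total.cells)

cells.written <- 0
for (b in seq_along(starts)) {
  n.block <- ends[b] - starts[b] + 1
  # Placeholder for the per-block simulation (genes x cells of this block)
  block <- matrix(rpois(n.genes * n.block, lambda = 5), nrow = n.genes)
  # A real implementation would append `block` to the HDF5 dataset here
  cells.written <- cells.written + ncol(block)
  rm(block) # only one block is held in memory at a time
}
cells.written
```

Peak memory is bounded by a single block of block.size cells instead of the whole matrix, which is the trade-off the block.size argument controls: larger blocks mean fewer iterations but more memory per iteration.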
Risso, D., Perraudeau, F., Gribkova, S. et al. (2018). A general and flexible method for signal extraction from single-cell RNA-seq data. Nat Commun 9, 284. doi:10.1038/s41467-017-02554-5.
Torroja, C. and Sánchez-Cabo, F. (2019). digitalDLSorter: A Deep Learning algorithm to quantify immune cell populations based on scRNA-Seq data. Frontiers in Genetics 10, 978. doi:10.3389/fgene.2019.00978.
set.seed(123) # reproducibility
sce <- SingleCellExperiment::SingleCellExperiment(
  assays = list(
    counts = matrix(
      # the 30 sampled values are recycled to fill the 15 x 10 matrix
      rpois(30, lambda = 5), nrow = 15, ncol = 10,
      dimnames = list(paste0("Gene", seq(15)), paste0("RHC", seq(10)))
    )
  ),
  colData = data.frame(
    Cell_ID = paste0("RHC", seq(10)),
    Cell_Type = sample(x = paste0("CellType", seq(2)), size = 10,
                       replace = TRUE)
  ),
  rowData = data.frame(
    Gene_ID = paste0("Gene", seq(15))
  )
)
DDLS <- createDDLSobject(
  sc.data = sce,
  sc.cell.ID.column = "Cell_ID",
  sc.gene.ID.column = "Gene_ID",
  sc.filt.genes.cluster = FALSE,
  sc.log.FC = FALSE
)
#> === Bulk RNA-seq data not provided
#> === Processing single-cell data
#> - Filtering features:
#> - Selected features: 15
#> - Discarded features: 0
#>
#> === No mitochondrial genes were found by using ^mt- as regrex
#>
#> === Final number of dimensions for further analyses: 15
DDLS <- estimateZinbwaveParams(
  object = DDLS,
  cell.type.column = "Cell_Type",
  cell.ID.column = "Cell_ID",
  gene.ID.column = "Gene_ID",
  subset.cells = 4,
  verbose = FALSE
)
DDLS <- simSCProfiles(
  object = DDLS,
  cell.ID.column = "Cell_ID",
  cell.type.column = "Cell_Type",
  n.cells = 2,
  verbose = TRUE
)
#> === Getting parameters from model:
#> - mu: 4, 15
#> - pi: 4, 15
#> - Theta: 15
#> === Selected cell type(s) from ZINB-WaVE model (2 cell type(s)):
#> - CellType2
#> - CellType1
#> === Simulated matrix dimensions:
#> - n (cells): 4
#> - J (genes): 15
#> - i (# entries): 60
#>
#> DONE