Simulate training and test mixed spot transcriptional profiles using cell
composition matrices generated by the genMixedCellProp function.
Usage
simMixedProfiles(
  object,
  type.data = "both",
  mixing.function = "AddRawCount",
  file.backend = NULL,
  compression.level = NULL,
  block.processing = FALSE,
  block.size = 1000,
  chunk.dims = NULL,
  threads = 1,
  verbose = TRUE
)
Arguments
- object
SpatialDDLS object with single.cell.real/single.cell.simul and prob.cell.types slots.
- type.data
Type of data to generate: 'train', 'test' or 'both' (the last by default).
- mixing.function
Function used to build mixed transcriptional profiles. It may be:
"AddRawCount" (default): single-cell profiles (raw counts) are added up across cells. Then, log-CPMs are calculated.
"MeanCPM": single-cell profiles (raw counts) are transformed into CPMs and cross-cell averages are calculated. Then, log2(CPM + 1) is calculated.
"AddCPM": single-cell profiles (raw counts) are transformed into CPMs and are added up across cells. Then, log-CPMs are calculated.
- file.backend
Valid file path to store simulated mixed expression profiles as an HDF5 file (NULL by default). If provided, data are stored in an HDF5 file used as back-end through the DelayedArray, HDF5Array and rhdf5 packages instead of being loaded into RAM. Note that operations on this matrix will be performed in blocks (i.e., subsets of a determined size), which may result in longer execution times.
- compression.level
The compression level used if file.backend is provided. It is an integer value between 0 (no compression) and 9 (highest and slowest compression). See ?getHDF5DumpCompressionLevel from the HDF5Array package for more information.
- block.processing
Boolean indicating whether data should be simulated in blocks (only if file.backend is used, FALSE by default). This functionality is suitable for cases where it is not possible to load all data into memory, and it leads to longer execution times.
- block.size
Only if block.processing = TRUE. Number of mixed expression profiles that will be simulated in each iteration. Larger numbers result in higher memory usage but shorter execution times. Set according to the available computational resources (1000 by default).
- chunk.dims
Dimensions of the HDF5 chunks. If NULL, the default value is a vector of two items: the number of genes considered by the SpatialDDLS object during the simulation, and a single sample to reduce read times in the following steps. A larger number of columns written in each chunk can lead to longer read times.
- threads
Number of threads used during simulation (1 by default).
- verbose
Show informative messages during the execution (TRUE by default).
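To make the difference between the mixing functions concrete, the following base R sketch contrasts "AddRawCount" and "MeanCPM" on made-up counts. It is not the package's implementation: the helper toy_cpm(), the toy matrix, and the use of log2(CPM + 1) as the log transform in both cases are assumptions made for illustration.

```r
# Toy raw counts: 3 genes x 4 cells (made-up values, one column per cell)
counts <- matrix(
  c(10, 0, 5,
     2, 8, 0,
     4, 4, 4,
     0, 1, 9),
  nrow = 3,
  dimnames = list(paste0("Gene", 1:3), paste0("Cell", 1:4))
)

# Hypothetical helper: counts per million for each column (cell)
toy_cpm <- function(x) t(t(x) / colSums(x)) * 1e6

# "AddRawCount"-style: add raw counts across cells, then normalize and log
pooled <- rowSums(counts)
add_raw <- log2(pooled / sum(pooled) * 1e6 + 1)

# "MeanCPM"-style: normalize each cell first, average across cells, then log
mean_cpm <- log2(rowMeans(toy_cpm(counts)) + 1)

print(round(cbind(AddRawCount = add_raw, MeanCPM = mean_cpm), 2))
```

Under these assumptions, summing raw counts before normalizing lets cells with larger library sizes contribute more to the mixed profile, whereas averaging CPMs gives every cell the same weight.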
Value
A SpatialDDLS object with mixed.profiles slot containing a list with one or
two entries (depending on the selected type.data argument): 'train' and
'test'. Each entry consists of a SummarizedExperiment object with the
simulated mixed spot profiles.
Details
Mixed profiles are generated under the assumption that the expression level
of a particular gene in a given spot is the sum of the expression levels of
the cell types that make it up weighted by their proportions. In practice, as
described in Torroja and Sanchez-Cabo, 2019, these profiles are generated by
summing gene expression levels of a determined number of cells specified by a
known cell composition matrix. The number of simulated spots and cells used
to simulate each spot are determined by the genMixedCellProp function. This
step can be avoided by using the on.the.fly argument in the trainDeconvModel
function.
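The summing step described above can be sketched in base R as follows. This is a minimal illustration, not the package's code: the toy counts, the cell type labels, and the vector n_cells (standing in for one row of a composition matrix expressed as cell numbers) are all hypothetical.

```r
set.seed(1)

# Hypothetical single-cell counts: 5 genes x 12 cells, 4 cells per cell type
cell_types <- rep(paste0("CellType", 1:3), each = 4)
counts <- matrix(
  rpois(5 * 12, lambda = 5), nrow = 5,
  dimnames = list(paste0("Gene", 1:5), paste0("Cell", 1:12))
)

# Assumed composition of one simulated spot: number of cells of each type
n_cells <- c(CellType1 = 3, CellType2 = 1, CellType3 = 0)

# Pick the requested number of cells of each type (without replacement)
picked <- unlist(lapply(names(n_cells), function(ct) {
  pool <- which(cell_types == ct)
  pool[sample.int(length(pool), n_cells[[ct]])]
}))

# The mixed profile is the gene-wise sum of the selected cells
spot <- rowSums(counts[, picked, drop = FALSE])
print(spot)
```

Indexing the pool with sample.int() rather than calling sample() on it avoids the base R pitfall where sample() treats a single number as a range to draw from.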
SpatialDDLS can use HDF5 files as back-end to store simulated data through
the DelayedArray and HDF5Array packages. This makes it possible to work
without keeping the data loaded into RAM, which can be useful during
computationally heavy steps such as neural network training on RAM-limited
machines. You must provide a valid file path with the '.h5' extension in the
file.backend argument to store the resulting file. This option slows down
execution, as subsequent transformations of the data will be done in blocks.
Note that if you use the file.backend argument with block.processing = FALSE,
all mixed profiles will be simulated in one step and, thus, loaded into RAM.
Then, the matrix will be written to an HDF5 file. To avoid running out of
RAM, these profiles can be simulated and written to HDF5 files in blocks of
block.size profiles by setting block.processing = TRUE. Whether to use this
option depends on the computational resources available and the number of
simulated spots to be generated; in most cases, it is not necessary.
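The idea behind block processing can be illustrated with a base R sketch in which only one block of profiles is held in memory at a time. It is a conceptual stand-in under stated assumptions: a plain binary file replaces the HDF5 back-end, and simulate_block() is a hypothetical simulator that returns dummy values.

```r
n_genes <- 4
n_spots <- 10
block_size <- 3

# Hypothetical simulator: genes x spots matrix for the given spot indices
# (each column is simply filled with its spot index)
simulate_block <- function(idx) {
  matrix(as.numeric(idx), nrow = n_genes, ncol = length(idx), byrow = TRUE)
}

# Split the spot indices into consecutive blocks of at most block_size spots
blocks <- split(seq_len(n_spots), ceiling(seq_len(n_spots) / block_size))

# Simulate and append one block at a time (column-major, so spots stay ordered)
path <- tempfile(fileext = ".bin")
con <- file(path, open = "wb")
for (idx in blocks) {
  writeBin(as.vector(simulate_block(idx)), con)
}
close(con)

# Reading the file back yields the same matrix as simulating in one step
on_disk <- matrix(
  readBin(path, what = "double", n = n_genes * n_spots),
  nrow = n_genes
)
print(dim(on_disk))
```

Smaller blocks lower peak memory use at the cost of more iterations and write operations, which mirrors the trade-off described for block.size.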
References
Fischer B, Smith M and Pau G (2020). rhdf5: R Interface to HDF5. R package version 2.34.0.
Pagès H, Hickey P and Lun A (2020). DelayedArray: A unified framework for working transparently with on-disk and in-memory array-like datasets. R package version 0.16.0.
Pagès H (2020). HDF5Array: HDF5 backend for DelayedArray objects. R package version 1.18.0.
Examples
set.seed(123)
sce <- SingleCellExperiment::SingleCellExperiment(
  assays = list(
    counts = matrix(
      rpois(100, lambda = 5), nrow = 40, ncol = 30,
      dimnames = list(paste0("Gene", seq(40)), paste0("RHC", seq(30)))
    )
  ),
  colData = data.frame(
    Cell_ID = paste0("RHC", seq(30)),
    Cell_Type = sample(x = paste0("CellType", seq(4)), size = 30,
                       replace = TRUE)
  ),
  rowData = data.frame(
    Gene_ID = paste0("Gene", seq(40))
  )
)
SDDLS <- createSpatialDDLSobject(
  sc.data = sce,
  sc.cell.ID.column = "Cell_ID",
  sc.gene.ID.column = "Gene_ID",
  sc.filt.genes.cluster = FALSE,
  project = "Simul_example"
)
#> === Spatial transcriptomics data not provided
#> === Processing single-cell data
#> - Filtering features:
#> - Selected features: 40
#> - Discarded features: 0
#>
#> === No mitochondrial genes were found by using ^mt- as regrex
#>
#> === Final number of dimensions for further analyses: 40
SDDLS <- genMixedCellProp(
  object = SDDLS,
  cell.ID.column = "Cell_ID",
  cell.type.column = "Cell_Type",
  num.sim.spots = 10,
  train.freq.cells = 2/3,
  train.freq.spots = 2/3,
  verbose = TRUE
)
#>
#> === The number of mixed profiles that will be generated is equal to 10
#>
#> === Training set cells by type:
#> - CellType1: 5
#> - CellType2: 5
#> - CellType3: 5
#> - CellType4: 5
#> === Test set cells by type:
#> - CellType1: 2
#> - CellType2: 3
#> - CellType3: 3
#> - CellType4: 2
#> === Probability matrix for training data:
#> - Mixed spots: 7
#> - Cell types: 4
#> === Probability matrix for test data:
#> - Mixed spots: 3
#> - Cell types: 4
#> DONE
SDDLS <- simMixedProfiles(SDDLS, verbose = TRUE)
#> === Setting parallel environment to 1 thread(s)
#>
#> === Generating train mixed profiles:
#>
#> === Generating test mixed profiles:
#>
#> DONE