Simulate training and test pseudo-bulk RNA-Seq profiles using the cell composition matrices generated by the generateBulkCellMatrix function. The samples are generated under the assumption that the expression level of the \(i\) gene in the \(j\) bulk sample is given by the sum of the expression levels of the cell types \(X_{ijk}\) that make them up weighted by the proportions of these \(k\) cell types in each sample. In practice, as described in Torroja and Sanchez-Cabo, 2019, these profiles are generated by summing a number of cells of different cell types determined by proportions from a matrix of known cell composition. The number of simulated pseudo-bulk RNA-Seq samples and the number of cells composing each sample are determined by generateBulkCellMatrix (see Documentation) Note: this step can be avoided by using the on.the.fly argument in the trainDDLSModel function. See Documentation for more information.

simBulkProfiles(
  object,
  type.data = "both",
  pseudobulk.function = "AddRawCount",
  file.backend = NULL,
  compression.level = NULL,
  block.processing = FALSE,
  block.size = 1000,
  chunk.dims = NULL,
  threads = 1,
  verbose = TRUE
)

Arguments

object

DigitalDLSorter object with single.cell.real/single.cell.simul and prob.cell.types slots.

type.data

Type of data to generate between 'train', 'test' or 'both' (the last by default).

pseudobulk.function

Function used to build pseudo-bulk samples. It may be:

  • "MeanCPM": single-cell profiles (raw counts) are transformed into CPMs and cross-cell averages are calculated. Then, log2(CPM + 1) is calculated.

  • "AddCPM": single-cell profiles (raw counts) are transformed into CPMs and are added up across cells. Then, log-CPMs are calculated.

  • "AddRawCount": single-cell profiles (raw counts) are added up across cells. Then, log-CPMs are calculated.

file.backend

Valid file path to store the simulated single-cell expression profiles as an HDF5 file (NULL by default). If provided, the data is stored in HDF5 files used as back-end by using the DelayedArray, HDF5Array and rhdf5 packages instead of loading all data into RAM memory. This is suitable for situations where you have large amounts of data that cannot be loaded into memory. Note that operations on this data will be performed in blocks (i.e subsets of determined size) which may result in longer execution times.

compression.level

The compression level used if file.backend is provided. It is an integer value between 0 (no compression) and 9 (highest and slowest compression). See ?getHDF5DumpCompressionLevel from the HDF5Array package for more information.

block.processing

Boolean indicating whether the data should be simulated in blocks (only if file.backend is used, FALSE by default). This functionality is suitable for cases where is not possible to load all data into memory and it leads to larger execution times.

block.size

Only if block.processing = TRUE. Number of pseudo-bulk expression profiles that will be simulated in each iteration during the process. Larger numbers result in higher memory usage but shorter execution times. Set according to available computational resources (1000 by default).

chunk.dims

Specifies the dimensions that HDF5 chunk will have. If NULL, the default value is a vector of two items: the number of genes considered by DigitalDLSorter object during the simulation, and a single sample to reduce read times in the following steps. A larger number of columns written in each chunk can lead to longer read times.

threads

Number of threads used during the simulation of pseudo-bulk samples (1 by default). Set according to computational resources and avoid it if block.size will be used.

verbose

Show informative messages during the execution (TRUE by default).

Value

A DigitalDLSorter object with bulk.simul slot containing a list with one or two entries (depending on selected type.data argument): 'train' and 'test'. Each entry contains a SummarizedExperiment object with simulated bulk samples in the assay slot, sample names in the colData slot and feature names in the rowData slot.

Details

digitalDLSorteR allows the use of HDF5 files as back-end to store the resulting data using the DelayedArray and HDF5Array packages. This functionality allows to work without keeping the data loaded into RAM, which could be of vital importance during some computationally heavy steps such as neural network training on RAM-limited machines. You must provide a valid file path in the file.backend argument to store the resulting file with the '.h5' extension. The data will be accessible from R without being loaded into memory. This option slightly slows down execution times, as subsequent transformations of the data will be done in blocks rather than using all the data. We recommend this option according to the computational resources available and the number of pseudo-bulk samples to be generated.

Note that if you use the file.backend argument with block.processing = FALSE, all pseudo-bulk profiles will be simulated in one step and, therefore, loaded into RAM. Then, the data will be written to an HDF5 file. To avoid the RAM collapse, pseudo-bulk profiles can be simulated and written to HDF5 files in blocks of block.size size by setting block.processing = TRUE.

It is possible to avoid this step by using the on.the.fly argument in the trainDDLSModel function. In this way, data is generated 'on the fly' during the neural network training. For more details, see ?trainDDLSModel.

References

Fischer B, Smith M and Pau, G (2020). rhdf5: R Interface to HDF5. R package version 2.34.0.

Pagès H, Hickey P and Lun A (2020). DelayedArray: A unified framework for working transparently with on-disk and in-memory array-like datasets. R package version 0.16.0.

Pagès H (2020). HDF5Array: HDF5 backend for DelayedArray objects. R package version 1.18.0.

Examples

set.seed(123) # reproducibility
# simulated data
sce <- SingleCellExperiment::SingleCellExperiment(
  assays = list(
    counts = matrix(
      rpois(30, lambda = 5), nrow = 15, ncol = 10,
      dimnames = list(paste0("Gene", seq(15)), paste0("RHC", seq(10)))
    )
  ),
  colData = data.frame(
    Cell_ID = paste0("RHC", seq(10)),
    Cell_Type = sample(x = paste0("CellType", seq(2)), size = 10,
                       replace = TRUE)
  ),
  rowData = data.frame(
    Gene_ID = paste0("Gene", seq(15))
  )
)
DDLS <- createDDLSobject(
  sc.data = sce,
  sc.cell.ID.column = "Cell_ID",
  sc.gene.ID.column = "Gene_ID",
  sc.filt.genes.cluster = FALSE, 
  sc.log.FC = FALSE
)
#> === Bulk RNA-seq data not provided
#> === Processing single-cell data
#>       - Filtering features:
#>          - Selected features: 15
#>          - Discarded features: 0
#> 
#> === No mitochondrial genes were found by using ^mt- as regrex
#> 
#> === Final number of dimensions for further analyses: 15
probMatrixValid <- data.frame(
  Cell_Type = paste0("CellType", seq(2)),
  from = c(1, 30),
  to = c(15, 70)
)
DDLS <- generateBulkCellMatrix(
  object = DDLS,
  cell.ID.column = "Cell_ID",
  cell.type.column = "Cell_Type",
  prob.design = probMatrixValid,
  num.bulk.samples = 10,
  verbose = TRUE
)
#> 
#> === The number of bulk RNA-Seq samples that will be generated is equal to 10
#> 
#> === Training set cells by type:
#>     - CellType1: 4
#>     - CellType2: 3
#> === Test set cells by type:
#>     - CellType1: 2
#>     - CellType2: 1
#> === Probability matrix for training data:
#>     - Bulk RNA-Seq samples: 8
#>     - Cell types: 2
#> === Probability matrix for test data:
#>     - Bulk RNA-Seq samples: 2
#>     - Cell types: 2
#> DONE
DDLS <- simBulkProfiles(DDLS, verbose = TRUE)
#> === Setting parallel environment to 1 thread(s)
#> 
#> === Generating train bulk samples:
#> 
#> === Generating test bulk samples:
#> 
#> DONE