Generate training and test cell composition matrices for the simulation of pseudo-bulk RNA-Seq samples with known cell composition using single-cell expression profiles. Each resulting ProbMatrixCellTypes object contains a matrix that determines the proportions of the different cell types that will compose the simulated pseudo-bulk samples, along with other information relevant to the process. This function does not simulate pseudo-bulk samples; that task is performed by the simBulkProfiles or trainDDLSModel functions (see Documentation).

generateBulkCellMatrix(
  object,
  cell.ID.column,
  cell.type.column,
  prob.design,
  num.bulk.samples,
  n.cells = 100,
  train.freq.cells = 3/4,
  train.freq.bulk = 3/4,
  proportion.method = c(10, 5, 20, 15, 35, 15),
  prob.sparsity = 0.5,
  min.zero.prop = NULL,
  balanced.type.cells = FALSE,
  verbose = TRUE
)

Arguments

object

DigitalDLSorter object with single.cell.real slot and, optionally, with single.cell.simul slot.

cell.ID.column

Name or number of the column in the cells metadata that contains the cell names used in the expression matrix.

cell.type.column

Name or number of the column in the cells metadata that contains the cell type of each cell.

prob.design

Data frame with the expected frequency ranges for each cell type present in the experiment. This information can be estimated from the literature or from the single-cell experiment itself. This data frame must consist of three columns with specific headings (see examples):

  • A cell type column with the same name as the cell type column in the cells metadata (cell.type.column). If the name of the column is not the same, the function will return an error. All cell types must appear in the cells metadata.

  • A second column called 'from' with the start frequency for each cell type.

  • A third column called 'to' with the ending frequency for each cell type.
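As a minimal sketch of the layout described above, the following builds a prob.design data frame and checks it against a cells metadata table. The metadata, the cell type names, and the 0-100 scale for 'from' and 'to' are assumptions made for illustration (the scale mirrors the example at the end of this page); the checks are not the function's own validation code.

```r
# Hypothetical cells metadata; only the Cell_Type column matters here.
cells.metadata <- data.frame(
  Cell_ID = paste0("Cell", 1:6),
  Cell_Type = rep(c("B", "T", "Mono"), each = 2)
)

# One row per cell type. The first column is named exactly like the
# cell type column of the metadata (cell.type.column = "Cell_Type").
prob.design <- data.frame(
  Cell_Type = c("B", "T", "Mono"),
  from = c(5, 20, 1),
  to = c(30, 70, 15)
)

# Illustrative sanity checks (assumed, not taken from the package).
stopifnot(
  identical(colnames(prob.design), c("Cell_Type", "from", "to")),
  setequal(prob.design$Cell_Type, unique(cells.metadata$Cell_Type)),
  all(prob.design$from < prob.design$to),
  all(prob.design$from >= 0),
  all(prob.design$to <= 100)
)
```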

num.bulk.samples

Number of bulk RNA-Seq sample proportions (and thus simulated bulk RNA-Seq samples) to be generated, counting both training and test data. We recommend setting this value according to the number of single-cell profiles available in the DigitalDLSorter object: large enough to train the model well, but not so large that the same cells are excessively re-sampled.

n.cells

Number of cells that will be aggregated in order to simulate one bulk RNA-Seq sample (100 by default).

train.freq.cells

Proportion of cells used to simulate training pseudo-bulk samples (3/4 by default).

train.freq.bulk

Proportion of the total number of bulk RNA-Seq samples (num.bulk.samples) used for the training set (3/4 by default).

proportion.method

Vector of six integers that determines the proportions of bulk samples generated by the different methods (see Details and Torroja and Sánchez-Cabo, 2019, for more information). This vector represents percentages, so its entries must add up to 100. By default, a majority of random samples will be generated without using predefined ranges.
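To make the arithmetic concrete, the sketch below converts the default proportion.method vector into a number of samples per method. The total of 1000 samples and the `method1` to `method6` labels are illustrative assumptions; how the function itself rounds and distributes these counts between training and test sets is not shown here.

```r
# Default percentages for the six methods (see Details).
proportion.method <- c(10, 5, 20, 15, 35, 15)
stopifnot(length(proportion.method) == 6, sum(proportion.method) == 100)

# Hypothetical total number of bulk samples.
num.bulk.samples <- 1000

# Expected number of samples contributed by each method.
samples.per.method <- num.bulk.samples * proportion.method / 100
names(samples.per.method) <- paste0("method", 1:6)
samples.per.method
```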

prob.sparsity

It only affects the proportions generated by the method based on the Dirichlet distribution. It determines the probability of having missing cell types in each simulated sample, as opposed to a mixture of all cell types. A higher value for this parameter will result in sparser simulated samples.

min.zero.prop

This parameter controls the minimum number of cell types that will be absent in each simulated sample. If NULL (by default), this value will be half of the total number of different cell types; increasing it will result in more samples composed of fewer cell types. This helps to create sparser proportions and cover a wider range of situations during model training.
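The interplay between prob.sparsity and min.zero.prop can be pictured with the rough sketch below. It is one possible reading of the two descriptions above, not the package's implementation: `sparsify` is a hypothetical helper, and the use of `floor(k / 2)` for the NULL default and the uniform choice of how many cell types to drop are assumptions.

```r
# Hypothetical helper, NOT part of digitalDLSorter.
sparsify <- function(props, prob.sparsity = 0.5, min.zero = NULL) {
  k <- length(props)
  # Assumed default: half of the cell types, rounded down.
  if (is.null(min.zero)) min.zero <- floor(k / 2)
  # At least one cell type must remain.
  min.zero <- min(min.zero, k - 1)
  # With probability 1 - prob.sparsity, keep the full mixture.
  if (runif(1) >= prob.sparsity) return(props)
  # Otherwise choose how many cell types to drop (assumed uniform).
  candidates <- seq(min.zero, k - 1)
  n.zero <- candidates[sample.int(length(candidates), 1)]
  props[sample.int(k, n.zero)] <- 0
  # Renormalize so the remaining proportions add up to 1.
  props / sum(props)
}

set.seed(1)
sparsify(c(0.1, 0.2, 0.3, 0.4), prob.sparsity = 1, min.zero = 2)
```

In this sketch, raising either argument increases the number of zero entries: prob.sparsity makes sparse samples more frequent, while min.zero makes each sparse sample sparser.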

balanced.type.cells

Boolean indicating whether the training and test cells will be split in a balanced way considering the cell types (FALSE by default).
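The difference between a balanced and an unbalanced split can be sketched in base R as follows. The cell names, the 12/4 cell type counts, and the use of `round` are assumptions chosen so that the numbers come out exact; the function's own splitting and rounding rules may differ.

```r
set.seed(42)
# Hypothetical pool: 12 cells of one type, 4 of another.
cell.types <- rep(c("CellType1", "CellType2"), times = c(12, 4))
names(cell.types) <- paste0("Cell", seq_along(cell.types))
train.freq.cells <- 3 / 4

# Unbalanced split: draw training cells from the whole pool, so the
# cell type composition of the training set varies by chance.
train.simple <- sample(
  names(cell.types),
  size = round(train.freq.cells * length(cell.types))
)

# Balanced split: draw the same fraction within each cell type.
train.balanced <- unlist(lapply(
  split(names(cell.types), cell.types),
  function(ids) sample(ids, size = round(train.freq.cells * length(ids)))
), use.names = FALSE)

table(cell.types[train.simple])
table(cell.types[train.balanced])
```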

verbose

Show informative messages during the execution (TRUE by default).

Value

A DigitalDLSorter object with prob.cell.types slot containing a list with two ProbMatrixCellTypes objects (training and test). For more information about the structure of this class, see ?ProbMatrixCellTypes.

Details

First, the available single-cell profiles are split into training and test subsets (3/4 for training and 1/4 for test by default; see train.freq.cells) to avoid biasing the results during model evaluation. Next, the cell type proportions for num.bulk.samples bulk samples are built, and the single-cell profiles used to simulate each pseudo-bulk RNA-Seq sample are selected, with 100 cells per bulk sample by default (see the n.cells argument). The proportions of training and test pseudo-bulk samples are set by train.freq.bulk (3/4 for training and 1/4 for testing by default). Finally, in order to avoid biases due to the composition of the pseudo-bulk RNA-Seq samples, cell type proportions (\(w_1,\dots,w_k\), where \(k\) is the number of cell types available in the single-cell profiles) are randomly generated using six different approaches:

  1. Cell proportions are randomly sampled from a truncated uniform distribution with predefined limits according to a priori knowledge of the abundance of each cell type (see prob.design argument). This information can be inferred from the single-cell experiment itself or from the literature.

  2. A second set is generated by randomly permuting cell type labels from a distribution generated by the previous method.

  3. Cell proportions are randomly sampled as in method 1, but without replacement.

  4. A further set is generated by randomly sampling cell type labels from the proportions generated by the previous method.

  5. Cell proportions are randomly sampled from a Dirichlet distribution.

  6. Pseudo-bulk RNA-Seq samples composed of the same cell type are generated in order to provide 'pure' pseudo-bulk samples.

The distribution of cell type proportions generated by each method can be inspected with the showProbPlot function (see Documentation).
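As a rough illustration of methods 1, 2, 5 and 6 in the list above, the base R sketch below draws one composition per method for three hypothetical cell types. The helper names (`draw.uniform`, `draw.dirichlet`), the rescaling to 100, and the flat Dirichlet parameters are assumptions made for this example and are not taken from the package's source code.

```r
set.seed(123)
# Hypothetical design on a 0-100 scale.
prob.design <- data.frame(
  Cell_Type = paste0("CellType", 1:3),
  from = c(1, 30, 10),
  to = c(15, 70, 40)
)

# Method 1 (sketch): one uniform draw per cell type within [from, to],
# rescaled so that the proportions add up to 100. Note that rescaling
# can move a value slightly outside its original range.
draw.uniform <- function(design) {
  w <- runif(nrow(design), min = design$from, max = design$to)
  setNames(100 * w / sum(w), design$Cell_Type)
}

# Method 5 (sketch): Dirichlet draw obtained by normalizing
# independent gamma variates (alpha = 1 gives a flat Dirichlet).
draw.dirichlet <- function(k, alpha = rep(1, k)) {
  g <- rgamma(k, shape = alpha)
  100 * g / sum(g)
}

w1 <- draw.uniform(prob.design)
# Method 2 (sketch): same values, cell type labels permuted.
w2 <- setNames(sample(unname(w1)), names(w1))
w5 <- setNames(draw.dirichlet(nrow(prob.design)), prob.design$Cell_Type)
# Method 6 (sketch): a 'pure' sample made of a single cell type.
w6 <- setNames(c(100, 0, 0), prob.design$Cell_Type)

round(rbind(w1, w2, w5, w6), 1)
```

Each row of the printed matrix is one simulated composition; in every case the entries add up to 100.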

References

Torroja, C. and Sánchez-Cabo, F. (2019). digitalDLSorter: A Deep Learning algorithm to quantify immune cell populations based on scRNA-Seq data. Frontiers in Genetics 10, 978. doi:10.3389/fgene.2019.00978

Examples

set.seed(123) # reproducibility
# simulated data
sce <- SingleCellExperiment::SingleCellExperiment(
  assays = list(
    counts = matrix(
      rpois(30, lambda = 5), nrow = 15, ncol = 10, 
      dimnames = list(paste0("Gene", seq(15)), paste0("RHC", seq(10)))
    )
  ),
  colData = data.frame(
    Cell_ID = paste0("RHC", seq(10)),
    Cell_Type = sample(x = paste0("CellType", seq(2)), size = 10, 
                       replace = TRUE)
  ),
  rowData = data.frame(
    Gene_ID = paste0("Gene", seq(15))
  )
)
DDLS <- createDDLSobject(
  sc.data = sce,
  sc.cell.ID.column = "Cell_ID",
  sc.gene.ID.column = "Gene_ID",
  sc.filt.genes.cluster = FALSE, 
  sc.log.FC = FALSE
)
#> === Bulk RNA-seq data not provided
#> === Processing single-cell data
#>       - Filtering features:
#>          - Selected features: 15
#>          - Discarded features: 0
#> 
#> === No mitochondrial genes were found by using ^mt- as regrex
#> 
#> === Final number of dimensions for further analyses: 15
probMatrixValid <- data.frame(
  Cell_Type = paste0("CellType", seq(2)),
  from = c(1, 30),
  to = c(15, 70)
)
DDLS <- generateBulkCellMatrix(
  object = DDLS,
  cell.ID.column = "Cell_ID",
  cell.type.column = "Cell_Type",
  prob.design = probMatrixValid,
  num.bulk.samples = 10,
  verbose = TRUE
)
#> 
#> === The number of bulk RNA-Seq samples that will be generated is equal to 10
#> 
#> === Training set cells by type:
#>     - CellType1: 4
#>     - CellType2: 3
#> === Test set cells by type:
#>     - CellType1: 2
#>     - CellType2: 1
#> === Probability matrix for training data:
#>     - Bulk RNA-Seq samples: 8
#>     - Cell types: 2
#> === Probability matrix for test data:
#>     - Bulk RNA-Seq samples: 2
#>     - Cell types: 2
#> DONE