Generate training and test cell composition matrices for the simulation of
pseudo-bulk RNA-Seq samples with known cell composition using single-cell
expression profiles. The resulting ProbMatrixCellTypes object contains a
matrix that determines the proportions of the different cell types that will
compose the simulated pseudo-bulk samples, along with other information
relevant to the process. This function does not simulate pseudo-bulk samples;
that task is performed by the simBulkProfiles or trainDDLSModel functions
(see Documentation).
generateBulkCellMatrix(
object,
cell.ID.column,
cell.type.column,
prob.design,
num.bulk.samples,
n.cells = 100,
train.freq.cells = 3/4,
train.freq.bulk = 3/4,
proportion.method = c(10, 5, 20, 15, 35, 15),
prob.sparsity = 0.5,
min.zero.prop = NULL,
balanced.type.cells = FALSE,
verbose = TRUE
)
DigitalDLSorter object with single.cell.real slot and, optionally, with single.cell.simul slot.
Name or number of the column in the cells metadata that contains the cell names of the expression matrix.
Name or number of the column in the cells metadata that contains the cell type of each cell.
Data frame with the expected frequency ranges for each cell type present in the experiment. This information can be estimated from the literature or from the single-cell experiment itself. The data frame must consist of three columns with specific headings (see examples):
A cell type column with the same name as the cell type column in the cells metadata (cell.type.column). If the column names differ, the function will return an error. All cell types must appear in the cells metadata.
A second column called 'from' with the start frequency for each cell type.
A third column called 'to' with the end frequency for each cell type.
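The requirements above can be checked before calling the function. The following is a minimal sketch in base R. check_prob_design is a hypothetical helper (it is not part of digitalDLSorteR), and the test that 'from' does not exceed 'to' is an assumption not stated in the text.

```r
# Hypothetical helper, not part of digitalDLSorteR: checks a prob.design
# data frame against the requirements listed above.
check_prob_design <- function(prob.design, cells.metadata, cell.type.column) {
  expected <- c(cell.type.column, "from", "to")
  missing.cols <- setdiff(expected, colnames(prob.design))
  if (length(missing.cols) > 0) {
    stop("prob.design lacks columns: ", paste(missing.cols, collapse = ", "))
  }
  # every cell type in prob.design must be present in the cells metadata
  unknown <- setdiff(
    prob.design[[cell.type.column]], cells.metadata[[cell.type.column]]
  )
  if (length(unknown) > 0) {
    stop("cell types not found in metadata: ", paste(unknown, collapse = ", "))
  }
  # assumption: a range is only meaningful if from <= to
  if (any(prob.design$from > prob.design$to)) {
    stop("'from' must not exceed 'to'")
  }
  invisible(TRUE)
}

# toy metadata and design, mirroring the names used in the examples
metadata <- data.frame(
  Cell_ID = paste0("RHC", seq(4)),
  Cell_Type = c("CellType1", "CellType2", "CellType1", "CellType2")
)
design <- data.frame(
  Cell_Type = c("CellType1", "CellType2"),
  from = c(1, 30),
  to = c(15, 70)
)
check_prob_design(design, metadata, "Cell_Type")
```

Failing early with a descriptive message is the only design goal here; the checks performed by the package itself may be stricter.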
Number of bulk RNA-Seq sample proportions (and thus simulated bulk RNA-Seq samples) to be generated, counting both training and test data. We recommend setting this value according to the number of single-cell profiles available in the DigitalDLSorter object: large enough to generate many samples for better training, but without excessive re-sampling of the same cells.
Number of cells that will be aggregated in order to simulate one bulk RNA-Seq sample (100 by default).
Proportion of cells used to simulate training pseudo-bulk samples (3/4 by default).
Proportion of bulk RNA-Seq samples, out of the total number (num.bulk.samples), used for the training set (3/4 by default).
Vector of six integers that determines the proportions of bulk samples generated by the different methods (see Details and Torroja and Sánchez-Cabo, 2019 for more information). This vector represents proportions, so its entries must add up to 100. By default, a majority of random samples will be generated without using predefined ranges.
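To make the role of this vector concrete, the sketch below turns six percentages into per-method sample counts. allocate_samples is a hypothetical helper, and the largest-remainder rounding it uses is an assumption; the package may round differently.

```r
# Hypothetical helper, not part of digitalDLSorteR: splits a number of
# bulk samples across six methods according to percentages.
allocate_samples <- function(num.bulk.samples, proportion.method) {
  stopifnot(
    length(proportion.method) == 6,
    sum(proportion.method) == 100
  )
  exact <- num.bulk.samples * proportion.method / 100
  counts <- floor(exact)
  # assumption: leftover samples go to the methods with the largest remainders
  leftover <- num.bulk.samples - sum(counts)
  if (leftover > 0) {
    idx <- order(exact - counts, decreasing = TRUE)[seq_len(leftover)]
    counts[idx] <- counts[idx] + 1
  }
  counts
}

counts <- allocate_samples(1000, c(10, 5, 20, 15, 35, 15))
counts
```

With 1000 samples and the default vector, every percentage maps to a whole number of samples, so no leftover handling is needed in this case.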
It only affects the proportions generated from the Dirichlet distribution (see Details). It determines the probability of having missing cell types in each simulated sample, as opposed to a mixture of all cell types. A higher value for this parameter will result in sparser simulated samples.
This parameter controls the minimum number of cell types that will be absent in each simulated sample. If NULL (by default), this value will be half of the total number of different cell types, but increasing it will result in more samples composed of fewer cell types. This helps to create sparser proportions and cover a wider range of situations during model training.
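The interplay between the sparsity probability and the minimum number of absent cell types can be illustrated with a short base R sketch. sparsify is a hypothetical helper that shows the idea only; how the package actually zeroes out cell types, and how it rounds "half" for an odd number of cell types, are assumptions here.

```r
# Hypothetical helper, not part of digitalDLSorteR: with probability
# prob.sparsity, sets at least min.zero.prop entries of a proportion
# vector to zero and rescales the rest so that they add up to 100.
sparsify <- function(props, prob.sparsity = 0.5, min.zero.prop = NULL) {
  k <- length(props)
  # default from the description above: half of the cell types
  # (rounded down here, which is an assumption)
  if (is.null(min.zero.prop)) min.zero.prop <- floor(k / 2)
  stopifnot(k >= 2, min.zero.prop >= 0, min.zero.prop <= k - 1)
  if (runif(1) < prob.sparsity) {
    # number of absent cell types: at least min.zero.prop, at most k - 1
    candidates <- seq(min.zero.prop, k - 1)
    n.zero <- candidates[sample.int(length(candidates), 1)]
    props[sample.int(k, n.zero)] <- 0
    props <- 100 * props / sum(props)
  }
  props
}

set.seed(7)
p <- c(10, 20, 30, 40)
sparse <- sparsify(p, prob.sparsity = 1) # always made sparse
dense <- sparsify(p, prob.sparsity = 0)  # never made sparse
```

Raising min.zero.prop in this sketch narrows the candidates towards samples made of very few cell types, which matches the behavior described above.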
Boolean indicating whether the training and test cells will be split in a balanced way considering the cell types (FALSE by default).
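The difference between a plain and a balanced split can be sketched as follows. split_cells is a hypothetical helper, and "balanced" is interpreted here as a stratified split that applies the same training fraction within each cell type; the package's own definition may differ.

```r
# Hypothetical helper, not part of digitalDLSorteR: splits cell indices
# into training and test subsets, optionally stratified by cell type.
split_cells <- function(cell.types, train.freq = 3/4, balanced = FALSE) {
  idx <- seq_along(cell.types)
  # draw floor(n * train.freq) elements from an index vector
  pick <- function(i) {
    i[sample.int(length(i), size = floor(length(i) * train.freq))]
  }
  train <- if (balanced) {
    # assumption: same training fraction within every cell type
    unlist(lapply(split(idx, cell.types), pick), use.names = FALSE)
  } else {
    pick(idx)
  }
  list(train = sort(train), test = setdiff(idx, train))
}

set.seed(42)
types <- rep(c("CellType1", "CellType2"), times = c(12, 8))
s <- split_cells(types, balanced = TRUE)
table(types[s$train])
```

With the stratified variant, rare cell types are guaranteed a share of the training set, whereas a plain random split can leave them under-represented by chance.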
Show informative messages during the execution (TRUE by default).
A DigitalDLSorter object with prob.cell.types slot containing a list with two ProbMatrixCellTypes objects (training and test). For more information about the structure of this class, see ?ProbMatrixCellTypes.
First, the available single-cell profiles are split into training and test
subsets (3/4 for training and 1/4 for test by default, see train.freq.cells)
to avoid biasing the results during model evaluation. Next, num.bulk.samples
bulk sample proportions are built and the single-cell profiles used to
simulate each pseudo-bulk RNA-Seq sample are set, with 100 cells per bulk
sample by default (see the n.cells argument). The proportions of training and
test pseudo-bulk samples are set by train.freq.bulk (3/4 for training and 1/4
for testing by default). Finally, in order to avoid biases due to the
composition of the pseudo-bulk RNA-Seq samples, cell type proportions
(\(w_1,...,w_k\), where \(k\) is the number of cell types available in
single-cell profiles) are randomly generated by using six different
approaches:
Cell proportions are randomly sampled from a truncated uniform distribution with predefined limits according to a priori knowledge of the abundance of each cell type (see prob.design argument). This information can be inferred from the single-cell experiment itself or from the literature.
A second set is generated by randomly permuting cell type labels from a distribution generated by the previous method.
Cell proportions are randomly sampled as in method 1, but without replacement.
Cell type labels are randomly sampled using the proportions generated by the previous method.
Cell proportions are randomly sampled from a Dirichlet distribution.
Pseudo-bulk RNA-Seq samples composed of a single cell type are generated in order to provide 'pure' pseudo-bulk samples.
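Three of the ideas listed above (range-based sampling, Dirichlet sampling and pure samples) can be sketched in base R as follows. All function names are hypothetical and none of this is the package's implementation: the range-based sampler simply rescales uniform draws to 100, which can move values slightly outside the requested ranges, the Dirichlet concentration of 1 per cell type is an assumption, and the label-permutation variants are not covered.

```r
set.seed(123)

# Range-based idea: uniform draws inside [from, to] per cell type,
# rescaled so that each row adds up to 100 (a simplification).
sample_ranges <- function(n, from, to) {
  k <- length(from)
  u <- matrix(
    runif(n * k, min = rep(from, each = n), max = rep(to, each = n)),
    nrow = n, ncol = k
  )
  100 * u / rowSums(u)
}

# Dirichlet idea: normalized gamma draws are a standard way to sample
# from a Dirichlet distribution; alpha = 1 per cell type is an assumption.
sample_dirichlet <- function(n, alpha) {
  k <- length(alpha)
  g <- matrix(
    rgamma(n * k, shape = rep(alpha, each = n)),
    nrow = n, ncol = k
  )
  100 * g / rowSums(g)
}

# Pure samples: one randomly chosen cell type at 100, all others at 0.
sample_pure <- function(n, k) {
  m <- matrix(0, nrow = n, ncol = k)
  m[cbind(seq_len(n), sample.int(k, n, replace = TRUE))] <- 100
  m
}

ranged <- sample_ranges(5, from = c(1, 30, 20), to = c(15, 70, 50))
dirich <- sample_dirichlet(5, alpha = rep(1, 3))
pure <- sample_pure(5, k = 3)
```

Each function returns a matrix with one row per simulated sample and one column per cell type, with rows expressed as percentages.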
If you want to inspect the distribution of cell type proportions generated by
each method during the process, they can be visualized with the showProbPlot
function (see Documentation).
Torroja, C. and Sánchez-Cabo, F. (2019). digitalDLSorter: A Deep Learning algorithm to quantify immune cell populations based on scRNA-Seq data. Frontiers in Genetics 10, 978. doi:10.3389/fgene.2019.00978
set.seed(123) # reproducibility
# simulated data
sce <- SingleCellExperiment::SingleCellExperiment(
assays = list(
counts = matrix(
rpois(30, lambda = 5), nrow = 15, ncol = 10,
dimnames = list(paste0("Gene", seq(15)), paste0("RHC", seq(10)))
)
),
colData = data.frame(
Cell_ID = paste0("RHC", seq(10)),
Cell_Type = sample(x = paste0("CellType", seq(2)), size = 10,
replace = TRUE)
),
rowData = data.frame(
Gene_ID = paste0("Gene", seq(15))
)
)
DDLS <- createDDLSobject(
sc.data = sce,
sc.cell.ID.column = "Cell_ID",
sc.gene.ID.column = "Gene_ID",
sc.filt.genes.cluster = FALSE,
sc.log.FC = FALSE
)
#> === Bulk RNA-seq data not provided
#> === Processing single-cell data
#> - Filtering features:
#> - Selected features: 15
#> - Discarded features: 0
#>
#> === No mitochondrial genes were found by using ^mt- as regrex
#>
#> === Final number of dimensions for further analyses: 15
probMatrixValid <- data.frame(
Cell_Type = paste0("CellType", seq(2)),
from = c(1, 30),
to = c(15, 70)
)
DDLS <- generateBulkCellMatrix(
object = DDLS,
cell.ID.column = "Cell_ID",
cell.type.column = "Cell_Type",
prob.design = probMatrixValid,
num.bulk.samples = 10,
verbose = TRUE
)
#>
#> === The number of bulk RNA-Seq samples that will be generated is equal to 10
#>
#> === Training set cells by type:
#> - CellType1: 4
#> - CellType2: 3
#> === Test set cells by type:
#> - CellType1: 2
#> - CellType2: 1
#> === Probability matrix for training data:
#> - Bulk RNA-Seq samples: 8
#> - Cell types: 2
#> === Probability matrix for test data:
#> - Bulk RNA-Seq samples: 2
#> - Cell types: 2
#> DONE