Estimate the parameters of the ZINB-WaVE model using a real single-cell RNA-Seq data set as reference to simulate new single-cell profiles and increase the signal of underrepresented cell types. This step is optional, only is needed if the size of you dataset is too small or there are underrepresented cell types in order to train the Deep Neural Network model in a more balanced way. After this step, the simSCProfiles function will use the estimated parameters to simulate new single-cell profiles. See ?simSCProfiles for more information.

estimateZinbwaveParams(
  object,
  cell.type.column,
  cell.ID.column,
  gene.ID.column,
  cell.cov.columns,
  gene.cov.columns,
  subset.cells = NULL,
  proportional = TRUE,
  set.type = "All",
  threads = 1,
  verbose = TRUE
)

Arguments

object

DigitalDLSorter object with a single.cell.real slot.

cell.type.column

Name or column number corresponding to the cell type of each cell in cells metadata.

cell.ID.column

Name or column number corresponding to the cell names of expression matrix in cells metadata.

gene.ID.column

Name or column number corresponding to the notation used for features/genes in genes metadata.

cell.cov.columns

Name or column number(s) in cells metadata to be used as covariates during model fitting (if no covariates are used, set to empty or NULL).

gene.cov.columns

Name or column number(s) in genes metadata that will be used as covariates during model fitting (if no covariates are used, set to empty or NULL).

subset.cells

Number of cells to fit the ZINB-WaVE model. Useful when the original data set is too large to fit the model. Set a value according to the original data set and the resources available on your computer. If NULL (by default), all cells will be used. Must be an integer greater than or equal to the number of cell types considered and less than or equal to the total number of cells.

proportional

If TRUE, the original cell type proportions in the subset of cells generated by subset.cells will not be altered as far as possible. If FALSE, all cell types will have the same number of cells as far as possible (TRUE by default).

set.type

Cell type(s) to evaluate ('All' by default). It is recommended fitting the model to all cell types rather than using only a subset of them to capture the total variability present in the original experiment even if only a subset of cell types is simulated.

threads

Number of threads used for estimation (1 by default). To set up the parallel environment, the BiocParallel package must be installed.

verbose

Show informative messages during the execution (TRUE by default).

Value

A DigitalDLSorter object with zinb.params slot containing a ZinbParametersModel object. This object contains a slot with the estimated ZINB-WaVE parameters from the real single-cell RNA-Se`q data.

Details

ZINB-WaVE is a flexible model for zero-inflated count data. This function carries out the model fit to real single-cell data modeling \(Y_{ij}\) (the count of feature \(j\) for sample \(i\)) as a random variable following a zero-inflated negative binomial (ZINB) distribution. The estimated parameters will be used for the simulation of new single-cell expression profiles by sampling a negative binomial distribution and inserting dropouts from a binomial distribution. To do so, digitalDLSorteR uses the zinbFit function from the zinbwave package (Risso et al., 2018). For more details about the model, see Risso et al., 2018.

References

Risso, D., Perraudeau, F., Gribkova, S. et al. (2018). A general and flexible method for signal extraction from single-cell RNA-seq data. Nat Commun 9, 284. doi: doi:10.1038/s41467-017-02554-5 .

Torroja, C. and Sánchez-Cabo, F. (2019). digitalDLSorter: A Deep Learning algorithm to quantify immune cell populations based on scRNA-Seq data. Frontiers in Genetics 10, 978. doi: doi:10.3389/fgene.2019.00978 .

See also

Examples

set.seed(123) # reproducibility
sce <- SingleCellExperiment::SingleCellExperiment(
  assays = list(
    counts = matrix(
      rpois(30, lambda = 5), nrow = 15, ncol = 10,
      dimnames = list(paste0("Gene", seq(15)), paste0("RHC", seq(10)))
    )
  ),
  colData = data.frame(
    Cell_ID = paste0("RHC", seq(10)),
    Cell_Type = sample(x = paste0("CellType", seq(2)), size = 10,
                       replace = TRUE)
  ),
  rowData = data.frame(
    Gene_ID = paste0("Gene", seq(15))
  )
)
DDLS <- createDDLSobject(
  sc.data = sce,
  sc.cell.ID.column = "Cell_ID",
  sc.gene.ID.column = "Gene_ID",
  sc.filt.genes.cluster = FALSE, 
  sc.log.FC = FALSE
)
#> === Bulk RNA-seq data not provided
#> === Processing single-cell data
#> 'as(<dgCMatrix>, "dgTMatrix")' is deprecated.
#> Use 'as(., "TsparseMatrix")' instead.
#> See help("Deprecated") and help("Matrix-deprecated").
#>       - Filtering features:
#>          - Selected features: 15
#>          - Discarded features: 0
#> 
#> === No mitochondrial genes were found by using ^mt- as regrex
#> 
#> === Final number of dimensions for further analyses: 15
DDLS <- estimateZinbwaveParams(
  object = DDLS,
  cell.type.column = "Cell_Type",
  cell.ID.column = "Cell_ID",
  gene.ID.column = "Gene_ID",
  subset.cells = 2,
  verbose = TRUE
)
#> === Setting parallel environment to 1 thread(s)
#> === Estimating parameters for all cell types in the experiment
#> === Creating cell model matrix based on Cell_Type columns:
#> 	~Cell_Type
#> === Number of cells for each cell type:
#>     - CellType1: 1
#>     - CellType2: 1
#> === Creating gene model matrix without gene covariates
#> === Running estimation process (Start time 16:47:28) 
#> === Removing genes without expression in any cell
#> >>> Fitting ZINB-WaVE model
#> Create model:
#> ok
#> Initialize parameters:
#> ok
#> Optimize parameters:
#> Iteration 1
#> penalized log-likelihood = -83.1203291378805
#> After dispersion optimization = -56.1335588623031
#>    user  system elapsed 
#>    0.03    0.00    0.03 
#> After right optimization = -54.6057641481274
#> After orthogonalization = -54.6057641481274
#>    user  system elapsed 
#>   0.015   0.000   0.014 
#> After left optimization = -54.6055360468269
#> After orthogonalization = -54.6055360468269
#> Iteration 2
#> penalized log-likelihood = -54.6055360468269
#> After dispersion optimization = -54.6055360468269
#>    user  system elapsed 
#>   0.023   0.000   0.022 
#> After right optimization = -54.6055279954711
#> After orthogonalization = -54.6055279954711
#>    user  system elapsed 
#>    0.01    0.00    0.01 
#> After left optimization = -54.6055197876521
#> After orthogonalization = -54.6055197876521
#> Iteration 3
#> penalized log-likelihood = -54.6055197876521
#> ok
#> 
#> DONE
#> 
#> Invested time: 1.13