Train deconvolution model for spatial transcriptomics data

Train a deep neural network model using training data from the SpatialDDLS object. This model will be used to deconvolute spatial transcriptomics data from the same biological context as the single-cell RNA-seq data used to train it. In addition, the trained model is evaluated using test data, and prediction results are obtained to determine its performance (see ?calculateEvalMetrics).

Usage

trainDeconvModel(
  object,
  type.data.train = "mixed",
  type.data.test = "mixed",
  batch.size = 64,
  num.epochs = 60,
  num.hidden.layers = 2,
  num.units = c(200, 200),
  activation.fun = "relu",
  dropout.rate = 0.25,
  loss = "kullback_leibler_divergence",
  metrics = c("accuracy", "mean_absolute_error", "categorical_accuracy"),
  normalize = TRUE,
  scaling = "standardize",
  norm.batch.layers = TRUE,
  custom.model = NULL,
  shuffle = TRUE,
  sc.downsampling = NULL,
  use.generator = FALSE,
  on.the.fly = FALSE,
  agg.function = "AddRawCount",
  threads = 1,
  view.metrics.plot = TRUE,
  verbose = TRUE
)

Arguments

object

SpatialDDLS object with single.cell.real/single.cell.simul, prob.cell.types, and mixed.profiles slots (the last only if on.the.fly = FALSE).

type.data.train

Type of profiles to be used for training. It can be 'both', 'single-cell' or 'mixed' ('mixed' by default).

type.data.test

Type of profiles to be used for evaluation. It can be 'both', 'single-cell' or 'mixed' ('mixed' by default).

batch.size

Number of samples per gradient update (64 by default).

num.epochs

Number of epochs to train the model (60 by default).

num.hidden.layers

Number of hidden layers of the neural network (2 by default). This number must be equal to the length of num.units argument.

num.units

Vector indicating the number of neurons per hidden layer (c(200, 200) by default). The length of this vector must be equal to the num.hidden.layers argument.

activation.fun

Activation function ('relu' by default). See the keras documentation to know available activation functions.

dropout.rate

Float between 0 and 1 indicating the fraction of input neurons to be dropped in layer dropouts (0.25 by default). By default, SpatialDDLS implements 1 dropout layer per hidden layer.

loss

Character indicating loss function selected for model training ('kullback_leibler_divergence' by default). See the keras documentation to know available loss functions.

metrics

Vector of metrics used to assess model performance during training and evaluation (c("accuracy", "mean_absolute_error", "categorical_accuracy") by default). See the keras documentation to know available performance metrics.

normalize

Whether to normalize data using logCPM (TRUE by default). This parameter is only considered when the method used to simulate mixed transcriptional profiles (simMixedProfiles function) was "AddRawCount". Otherwise, data were already normalized.

scaling

How to scale data before training. It can be: "standardize" (values are centered around the mean with a unit standard deviation), "rescale" (values are shifted and rescaled so that they end up ranging between 0 and 1) or "none" (no scaling is performed). "standardize" by default.

norm.batch.layers

Whether to include batch normalization layers between each hidden dense layer (TRUE by default).

custom.model

It allows to use a custom neural network architecture. It must be a keras.engine.sequential.Sequential object in which the number of input neurons is equal to the number of considered features/genes, and the number of output neurons is equal to the number of cell types considered (NULL by default). If provided, the arguments related to the neural network architecture will be ignored.

shuffle

Boolean indicating whether data will be shuffled (TRUE by default).

sc.downsampling

It is only used if type.data.train is equal to 'both' or 'single-cell'. It allows to set a maximum number of single-cell profiles of a specific cell type for training to avoid an unbalanced representation of classes (NULL by default).

use.generator

Boolean indicating whether to use generators during training and test. Generators are automatically used when on.the.fly = TRUE or HDF5 files are used, but it can be activated by the user on demand (FALSE by default).

on.the.fly

Boolean indicating whether simulated data will be generated 'on the fly' during training (FALSE by default).

agg.function

If on.the.fly == TRUE, function used to build mixed transcriptional profiles. It may be:

"AddRawCount" (by default): single-cell profiles (raw counts) are added up across cells. Then, log-CPMs are calculated.
"MeanCPM": single-cell profiles (raw counts) are transformed into logCPM and cross-cell averages are calculated.
"AddCPM": single-cell profiles (raw counts) are transformed into CPMs and are added up across cells. Then, log-CPMs are calculated.

threads

Number of threads used during simulation of mixed transcriptional profiles if on.the.fly = TRUE (1 by default).

view.metrics.plot

Boolean indicating whether to show plots of loss and evaluation metrics during training (TRUE by default). keras for R allows to see model progression during training if you are working in RStudio.

verbose

Boolean indicating whether to display model progression during training and model architecture information (TRUE by default).

Value

A SpatialDDLS object with trained.model

slot containing a DeconvDLModel object. For more information about the structure of this class, see ?DeconvDLModel.

Details

Simulation of mixed transcriptional profiles 'on the fly'

trainDeconvModel can avoid storing simulated mixed spot profiles by using the on.the.fly argument. This functionality aims at reducing the the simMixedProfiles function's memory usage: simulated profiles are built in each batch during training/evaluation.

Neural network architecture

It is possible to change the model's architecture: number of hidden layers, number of neurons for each hidden layer, dropout rate, activation function, and loss function. For more customized models, it is possible to provide a pre-built model through the custom.model argument (a keras.engine.sequential.Sequential object) where it is necessary that the number of input neurons is equal to the number of considered features/genes, and the number of output neurons is equal to the number of considered cell types.

Examples

# \donttest{
set.seed(123)
sce <- SingleCellExperiment::SingleCellExperiment(
  assays = list(
    counts = matrix(
      rpois(30, lambda = 5), nrow = 15, ncol = 10,
      dimnames = list(paste0("Gene", seq(15)), paste0("RHC", seq(10)))
    )
  ),
  colData = data.frame(
    Cell_ID = paste0("RHC", seq(10)),
    Cell_Type = sample(x = paste0("CellType", seq(2)), size = 10,
                       replace = TRUE)
  ),
  rowData = data.frame(
    Gene_ID = paste0("Gene", seq(15))
  )
)
SDDLS <- createSpatialDDLSobject(
  sc.data = sce,
  sc.cell.ID.column = "Cell_ID",
  sc.gene.ID.column = "Gene_ID",
  sc.filt.genes.cluster = FALSE
)
#> === Spatial transcriptomics data not provided
#> === Processing single-cell data
#>       - Filtering features:
#>          - Selected features: 15
#>          - Discarded features: 0
#> 
#> === No mitochondrial genes were found by using ^mt- as regrex
#> 
#> === Final number of dimensions for further analyses: 15
SDDLS <- genMixedCellProp(
  object = SDDLS,
  cell.ID.column = "Cell_ID",
  cell.type.column = "Cell_Type",
  num.sim.spots = 50,
  train.freq.cells = 2/3,
  train.freq.spots = 2/3,
  verbose = TRUE
)
#> 
#> === The number of mixed profiles that will be generated is equal to 50
#> 
#> === Training set cells by type:
#>     - CellType1: 4
#>     - CellType2: 3
#> === Test set cells by type:
#>     - CellType1: 2
#>     - CellType2: 1
#> === Probability matrix for training data:
#>     - Mixed spots: 34
#>     - Cell types: 2
#> === Probability matrix for test data:
#>     - Mixed spots: 16
#>     - Cell types: 2
#> DONE
SDDLS <- simMixedProfiles(SDDLS)
#> === Setting parallel environment to 1 thread(s)
#> 
#> === Generating train mixed profiles:
#> 
#> === Generating test mixed profiles:
#> 
#> DONE
SDDLS <- trainDeconvModel(
  object = SDDLS,
  batch.size = 12,
  num.epochs = 5
)
#> === Training and test from stored data
#>     Using only simulated mixed samples
#>     Using only simulated mixed samples
#> Model: "SpatialDDLS"
#> _____________________________________________________________________
#> Layer (type)                   Output Shape               Param #    
#> =====================================================================
#> Dense1 (Dense)                 (None, 200)                3200       
#> _____________________________________________________________________
#> BatchNormalization1 (BatchNorm (None, 200)                800        
#> _____________________________________________________________________
#> Activation1 (Activation)       (None, 200)                0          
#> _____________________________________________________________________
#> Dropout1 (Dropout)             (None, 200)                0          
#> _____________________________________________________________________
#> Dense2 (Dense)                 (None, 200)                40200      
#> _____________________________________________________________________
#> BatchNormalization2 (BatchNorm (None, 200)                800        
#> _____________________________________________________________________
#> Activation2 (Activation)       (None, 200)                0          
#> _____________________________________________________________________
#> Dropout2 (Dropout)             (None, 200)                0          
#> _____________________________________________________________________
#> Dense3 (Dense)                 (None, 2)                  402        
#> _____________________________________________________________________
#> BatchNormalization3 (BatchNorm (None, 2)                  8          
#> _____________________________________________________________________
#> ActivationSoftmax (Activation) (None, 2)                  0          
#> =====================================================================
#> Total params: 45,410
#> Trainable params: 44,606
#> Non-trainable params: 804
#> _____________________________________________________________________
#> 
#> === Training DNN with 34 samples:
#> 
#> === Evaluating DNN in test data (16 samples)
#>    - loss: NaN
#>    - accuracy: 0.5
#>    - mean_absolute_error: NaN
#>    - categorical_accuracy: 0.5
#> 
#> === Generating prediction results using test data
#> DONE
# }