Skip to contents

In this vignette, we will analyze a spatial transcriptomics dataset (10x Visium) comprising three slides from murine lymph nodes, two of which obtained after a 48-hour infection with Mycobacterium smegmatis (Lopez et al. (2022)). As a reference, we will use the paired single-cell RNA-seq (10x Chromium) data from the same study. The raw data is publicly available on GEO (GSE173778), but for ease of use, we have made it available through the SpatialDDLSdata R data package.

Loading data

Firstly, let’s load the required packages and data:

library("SpatialDDLS")
library("SingleCellExperiment")
library("SpatialExperiment")
library("ggplot2")
library("ggpubr")

## in case it is not installed
if (!requireNamespace("SpatialDDLSdata", quietly = TRUE)) {
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  devtools::install_github("diegommcc/SpatialDDLSdata")
}
library("SpatialDDLSdata")
# SingleCellExperiment with scRNA-seq
data(MouseDLN.SCE) 
# SpatialExperiment with spatial transcriptomics data
data(MouseDLN.ST)

Let’s explore the spatial transcriptomics data contained in the MouseDLN.ST object:

cbind(spatialCoords(MouseDLN.ST), colData(MouseDLN.ST)) %>% as.data.frame() %>% 
  ggplot(aes(X0, X1, color = lymph_node)) + 
  geom_point() + ggtitle("Mouse lymph nodes by condition") + 
  theme_classic() + coord_fixed()

With regard to the single-cell RNA-seq data, preprocessing and visualization could be performed, but such analysis is outside the scope of this tutorial.

Loading data into a SpatialDDLS object

Now, we need to create a SpatialDDLS object, which will serve as the central core for all subsequent steps. We suggest including both the spatial and single-cell transcriptomics data to enable filtering and selection of only those genes that are present in both data types for further analyses. Additionally, we recommend filtering genes based on their expression levels in order to reduce the number of dimensions and consider only meaningful genes. Please refer to the documentation to review the implemented strategies (specially sc.n.genes.per.cluster and sc.min.mean.counts parameters).

mouseDLN.SDDLS <- createSpatialDDLSobject(
  sc.data = MouseDLN.SCE, 
  sc.cell.ID.column = "CellID", 
  sc.gene.ID.column = "GeneSymbol",
  sc.cell.type.column = "broad_cell_types",
  st.data = MouseDLN.ST,
  sc.min.counts = 1, 
  sc.min.cells = 1,
  sc.filt.genes.cluster = TRUE, 
  sc.n.genes.per.cluster = 150,
  sc.min.mean.counts = 2,
  st.spot.ID.column = "CellID",
  st.gene.ID.column = "GeneSymbol"
)
## === 1 SpatialExperiment objects provided
##    === Processing spatial transcriptomics data
##       - Filtering features:
##          - Selected features: 13948
##          - Discarded features: 0
## 
## === Processing single-cell data
##       - Removing 16 genes without expression in any cell
## 'as(<dgCMatrix>, "dgTMatrix")' is deprecated.
## Use 'as(., "TsparseMatrix")' instead.
## See help("Deprecated") and help("Matrix-deprecated").
##       - Filtering features:
##          - Selected features: 12350
##          - Discarded features: 488
## 
## === Number of shared genes between single-cell and spatial transcriptomics datasets: 11592
##     - Original # genes in single-cell data: 12350
##     - Original # genes in ST data (object with the greatest # genes): 13948
## 
## === Number of removed mitochondrial genes: 12
## 
## === Number of genes after filtering based on logFC: 1058
## 
## === Final number of dimensions for further analyses: 1058

We can show some basic information about the object:

mouseDLN.SDDLS
## An object of class SpatialDDLS 
## Real single-cell profiles:
##   1058 features and 14989 cells
##   rownames: Tcrg-C2 Pax5 C1qc ... C1qc Rpl37a Prdx5 Rgs2 
##   colnames: TCAATTCCACTACAGT-1-2 AAGACTCAGTTCGCAT-1-1 GTCTGTCAGAGCCGAT-1-1 ... GTCTGTCAGAGCCGAT-1-1 AACAACCGTGCTCCGA-1-1 CTGCTCATCTGAATCG-1-3 GAGTTGTTCACACGAT-1-2 
## Spatial experiments:
##   1 experiments
##   1058 features and 1092 spots
##   rownames: Dntt Rps3a1 Cacnb3 ... Cacnb3 Ccr1 Rps27l Etv5 
##   colnames: ACATCAGCTGGGACGC-1-0 GAGATCTGCTTGGCAT-1-1 ATAAACGGACCCGTAA-1-1 ... ATAAACGGACCCGTAA-1-1 ACCTAATCGACTTCCT-1-0 GTTCGCCATAAGTGCC-1-1 AGGCTATGGTTAGCTT-1-0 
## Project: SpatialDDLS-Proj

In this case, we are only working on 1 spatial transcriptomics dataset, but an arbitrary number of SpatialExperiment objects can be loaded.

Simulation of mixed transcriptional profiles

Now, we are going to simulate cell composition matrices that will serve to simulate mixed transcriptional profiles with known cell proportions. This is done by the genMixedCellProp function in which we can control different aspects, such as the number of mixed transcriptional profiles to be generated or the number of cells used to simulate each mixed profile. These parameters must be decided depending on the size of the single-cell reference and the available computational resources. For this example, and as standard reference, we will use num.sim.spots = 10000 and n.cells = 50. The cell type composition of these mixed profiles will be generated by three methods:

  • A random sampling of a Dirichlet distribution. Within this set of samples, in order to make these proportions more sparse, the prob.sparity parameter controls the probability of having missing cell types in each simulated spot, as opposed to a mixture of all cell types.
  • Pure mixed transcriptional profiles composed of n.cells cells of the same cell type aggregated.
  • Transcriptional profiles in which a minimum number of missing cell types will be imposed. This is controlled by the min.zero.prop argument.

The relative abundance of samples generated by these criteria can be controlled by the proportion.method parameter. Finally, the genMixedCellProp function will automatically divide the reference cell profiles contained in the single.cell.real slot into training and test subsets and randomly assign n.cells cells to generate every mixed transcriptional profile.

mouseDLN.SDDLS <- genMixedCellProp(
  mouseDLN.SDDLS,
  cell.ID.column = "CellID",
  cell.type.column = "broad_cell_types",
  num.sim.spots = 10000,
  n.cells = 50,
  train.freq.cells = 2/3,
  train.freq.spots = 2/3,
  proportion.method = c(0, 0, 1),
  prob.sparity = 1, 
  min.zero.prop = 5,
  balanced.type.cells = TRUE
)
## 
## === The number of mixed profiles that will be generated is equal to 10000
## 
## === Training set cells by type:
##     - B cells: 5573
##     - CD4 T cells: 1362
##     - CD8 T cells: 2179
##     - cDC1s: 67
##     - cDC2s: 58
##     - GD T cells: 59
##     - Macrophages: 70
##     - Migratory DCs: 199
##     - Monocytes: 53
##     - NK cells: 62
##     - pDCs: 52
##     - Tregs: 260
## === Test set cells by type:
##     - B cells: 2786
##     - CD4 T cells: 681
##     - CD8 T cells: 1089
##     - cDC1s: 33
##     - cDC2s: 29
##     - GD T cells: 29
##     - Macrophages: 35
##     - Migratory DCs: 100
##     - Monocytes: 26
##     - NK cells: 31
##     - pDCs: 26
##     - Tregs: 130
## === Probability matrix for training data:
##     - Mixed spots: 6667
##     - Cell types: 12
## === Probability matrix for test data:
##     - Mixed spots: 3333
##     - Cell types: 12
## DONE

Then, we can call the simMixedProfiles function, which will generate the actual mixed transcriptional profiles using the cell composition matrices generated in the previous step. This step may take a while depending on the number of transcriptional profiles to be simulated, so be patient! In addition, users can choose the method by which the mixed profiles will be generated. We recommend summing up raw counts, and then normalizing samples to obtain logCPMs (mixing.function = "AddRawCount"), but other methods are available (see Documentation).

mouseDLN.SDDLS <- simMixedProfiles(mouseDLN.SDDLS, threads = 3)
## === Setting parallel environment to 3 thread(s)
## 
## === Generating train mixed profiles:
## 
## === Generating test mixed profiles:
## 
## DONE

Training a fully-connected neural network using mixed transcriptional profiles

Having generated a set of mixed transcriptional profiles with known cell composition, we can then train a neural network using the training subset and evaluate the model by predicting cell type proportions on the test subset. Once trained, the model can deconvolute the cellular composition of new transcriptional profiles, such as spots in a spatial transcriptomics experiment. The architecture of the network is fully customizable, although in our experience, the hyperparameters used in this example work for most of the cases. Particularly, we will employ a model with two hidden layers, each consisting of 200 neurons, and a training process involving 60 epochs.

mouseDLN.SDDLS <- trainDeconvModel(
  mouseDLN.SDDLS,
  num.epochs = 60,
  num.hidden.layers = 2, 
  num.units = c(200, 200),
  verbose = FALSE
) 
## 
  1/105 [..............................] - ETA: 12s - loss: 0.0916 - accuracy: 0.8438 - mean_absolute_error: 0.0191 - categorical_accuracy: 0.8438
 83/105 [======================>.......] - ETA: 0s - loss: 0.0800 - accuracy: 0.8942 - mean_absolute_error: 0.0165 - categorical_accuracy: 0.8942 
105/105 [==============================] - 0s 606us/step - loss: 0.0796 - accuracy: 0.8926 - mean_absolute_error: 0.0165 - categorical_accuracy: 0.8926

Some metrics of the training progress can be shown by setting verbose = TRUE. Anyhow, more advanced metrics can be calculated using the calculateEvalMetrics function. This function computes mean absolute error (MAE) and mean squared error (MSE) metrics per cell type, providing insight into the model’s performance for each cell type. These metrics can be visualized using various functions:

mouseDLN.SDDLS <- calculateEvalMetrics(mouseDLN.SDDLS)
distErrorPlot(
  mouseDLN.SDDLS,
  error = "AbsErr",
  x.by = "CellType",
  color.by = "CellType", 
  error.labels = FALSE, 
  type = "boxplot",
  size.point = 0.5
)

distErrorPlot(
  mouseDLN.SDDLS,
  x.by = "pBin",
  error = "AbsErr",
  facet.by = "CellType",
  color.by = "CellType", 
  error.label = TRUE,
  type = "boxplot"
)

corrExpPredPlot(
  mouseDLN.SDDLS,
  color.by = "CellType",
  facet.by = "CellType",
  corr = "both", 
  size.point = 0.5
)
## `geom_smooth()` using formula = 'y ~ x'

As demonstrated, the overall performance is satisfactory, indicating that the model is capable of comprehending the distinctive features of each cell type to provide precise predictions of the cell type composition of transcriptional profiles.

Deconvolution of the spatial transcriptomics dataset

Finally, we can use our trained model to deconvolute the signals of each spot using the deconvSpatialDDLS function. By default, this function uses the trained model to predict cell proportions of two sets of transcriptional profiles obtained from the ST datasets:

  • ‘Intrinsic’ profiles: these are the actual transcriptional profiles of every spot in the ST dataset.
  • ‘Extrinsic’ profiles: these are simulated profiles generated from the surrounding spots of every spot. The concept is to create a mirrored set of transcriptional profiles that represent the transcriptional features of the spatial context of each spot.

The latter can be used to understand how similar each spot is to its neighbors. Considering the hypothesis that we can infer the cellular composition of each spot based on its surroundings given the correlation between spatial location and cell composition/transcriptional features, we can use this information to spatially contextualize our predictions and improve their accuracy. We refer to this process as spatial regularization. Details about the methodology are explained in the Documentation and Mañanes et al. (2023).

mouseDLN.SDDLS <- deconvSpatialDDLS(
  mouseDLN.SDDLS, k.spots = 6, fast.pca = TRUE
)
##    No 'index.st' provided. Deconvoluting all SpatialExperiment objects contained in the `spatial.experiments` slot
## === Normalizing data (LogCPM)
## === Predicting cell type proportions
## 
 1/35 [..............................] - ETA: 0s
35/35 [==============================] - 0s 600us/step
## 
## === Calculating distances in PCA space
## 
## === Calculating 50 PCs
## === Calculating alpha factors based on distances
## DONE

Now, let’s project these predicted proportions in the spatial coordinates:

plotSpatialPropAll(mouseDLN.SDDLS, index.st = 1)

To reveal hidden patterns in the coordinates caused by using the same color scale, we can utilize the plotSpatialProp function to independently plot each cell type:

list.plots <- lapply(
  X = trained.model(mouseDLN.SDDLS) %>% cell.types(), 
  FUN = \(x) {
    plotSpatialProp(
        mouseDLN.SDDLS, index.st = 1, cell.type = x, size.point = 1,
        colors = "blues"
      ) + coord_fixed()
  }
)
ggarrange(plotlist = list.plots[1:4], align = "hv")

ggarrange(plotlist = list.plots[5:8], align = "hv")

ggarrange(plotlist = list.plots[9:12], align = "hv")

In addition to the ‘regularized’ cell proportions, we can plot the predictions calculated for the intrinsic and extrinsic transcriptional profiles. For instance, let’s plot those predicted from the extrinsic transcriptional profiles:

list.plots <- lapply(
  X = trained.model(mouseDLN.SDDLS) %>% cell.types(), 
  FUN = \(x) {
    plotSpatialProp(
        mouseDLN.SDDLS, index.st = 1, cell.type = x, size.point = 1,
        colors = "blues", prediction = "Extrinsic"
      ) + coord_fixed()
  }
)
ggarrange(plotlist = list.plots[1:4], align = "hv")

ggarrange(plotlist = list.plots[5:8], align = "hv")

ggarrange(plotlist = list.plots[9:12], align = "hv")

As one might expect, this is a smoothed version of the final predictions. It is also possible to visualize distances between the extrinsic and intrinsic transcriptional profiles of each spot to understand how the regularization step works by using the plotDistances function:

plotDistances(mouseDLN.SDDLS, index.st = 1, size.point = 1.5) + coord_fixed()

Those spots with distances less than the mean distance were regularized according to their nearest neighbor spots.

Interpreting the model

In order to make predictions more transparent, SpatialDDLS includes an additional module designed to provide insights into the model’s decision-making process. It relies on calculating the predicted classes/loss function gradients with respect to the input variables, a method popularly known as Vanilla Gradient. These numeric values are computed for each gene and cell type using the pure mixed transcriptional profiles previously simulated. Therefore, they can be interpreted as the extent to which each feature is contributing to the model’s predictions. While these values are initially calculated at the sample/gene level, they can be aggregated at the cell type level in order to assess the relevance of different genes for cell type proportion predictions. These steps are performed through the interGradientDL function:

mouseDLN.SDDLS <- interGradientsDL(
  mouseDLN.SDDLS, scaling = "standardize", method = "class"
)

Importantly, depending on the method parameter, positive and negative gradients must be differently interpreted:

  • If gradients with respect to the input variables were calculated using the loss function (method = "loss"), genes with negative gradients (those that minimize the loss function) will be positively correlated with the presence of each cell type.
  • Conversely, if gradients with respect to the input variables were calculated using classes (method = "class"), genes with positive gradients (those that make the probability of being a cell type higher) will be positively associated with each cell type.

It is important to note that these markers should not be interpreted as cell type markers. Rather, they serve as indications to help interpret the model’s performance. In addition, due to the multivariate nature of our approach, gradients are surrogates at the feature level for predictions made considering all input variables collectively, and thus caution should be exercised in drawing direct conclusions about specific gene-cell type relationships.

For this example, let’s calculate gradients of the class function with respect to the input features and show the top 5 genes with the greatest gradients per cell type:

top.gradients <- topGradientsCellType(
  mouseDLN.SDDLS, method = "class", top.n.genes = 5
)
sapply(
  top.gradients, \(x) x$Positive
) %>% as.data.frame()
##   B cells CD4 T cells CD8 T cells    cDC1s   cDC2s GD T cells   Macrophages
## 1    Snx2      Igfbp4       Cd8b1   Fam20c    Xist        Il4          Svbp
## 2  Vpreb3       Satb1        Cd8a   Slc1a3 Gm26917      Ramp1 2200002D01Rik
## 3   Cd79b       Ly6c1       Actn2 Hist1h3c Trav3-3      Itgb3         Timd4
## 4    Ska3        Lef1       Itgae   Sapcd2  Slc1a3      Ikzf3          Fth1
## 5  Slc1a3       Rps29        Acp5    Cox7b    Ryr2       Ugcg 1300017J02Rik
##   Migratory DCs Monocytes NK cells    pDCs    Tregs
## 1          Xist    Map4k4    Cdc20    Ryr2    Foxp3
## 2          Melk      Ryr2     Ryr2    Xist Ifi27l2a
## 3         Ly6c1      Pfn1   Tmsb10  Tagln2   Samhd1
## 4        Slc1a3   Cd300ld    Nuggc    Wnk1    Itgb1
## 5          Cd28     Pde1c     Jak1 Zfp36l1    Gpr83

As can be seen, among the top 5 genes some canonical markers for different cell types appear, such as Cd72 for B cells, Cd8a for CD8 T cells, or Foxp3 for Tregs. These are just the top 5 genes, so considering a higher number can provide a more comprehensive understanding of the genes utilizaed by the model.

We also provide the plotHeatmapGradsAgg function for visualizing the top N mean gradients per cell type. This plot highlights genes with high gradients across different cell types, reflecting the multivariate nature of neural networks. It is advisable to examine both the most positive and negative gradients, as these genes significantly contribute to the network’s predictions.

hh <- plotHeatmapGradsAgg(mouseDLN.SDDLS, top.n.genes = 4, method = "class")
hh$Positive

hh$Negative

Finally, we can use the plotSpatialGeneExpr to visualize the spatial distribution of the top N genes per cell type in the ST dataset. Let’s plot genes for some cell types just for demonstration purposes:

top.genes <- topGradientsCellType(mouseDLN.SDDLS, top.n.genes = 4)
for (i in c("B cells", "CD4 T cells", "CD8 T cells", "Tregs", "Monocytes")) {
  list.plots <- list()
  for (j in top.genes[[i]][["Absolute"]]) {
    list.plots[[j]] <- plotSpatialGeneExpr(
      mouseDLN.SDDLS, index.st = 1, gene = j, size.point = 0.5,
      title = paste0(i, " - ", j)
    ) + coord_fixed() + theme(legend.position = "none") ## legend removed just for viz
  }
  print(ggpubr::ggarrange(plotlist = list.plots, align = "hv"))
}

Clustering analysis

The SpatialDDLS R package also includes some functions to cluster the ST dataset according to the predicted cell composition of each spot. This functionality enables to dissect the ST datasets into distinct cellular niches, information that might be relevant for further analyses.

mouseDLN.SDDLS <- spatialPropClustering(mouseDLN.SDDLS, k.nn = 20)
##    No 'index.st' provided. Deconvoluting all SpatialExperiment objects contained in the `spatial.experiments` slot
## === Selected graph-based clustering
## === Running clustering for slide 1
##    No 'index.st' provided. Using first ST dataset
## === Plotting first clustering configuration Clustering.graph.k.20

Comparing deconvoluted cell proportions with colocalization of cell markers

Finally, we are going to assess whether there is a collocation between the predicted cell type proportions and the expression of known classic markers for each cell type. This analysis aims to validate the model’s predictions by comparing them with well-established cellular markers, but it does not mean to be a quantitative validation of the model.

customMarkers <- list(
  "B cells" = c("Cd74", "Cd19", "Cd79a", "Cd79b", "Ly6d"),
  "CD4 T cells" = c("Cd4", "Lef1", "Fyb"),
  "CD8 T cells" = c("Cd8b1", "Cd8a", "Trac"),
  cDC1s = c("Xcr1", "Irf8"),
  cDC2s = c("Irf4", "Cd4"),
  "GD T cells" = c("Il7r", "Id2"),
  Macrophages = c("Lyz2", "Lyz1", "Cd86", "Ly6c1"),
  "Migratory DCs" = c("Ccl5", "Anxa3", "Fscn1"),
  Monocytes = c("Fcer1g", "Cst3", "Lst1", "Itgam", "Kit", "Fcgr3"),
  "NK cells" = c("Nkg7", "Il2rb", "Gzma"),
  pDCs = c("Siglech", "Plac8", "Ly6c2", "Vtsb", "Zeb2", "Siglech"),
  Tregs = c("Ikzf2", "Il2ra", "Foxp3")
) %>% lapply(FUN = function(x) x[x %in% rownames(MouseDLN.ST)])
## calculate z-scores
exprST <- MouseDLN.ST@assays@data[[1]]
logCPM <- edgeR::cpm(exprST, log = TRUE)
meanZscoresCustom <- purrr::map(
  .x = names(customMarkers), 
  .f = ~{ colMeans(t(scale(t(logCPM[customMarkers[[.x]], , drop = FALSE])))) }
) %>% do.call(cbind, .) 
colnames(meanZscoresCustom) <- names(customMarkers)
color.z.scores <- rev(
  colorRampPalette(RColorBrewer::brewer.pal(n = 10, name = "RdBu"))(20)
)
st.coor <- SpatialExperiment::spatialCoords(
  spatial.experiments(object = mouseDLN.SDDLS, index.st = 1)
)
colnames(st.coor) <- paste("Spatial", 1:2)
dfPlotLong <- reshape2::melt(
  as.data.frame(cbind(st.coor, meanZscoresCustom)), 
  id.vars = c("Spatial 1", "Spatial 2"), 
  variable.name = "CellType", value.name = "Zscore"
)
dfPlotLong %>% ggplot(
  aes(x = .data[["Spatial 1"]], y = .data[["Spatial 2"]], color = Zscore)
) + geom_point(size = 0.5) + theme_classic()  + 
  ggtitle("Mean z-score of cell type markers") + 
  scale_color_gradientn(colors = color.z.scores, limit = c(-2, 2)) + 
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.title.x = element_blank(), axis.text.x = element_blank(),
    axis.ticks.x = element_blank(), axis.title.y = element_blank(),
    axis.text.y = element_blank(), axis.ticks.y = element_blank()
  ) + coord_fixed() + facet_wrap(~ CellType) 

References

Lopez, R., B. Li, H. Keren-Shaul, P. Boyeau, M. Kedmi, D. Pilzer, A. Jelinski, et al. 2022. DestVI identifies continuums of cell types in spatial transcriptomics data.” Nat Biotechnol 40 (9): 1360–69.
Mañanes, D., I. Rivero-García, D. Jimenez-Carretero, M. Torres, D. Sancho, C. Torroja, and F. Sanchez-Cabo. 2023. SpatialDDLS: An R package to deconvolute spatial transcriptomics data using neural networks.” bioRxiv.