This function creates a DigitalDLSorter object from single-cell RNA-seq data (sc.data parameter, a SingleCellExperiment object or files) and, optionally, the bulk RNA-seq data to be deconvoluted (bulk.data parameter, a SummarizedExperiment object).

createDDLSobject(
  sc.data,
  sc.cell.ID.column,
  sc.gene.ID.column,
  sc.cell.type.column,
  bulk.data,
  bulk.sample.ID.column,
  bulk.gene.ID.column,
  bulk.name.data = "Bulk.DT",
  filter.mt.genes = "^mt-",
  sc.filt.genes.cluster = TRUE,
  sc.min.mean.counts = 1,
  sc.n.genes.per.cluster = 300,
  top.n.genes = 2000,
  sc.log.FC = TRUE,
  sc.log.FC.cutoff = 0.5,
  sc.min.counts = 1,
  sc.min.cells = 1,
  bulk.min.counts = 1,
  bulk.min.samples = 1,
  shared.genes = TRUE,
  sc.name.dataset.h5 = NULL,
  sc.file.backend = NULL,
  sc.name.dataset.backend = NULL,
  sc.compression.level = NULL,
  sc.chunk.dims = NULL,
  sc.block.processing = FALSE,
  verbose = TRUE,
  project = "DigitalDLSorter-Project"
)

Arguments

sc.data

Single-cell RNA-seq profiles to be used as reference. If data are provided from files, sc.data must be a vector of three elements: single-cell counts, cells metadata and genes metadata. If data are provided as a SingleCellExperiment object, single-cell counts must be present in the assay slot, cells metadata in the colData slot, and genes metadata in the rowData slot.

sc.cell.ID.column

Name or number of the column in cells metadata corresponding to cell names in expression matrix (single-cell RNA-seq data).

sc.gene.ID.column

Name or number of the column in genes metadata corresponding to the names used for features/genes (single-cell RNA-seq data).

sc.cell.type.column

Name or column number corresponding to cell types in cells metadata.

bulk.data

Bulk transcriptomics data to be deconvoluted. It must be a SummarizedExperiment object (or a list of them; see the Details section).

bulk.sample.ID.column

Name or column number corresponding to sample IDs in samples metadata (bulk transcriptomics data).

bulk.gene.ID.column

Name or number of the column in the genes metadata corresponding to the names used for features/genes (bulk transcriptomics data).

bulk.name.data

Name of the bulk RNA-seq dataset ("Bulk.DT" by default).

filter.mt.genes

Regular expression matching mitochondrial genes to be ruled out (^mt- by default). If NULL, no filtering is performed.
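As a quick check of what the default pattern selects, the base R sketch below applies it to a few made-up gene symbols. The gene names are hypothetical, and the use of grepl with case-sensitive matching is an assumption made for illustration, not necessarily how createDDLSobject applies the pattern internally.

```r
# Hypothetical gene symbols: mouse-style (mt-) and human-style (MT-) names
genes <- c("mt-Co1", "mt-Nd1", "MT-CO1", "Actb", "Smt-like")

# "^mt-" is anchored at the start of the name; grepl() is case-sensitive
# by default, so "MT-CO1" and "Smt-like" are not matched
is.mt <- grepl(pattern = "^mt-", x = genes)
genes[is.mt]

# A pattern covering both naming styles (depends on your own annotation)
is.mt.any <- grepl(pattern = "^(mt|MT)-", x = genes)
genes[is.mt.any]
```

Under case-sensitive matching, the default pattern may select nothing for gene symbols that start with MT-, so it is worth testing the pattern against your own gene IDs before relying on it.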

sc.filt.genes.cluster

Whether to filter single-cell RNA-seq genes according to a minimum threshold of non-zero average counts per cell type (sc.min.mean.counts). TRUE by default.

sc.min.mean.counts

Minimum non-zero average counts per cluster to filter genes. 1 by default.

sc.n.genes.per.cluster

Top n genes with the highest logFC per cluster (300 by default). See the Details section.

top.n.genes

Maximum number of genes used for downstream steps (2000 by default). If the number of genes remaining after filtering is greater than top.n.genes, only the top.n.genes most variable genes across the whole single-cell dataset are kept.

sc.log.FC

Whether to filter out genes with a logFC less than sc.log.FC.cutoff when sc.filt.genes.cluster = TRUE (TRUE by default).

sc.log.FC.cutoff

LogFC cutoff used if sc.log.FC == TRUE (0.5 by default).

sc.min.counts

Minimum gene counts to filter (1 by default; single-cell RNA-seq data).

sc.min.cells

Minimum number of cells with more than sc.min.counts (1 by default; single-cell RNA-seq data).
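One plausible reading of how sc.min.counts and sc.min.cells combine is sketched below in base R on a toy matrix. The variable names are hypothetical and the exact comparison operators (> versus >=) are an assumption; the package may differ in these details. The bulk.min.counts and bulk.min.samples arguments would follow the same logic with samples in place of cells.

```r
# Toy count matrix: 6 genes (rows) x 5 cells (columns)
counts <- matrix(
  c(0, 0, 0, 0, 0,   # never expressed
    1, 0, 0, 0, 0,   # never above the count cutoff
    3, 0, 0, 0, 0,   # above the cutoff in one cell only
    2, 2, 0, 0, 0,   # above the cutoff in two cells
    5, 4, 3, 0, 0,
    9, 9, 9, 9, 9),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("Gene", 1:6), paste0("Cell", 1:5))
)

min.counts <- 1  # illustrative value for sc.min.counts
min.cells <- 2   # illustrative value for sc.min.cells

# Number of cells in which each gene has more than min.counts counts
n.cells.above <- rowSums(counts > min.counts)

# Keep genes exceeding the cutoff in at least min.cells cells
keep <- n.cells.above >= min.cells
rownames(counts)[keep]
```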

bulk.min.counts

Minimum gene counts to filter (1 by default; bulk transcriptomics data).

bulk.min.samples

Minimum number of samples with more than bulk.min.counts (1 by default; bulk transcriptomics data).

shared.genes

If set to TRUE, only genes present in both the single-cell and bulk transcriptomics data will be retained for further processing (TRUE by default).
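Conceptually, this amounts to intersecting the two sets of gene identifiers. The base R sketch below shows the idea on made-up IDs; it is an illustration only, not the package's internal code.

```r
# Hypothetical gene IDs found in each dataset
sc.genes <- c("Gene1", "Gene2", "Gene3", "Gene4")
bulk.genes <- c("Gene3", "Gene4", "Gene5")

# Genes present in both datasets, in the order of the first argument
shared <- intersect(sc.genes, bulk.genes)
shared

# A small overlap often points to inconsistent identifiers
# (for example, gene symbols in one dataset and Ensembl IDs in the other)
length(shared) / length(sc.genes)
```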

sc.name.dataset.h5

Name of the dataset if an HDF5 file is provided for single-cell RNA-seq data (NULL by default).

sc.file.backend

Valid file path where the loaded single-cell RNA-seq data will be stored as an HDF5 file. If provided, data are stored in an HDF5 file used as back-end through the DelayedArray and HDF5Array packages instead of being loaded into RAM. This is suitable when the data are too large to be held in memory. Note that operations on these data will be performed by blocks (i.e., subsets of a determined size), which may result in longer execution times. NULL by default.

sc.name.dataset.backend

Name of the HDF5 file dataset to be used. Note that it must not already exist. If NULL (by default), a random dataset name will be generated.

sc.compression.level

The compression level used if sc.file.backend is provided. It is an integer value between 0 (no compression) and 9 (highest and slowest compression). See ?getHDF5DumpCompressionLevel from the HDF5Array package for more information.

sc.chunk.dims

Dimensions of the HDF5 chunks. If NULL, the default value is a vector of two items: the number of genes considered by the DigitalDLSorter object during the simulation, and a single sample, in order to reduce read times in the following steps. Writing a larger number of columns in each chunk may lead to longer read times.

sc.block.processing

Boolean indicating whether single-cell RNA-seq data should be processed by blocks (only if data are provided as an HDF5 file). FALSE by default. This option is intended for cases where the data cannot be loaded into RAM; execution times will be longer.

verbose

Show informative messages during the execution (TRUE by default).

project

Name of the project for the DigitalDLSorter object ("DigitalDLSorter-Project" by default).

Value

A DigitalDLSorter object with the single-cell RNA-seq data provided loaded into the single.cell.real slot as a SingleCellExperiment object. If bulk transcriptomics data are provided, they will be stored in the deconv.data slot.

Details

Filtering genes

In order to reduce the number of dimensions used for subsequent steps, createDDLSobject implements different strategies aimed at removing genes that are not useful for deconvolution:

  • Filtering at the cell level: genes that do not exceed a given count cutoff in at least N cells (or samples, for bulk data) are removed. See sc.min.cells/bulk.min.samples and sc.min.counts/bulk.min.counts parameters.

  • Filtering at the cluster level (only for scRNA-seq data): if sc.filt.genes.cluster == TRUE, createDDLSobject sets a cutoff of non-zero average counts per cluster (sc.min.mean.counts parameter) and takes only the sc.n.genes.per.cluster genes with the highest logFC per cluster. LogFCs are calculated using the normalized logCPM of each cluster with respect to the average in the whole dataset. Finally, if the number of remaining genes is greater than top.n.genes, genes are ranked based on variance and the top.n.genes most variable genes are used for downstream analyses.
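The cluster-level strategy described above can be sketched in base R as follows. This is an illustration on simulated data under stated assumptions, not the package's implementation: the variable names are hypothetical, log2(CPM + 1) is assumed as the normalization, a gene is assumed to pass the mean-count cutoff if it does so in at least one cluster, variance is computed on the logCPM values, and the optional logFC cutoff (sc.log.FC.cutoff) is left out for brevity.

```r
set.seed(123)
# Toy data: 50 genes x 60 cells split into 3 hypothetical clusters
counts <- matrix(
  rpois(50 * 60, lambda = 5), nrow = 50, ncol = 60,
  dimnames = list(paste0("Gene", seq(50)), paste0("Cell", seq(60)))
)
clusters <- rep(paste0("CellType", seq(3)), each = 20)
cells.by.cluster <- split(seq_along(clusters), clusters)

# Illustrative settings, scaled down for the toy data
min.mean.counts <- 1
n.genes.per.cluster <- 10
top.n <- 15

# 1) Normalize each cell to log2(CPM + 1)
logcpm <- log2(t(t(counts) / colSums(counts)) * 1e6 + 1)

# 2) Mean of non-zero counts per gene and cluster; keep genes that reach
#    the cutoff in at least one cluster
mean.nonzero <- sapply(cells.by.cluster, function(idx) {
  sub <- counts[, idx, drop = FALSE]
  n.nonzero <- rowSums(sub > 0)
  ifelse(n.nonzero > 0, rowSums(sub) / n.nonzero, 0)
})
pass <- rowSums(mean.nonzero >= min.mean.counts) > 0

# 3) logFC: mean logCPM in each cluster minus mean logCPM over all cells
cluster.means <- sapply(cells.by.cluster, function(idx) {
  rowMeans(logcpm[, idx, drop = FALSE])
})
logfc <- cluster.means - rowMeans(logcpm)

# 4) Top genes per cluster by logFC among the genes that passed step 2
top.per.cluster <- lapply(colnames(logfc), function(cl) {
  fc <- setNames(logfc[pass, cl], rownames(logfc)[pass])
  head(names(sort(fc, decreasing = TRUE)), n.genes.per.cluster)
})
selected <- unique(unlist(top.per.cluster))

# 5) If too many genes remain, keep only the most variable ones
if (length(selected) > top.n) {
  gene.var <- apply(logcpm[selected, , drop = FALSE], 1, var)
  selected <- head(names(sort(gene.var, decreasing = TRUE)), top.n)
}
length(selected)
```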

Single-cell RNA-seq data

Single-cell RNA-seq data can be provided from files (formats allowed: tsv, tsv.gz, mtx (sparse matrix) and hdf5) or a SingleCellExperiment object. The data provided should consist of three pieces of information:

  • Single-cell counts: genes as rows and cells as columns.

  • Cells metadata: annotations (columns) for each cell (rows).

  • Genes metadata: annotations (columns) for each gene (rows).

If the data are provided from files, the sc.data argument must be a vector of three elements ordered so that the first file corresponds to the count matrix, the second to the cells metadata and the last to the genes metadata. If the data are provided as a SingleCellExperiment object, it must contain single-cell counts in the assay slot, cells metadata in the colData slot and genes metadata in the rowData slot. The data must be provided without any transformation (e.g. log-transformation), and raw counts are preferred.
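A simple way to spot accidentally transformed data before building the object is to check that the matrix contains only non-negative whole numbers. The helper below is hypothetical (it is not part of the package) and is shown on toy matrices in base R.

```r
# Toy matrices: raw counts and a log-transformed version of them
raw <- matrix(
  c(0, 3, 10, 1, 0, 7), nrow = 2,
  dimnames = list(c("Gene1", "Gene2"), paste0("Cell", 1:3))
)
logged <- log2(raw + 1)

# Hypothetical helper: TRUE if all values are non-negative whole numbers
looks.like.raw.counts <- function(x) {
  all(x >= 0) && all(x == round(x))
}

looks.like.raw.counts(raw)
looks.like.raw.counts(logged)
```

For a SingleCellExperiment object, the same check could be applied to the matrix stored in its counts assay. Note that it only rejects matrices with fractional values, so it cannot detect every kind of transformation.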

Bulk transcriptomics data

It must be a SummarizedExperiment object (or a list of them if samples from different experiments are going to be deconvoluted) containing the same information as the single-cell RNA-seq data: the count matrix, samples metadata (a column with sample IDs is enough), and genes metadata. Please make sure that the gene identifiers used in the bulk and single-cell transcriptomics data are consistent.
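A minimal sketch of passing bulk data together with the single-cell reference is given below, using simulated counts. It assumes that createDDLSobject is provided by the digitalDLSorteR package and that the SingleCellExperiment and SummarizedExperiment packages are installed; the column names (Cell_ID, Gene_ID, Sample_ID) and sample names are made up, and the call has not been checked against a specific package version.

```r
set.seed(123)
genes <- paste0("Gene", seq(40))

# Simulated single-cell reference: 40 genes x 30 cells
sce <- SingleCellExperiment::SingleCellExperiment(
  assays = list(
    counts = matrix(
      rpois(40 * 30, lambda = 5), nrow = 40, ncol = 30,
      dimnames = list(genes, paste0("RHC", seq(30)))
    )
  ),
  colData = data.frame(
    Cell_ID = paste0("RHC", seq(30)),
    Cell_Type = sample(paste0("CellType", seq(4)), size = 30, replace = TRUE)
  ),
  rowData = data.frame(Gene_ID = genes)
)

# Simulated bulk data: same gene IDs, 10 samples, with minimal metadata
se <- SummarizedExperiment::SummarizedExperiment(
  assays = list(
    counts = matrix(
      rpois(40 * 10, lambda = 50), nrow = 40, ncol = 10,
      dimnames = list(genes, paste0("Bulk", seq(10)))
    )
  ),
  colData = data.frame(Sample_ID = paste0("Bulk", seq(10))),
  rowData = data.frame(Gene_ID = genes)
)

# Package name is an assumption; all argument names come from the Usage section
DDLS <- digitalDLSorteR::createDDLSobject(
  sc.data = sce,
  sc.cell.ID.column = "Cell_ID",
  sc.gene.ID.column = "Gene_ID",
  bulk.data = se,
  bulk.sample.ID.column = "Sample_ID",
  bulk.gene.ID.column = "Gene_ID",
  sc.min.cells = 0,
  sc.min.counts = 0,
  sc.log.FC = FALSE,
  sc.filt.genes.cluster = FALSE,
  project = "Simul_example"
)
```

Both objects use the same Gene1 to Gene40 identifiers, so the default shared.genes = TRUE should not discard any gene because of mismatched IDs.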

Examples

set.seed(123) # reproducibility
sce <- SingleCellExperiment::SingleCellExperiment(
  assays = list(
    counts = matrix(
      rpois(40 * 30, lambda = 5), nrow = 40, ncol = 30,
      dimnames = list(paste0("Gene", seq(40)), paste0("RHC", seq(30)))
    )
  ),
  colData = data.frame(
    Cell_ID = paste0("RHC", seq(30)),
    Cell_Type = sample(x = paste0("CellType", seq(4)), size = 30,
                       replace = TRUE)
  ),
  rowData = data.frame(
    Gene_ID = paste0("Gene", seq(40))
  )
)
DDLS <- createDDLSobject(
  sc.data = sce,
  sc.cell.ID.column = "Cell_ID",
  sc.gene.ID.column = "Gene_ID",
  sc.min.cells = 0,
  sc.min.counts = 0,
  sc.log.FC = FALSE,
  sc.filt.genes.cluster = FALSE,
  project = "Simul_example"
)
#> === Bulk RNA-seq data not provided
#> === Processing single-cell data
#>       - Filtering features:
#>          - Selected features: 40
#>          - Discarded features: 0
#> 
#> === No mitochondrial genes were found by using ^mt- as regrex
#> 
#> === Final number of dimensions for further analyses: 40