Create a DigitalDLSorter object from single-cell RNA-seq and bulk RNA-seq data
Source: R/loadData.R
This function creates a DigitalDLSorter object from single-cell RNA-seq data (a SingleCellExperiment object) and bulk RNA-seq data to be deconvoluted (bulk.data parameter), provided as a SummarizedExperiment object.
createDDLSobject(
  sc.data,
  sc.cell.ID.column,
  sc.gene.ID.column,
  sc.cell.type.column,
  bulk.data,
  bulk.sample.ID.column,
  bulk.gene.ID.column,
  bulk.name.data = "Bulk.DT",
  filter.mt.genes = "^mt-",
  sc.filt.genes.cluster = TRUE,
  sc.min.mean.counts = 1,
  sc.n.genes.per.cluster = 300,
  top.n.genes = 2000,
  sc.log.FC = TRUE,
  sc.log.FC.cutoff = 0.5,
  sc.min.counts = 1,
  sc.min.cells = 1,
  bulk.min.counts = 1,
  bulk.min.samples = 1,
  shared.genes = TRUE,
  sc.name.dataset.h5 = NULL,
  sc.file.backend = NULL,
  sc.name.dataset.backend = NULL,
  sc.compression.level = NULL,
  sc.chunk.dims = NULL,
  sc.block.processing = FALSE,
  verbose = TRUE,
  project = "DigitalDLSorter-Project"
)
sc.data
Single-cell RNA-seq profiles to be used as reference. If data are provided from files, sc.data must be a vector of three elements: single-cell counts, cells metadata and genes metadata. If data are provided as a SingleCellExperiment object, single-cell counts must be present in the assay slot, cells metadata in the colData slot, and genes metadata in the rowData slot.
sc.cell.ID.column
Name or number of the column in the cells metadata corresponding to cell names in the expression matrix (single-cell RNA-seq data).
sc.gene.ID.column
Name or number of the column in the genes metadata corresponding to the names used for features/genes (single-cell RNA-seq data).
sc.cell.type.column
Name or number of the column in the cells metadata corresponding to cell types.
bulk.data
Bulk transcriptomics data to be deconvoluted. It must be a SummarizedExperiment object.
bulk.sample.ID.column
Name or number of the column in the samples metadata corresponding to sample IDs (bulk transcriptomics data).
bulk.gene.ID.column
Name or number of the column in the genes metadata corresponding to the names used for features/genes (bulk transcriptomics data).
bulk.name.data
Name of the bulk RNA-seq dataset ("Bulk.DT" by default).
filter.mt.genes
Regular expression matching mitochondrial genes to be ruled out (^mt- by default). If NULL, no filtering is performed.
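As a minimal sketch of how such a pattern selects genes, the base R snippet below applies the default regular expression to a few hypothetical gene symbols. It assumes a plain, case-sensitive match as done by grepl(); whether createDDLSobject matches in exactly this way is an assumption, so it is worth checking the pattern against your own gene identifiers (for instance, human symbols usually start with "MT-" rather than "mt-").

```r
# Hypothetical gene symbols: mouse-style ("mt-") and human-style ("MT-") names
genes <- c("mt-Nd1", "mt-Co1", "Actb", "MT-ND1", "Mtor")

# Default pattern with a case-sensitive match (assumed behaviour)
is.mt.mouse <- grepl(pattern = "^mt-", x = genes)
genes[!is.mt.mouse]
#> [1] "Actb"   "MT-ND1" "Mtor"

# A pattern covering both naming conventions
is.mt.any <- grepl(pattern = "^(mt|MT)-", x = genes)
kept.genes <- genes[!is.mt.any]
kept.genes
#> [1] "Actb" "Mtor"
```

With the default pattern, the human-style symbol is not removed, which is why a custom value for filter.mt.genes may be needed for non-mouse data.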
sc.filt.genes.cluster
Whether to filter single-cell RNA-seq genes according to a minimum threshold of non-zero average counts per cell type (sc.min.mean.counts). TRUE by default.
sc.min.mean.counts
Minimum non-zero average counts per cluster to filter genes. 1 by default.
sc.n.genes.per.cluster
Number of genes with the highest logFC to keep per cluster (300 by default). See the Details section.
top.n.genes
Maximum number of genes used for downstream steps (2000 by default). If the number of genes after filtering is greater than top.n.genes, the genes are selected according to their variability across the whole single-cell dataset.
sc.log.FC
Whether to filter out genes with a logFC below sc.log.FC.cutoff (0.5 by default) when sc.filt.genes.cluster = TRUE.
sc.log.FC.cutoff
LogFC cutoff used if sc.log.FC == TRUE.
sc.min.counts
Minimum gene counts to filter (1 by default; single-cell RNA-seq data).
sc.min.cells
Minimum number of cells with more than sc.min.counts (1 by default; single-cell RNA-seq data).
bulk.min.counts
Minimum gene counts to filter (1 by default; bulk transcriptomics data).
bulk.min.samples
Minimum number of samples with more than bulk.min.counts (1 by default; bulk transcriptomics data).
shared.genes
If set to TRUE, only genes present in both the single-cell and bulk transcriptomics data will be retained for further processing (TRUE by default).
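The effect of this option can be pictured as a simple intersection of gene identifiers. The following base R sketch uses made-up gene names and is only an illustration of the idea, not the package's internal code.

```r
# Hypothetical gene identifiers in each dataset
sc.genes <- paste0("Gene", 1:40)
bulk.genes <- paste0("Gene", 31:60)

# Genes present in both datasets
shared <- intersect(sc.genes, bulk.genes)
length(shared)
#> [1] 10
```

Only the shared identifiers would be carried forward, so inconsistent naming between the two datasets (for example symbols in one and Ensembl IDs in the other) would leave few or no genes.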
sc.name.dataset.h5
Name of the dataset if an HDF5 file is provided for single-cell RNA-seq data.
sc.file.backend
Valid file path where the loaded single-cell RNA-seq data will be stored as an HDF5 file. If provided, data are stored in an HDF5 file as back-end using the DelayedArray and HDF5Array packages instead of being loaded into RAM. This is suitable for situations where you have large amounts of data that cannot be stored in memory. Note that operations on these data will be performed by blocks (i.e., subsets of determined size), which may result in longer execution times. NULL by default.
sc.name.dataset.backend
Name of the HDF5 file dataset to be used. Note that it must not already exist in the file. If NULL (by default), a random dataset name will be generated.
sc.compression.level
The compression level used if sc.file.backend is provided. It is an integer value between 0 (no compression) and 9 (highest and slowest compression). See ?getHDF5DumpCompressionLevel from the HDF5Array package for more information.
sc.chunk.dims
Dimensions of the HDF5 chunks. If NULL, the default value is a vector of two items: the number of genes considered by the DigitalDLSorter object during the simulation, and a single sample, in order to reduce read times in the following steps. Writing a larger number of columns in each chunk may lead to longer read times.
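To illustrate what a "genes by one sample" chunk layout looks like, the sketch below writes a toy count matrix with HDF5Array::writeHDF5Array(), using one chunk per column. The matrix, the dataset name "counts" and the compression level are assumptions made for the example; how createDDLSobject writes its own back-end file internally is not shown here.

```r
# Toy count matrix: 40 genes x 30 cells
counts <- matrix(rpois(40 * 30, lambda = 5), nrow = 40, ncol = 30)

# One chunk per cell: all genes, a single column
chunk.dims <- c(nrow(counts), 1L)

if (requireNamespace("HDF5Array", quietly = TRUE)) {
  h5.path <- tempfile(fileext = ".h5")
  counts.h5 <- HDF5Array::writeHDF5Array(
    x = counts,
    filepath = h5.path,
    name = "counts",       # hypothetical dataset name
    chunkdim = chunk.dims,
    level = 6L             # compression level between 0 and 9
  )
  # Reading one cell only needs to touch a single chunk
  one.cell <- as.matrix(counts.h5[, 1, drop = FALSE])
}
```

With this layout, reading a single profile requires decompressing only one chunk, which is the rationale given above for the default value.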
sc.block.processing
Boolean indicating whether single-cell RNA-seq data should be processed in blocks (only if data are provided as an HDF5 file). FALSE by default. This option is intended for cases where the data cannot be loaded into RAM; note that execution times will be longer.
verbose
Show informative messages during the execution (TRUE by default).
project
Name of the project for the DigitalDLSorter object.
A DigitalDLSorter object with the provided single-cell RNA-seq data loaded into the single.cell.real slot as a SingleCellExperiment object. If bulk transcriptomics data are provided, they are stored in the deconv.data slot.
Filtering genes
In order to reduce the number of dimensions used for subsequent steps, createDDLSobject implements different strategies aimed at removing genes that are uninformative for deconvolution:
Filtering at the cell level: genes that do not exceed a given count cutoff in at least N cells (or samples) are removed. See the sc.min.cells/bulk.min.samples and sc.min.counts/bulk.min.counts parameters.
Filtering at the cluster level (only for scRNA-seq data): if sc.filt.genes.cluster == TRUE, createDDLSobject sets a cutoff of non-zero average counts per cluster (sc.min.mean.counts parameter) and takes only the sc.n.genes.per.cluster genes with the highest logFC per cluster. LogFCs are calculated using the normalized logCPM of each cluster with respect to the average in the whole dataset.
Finally, if the number of remaining genes is greater than top.n.genes, genes are ranked based on variance and the top.n.genes most variable genes are used for downstream analyses.
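The strategies above can be sketched in base R on simulated data. This is an illustrative approximation under stated assumptions, not the package's implementation: the simulated counts, the "more than" comparison, the log2(CPM + 1) transformation, the use of plain mean counts per cluster as the cluster-level cutoff, and the small parameter values are all choices made for the example.

```r
set.seed(123)
# Simulated data: 200 genes x 60 cells, 3 cell types with 10 marker genes each
clusters <- rep(paste0("CellType", seq(3)), each = 20)
lambda <- matrix(2, nrow = 200, ncol = 60)
for (k in seq(3)) {
  lambda[((k - 1) * 10 + 1):(k * 10), clusters == paste0("CellType", k)] <- 10
}
counts <- matrix(
  rpois(length(lambda), lambda = lambda), nrow = 200, ncol = 60,
  dimnames = list(paste0("Gene", seq(200)), paste0("Cell", seq(60)))
)

# 1) Cell level: keep genes with more than min.counts in at least min.cells cells
min.counts <- 1
min.cells <- 5
counts <- counts[rowSums(counts > min.counts) >= min.cells, , drop = FALSE]

# 2) Cluster level: logCPM of each cluster versus the whole-dataset average
log.cpm <- log2(t(t(counts) / colSums(counts)) * 1e6 + 1)
cells.by.cluster <- split(seq_len(ncol(counts)), clusters)
mean.counts <- sapply(
  cells.by.cluster, function(idx) rowMeans(counts[, idx, drop = FALSE])
)
cluster.means <- sapply(
  cells.by.cluster, function(idx) rowMeans(log.cpm[, idx, drop = FALSE])
)
log.fc <- cluster.means - rowMeans(log.cpm)

min.mean.counts <- 1
log.fc.cutoff <- 0.5
n.genes.per.cluster <- 10
selected <- unique(unlist(lapply(colnames(log.fc), function(cl) {
  ok <- mean.counts[, cl] >= min.mean.counts & log.fc[, cl] >= log.fc.cutoff
  candidates <- sort(log.fc[ok, cl], decreasing = TRUE)
  names(head(candidates, n.genes.per.cluster))
})))

# 3) If too many genes remain, keep the most variable ones
top.n.genes <- 25
if (length(selected) > top.n.genes) {
  gene.var <- apply(log.cpm[selected, , drop = FALSE], 1, var)
  selected <- names(sort(gene.var, decreasing = TRUE))[seq_len(top.n.genes)]
}
filtered.counts <- counts[selected, , drop = FALSE]
```

In this toy setting the final matrix has at most top.n.genes rows; with the function defaults (300 genes per cluster and 2000 genes overall) the same logic would operate on far larger gene sets.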
Single-cell RNA-seq data
Single-cell RNA-seq data can be provided from files (formats allowed: tsv, tsv.gz, mtx (sparse matrix) and hdf5) or as a SingleCellExperiment object. The data provided should consist of three pieces of information:
Single-cell counts: genes as rows and cells as columns.
Cells metadata: annotations (columns) for each cell (rows).
Genes metadata: annotations (columns) for each gene (rows).
If the data are provided from files, the sc.data argument must be a vector of three elements ordered so that the first file corresponds to the count matrix, the second to the cells metadata and the last to the genes metadata. If the data are provided as a SingleCellExperiment object, it must contain single-cell counts in the assay slot, cells metadata in the colData slot and genes metadata in the rowData slot. The data must be provided without any transformation (e.g. log-transformation), and raw counts are preferred.
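A quick sanity check before building the object can help catch transformed input. The helper below, is.raw.counts(), is a hypothetical function written for this example (it is not part of the package) and only handles ordinary dense matrices.

```r
# Hypothetical helper: TRUE if a dense matrix looks like raw counts
# (numeric, no missing values, non-negative, integer-valued)
is.raw.counts <- function(m) {
  is.numeric(m) && !anyNA(m) && all(m >= 0) && all(m == round(m))
}

raw <- matrix(c(0, 3, 12, 5, 7, 0), nrow = 2)
is.raw.counts(raw)
#> [1] TRUE
is.raw.counts(log2(raw + 1))
#> [1] FALSE
```

A FALSE result does not necessarily mean the data are unusable, since the text above only states that raw counts are preferred, but it is a hint that some normalization or transformation has already been applied.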
Bulk transcriptomics data
It must be a SummarizedExperiment object (or a list of them if samples from different experiments are going to be deconvoluted) containing the same information as the single-cell RNA-seq data: the count matrix, samples metadata (sample IDs are enough), and genes metadata. Please make sure the gene identifiers used in the bulk and single-cell transcriptomics data are consistent.
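As a minimal sketch of the expected input, the code below builds a SummarizedExperiment from simulated bulk counts, following the same pattern as the single-cell example in this page. The sample names, the column names Sample_ID and Gene_ID, and the simulated values are assumptions made for illustration.

```r
set.seed(123)
# Simulated bulk counts: 40 genes x 10 samples
bulk.counts <- matrix(
  rpois(40 * 10, lambda = 50), nrow = 40, ncol = 10,
  dimnames = list(paste0("Gene", seq(40)), paste0("Bulk", seq(10)))
)
# Samples metadata (IDs only) and genes metadata
samples.metadata <- data.frame(Sample_ID = colnames(bulk.counts))
genes.metadata <- data.frame(Gene_ID = rownames(bulk.counts))

if (requireNamespace("SummarizedExperiment", quietly = TRUE)) {
  se <- SummarizedExperiment::SummarizedExperiment(
    assays = list(counts = bulk.counts),
    colData = samples.metadata,
    rowData = genes.metadata
  )
  # The object would then be passed through the bulk arguments shown in Usage:
  #   bulk.data = se,
  #   bulk.sample.ID.column = "Sample_ID",
  #   bulk.gene.ID.column = "Gene_ID"
}
```

The gene identifiers here deliberately match the "Gene1" to "Gene40" names used in the single-cell example, so that both datasets share the same features.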
set.seed(123) # reproducibility
# simulated single-cell data: 40 genes x 30 cells
sce <- SingleCellExperiment::SingleCellExperiment(
  assays = list(
    counts = matrix(
      rpois(40 * 30, lambda = 5), nrow = 40, ncol = 30,
      dimnames = list(paste0("Gene", seq(40)), paste0("RHC", seq(30)))
    )
  ),
  colData = data.frame(
    Cell_ID = paste0("RHC", seq(30)),
    Cell_Type = sample(x = paste0("CellType", seq(4)), size = 30,
                       replace = TRUE)
  ),
  rowData = data.frame(
    Gene_ID = paste0("Gene", seq(40))
  )
)
# no bulk data provided; gene filtering is relaxed for this toy dataset
DDLS <- createDDLSobject(
  sc.data = sce,
  sc.cell.ID.column = "Cell_ID",
  sc.gene.ID.column = "Gene_ID",
  sc.min.cells = 0,
  sc.min.counts = 0,
  sc.log.FC = FALSE,
  sc.filt.genes.cluster = FALSE,
  project = "Simul_example"
)
#> === Bulk RNA-seq data not provided
#> === Processing single-cell data
#> - Filtering features:
#> - Selected features: 40
#> - Discarded features: 0
#>
#> === No mitochondrial genes were found by using ^mt- as regrex
#>
#> === Final number of dimensions for further analyses: 40