Create a DigitalDLSorter object from single-cell RNA-seq data from files (formats allowed: tsv, tsv.gz, mtx (sparse matrix) and hdf5) or a SingleCellExperiment object. The data will be stored in single.cell.real slot. The data provided should consist of three pieces of information:

  • Single-cell counts: genes as rows and cells as columns.

  • Cells metadata: annotations (columns) for each cell (rows).

  • Genes metadata: annotations (columns) for each gene (rows).

If the data is provided from files, single.cell.real argument must be a vector of three elements ordered so that the first file corresponds to the count matrix, the second to the cells metadata and the last to the genes metadata. On the other hand, if the data is provided as a SingleCellExperiment oject, it must contain single-cell counts in the assay slot, cells metadata in the colData slot and genes metadata in the rowData. The data must be provided without any transformation (e.g. log-transformation) and raw counts are preferred.

loadSCProfiles(
  single.cell.data,
  cell.ID.column,
  gene.ID.column,
  name.dataset.h5,
  min.counts = 0,
  min.cells = 0,
  file.backend = NULL,
  name.dataset.backend = NULL,
  compression.level = NULL,
  chunk.dims = NULL,
  block.processing = FALSE,
  verbose = TRUE,
  project = "DigitalDLSorterProject"
)

Arguments

single.cell.data

If data is provided from files, single.cell.real must be a vector of three elements: single-cell counts, cells metadata and genes metadata. If data is provided from a SingleCellExperiment object, single-cell counts must be present in the assay slot, cells metadata in the colData slot and genes metadata in the rowData slot.

cell.ID.column

Name or number of the column in the cells metadata corresponding to cell names in expression matrix.

gene.ID.column

Name or number of the column in the genes metadata corresponding to the names used for features/genes.

name.dataset.h5

Name of the data set if HDF5 file is provided.

min.counts

Minimum gene counts to filter (0 by default).

min.cells

Minimum of cells with more than min.counts (0 by default).

file.backend

Valid file path where to store the loaded data as HDF5 file. If provided, data is stored in HDF5 files as back-end using DelayedArray and HDF5Array packages instead of being loaded into RAM. This is suitable for situations where you have large amounts of data that cannot be stored in memory. Note that operations on these data will be performed by blocks (i.e subsets of determined size), which may result in longer execution times. NULL by default.

name.dataset.backend

Name of the dataset of the HDF5 file to be used. Note that it cannot exist. If NULL (by default), a random dataset name will be used.

compression.level

The compression level used if file.backend is provided. It is an integer value between 0 (no compression) and 9 (highest and slowest compression). See ?getHDF5DumpCompressionLevel from the HDF5Array package for more information.

chunk.dims

Specifies dimensions that HDF5 chunk will have. If NULL, the default value is a vector of two items: the number of genes considered by DigitalDLSorter object during the simulation, and only one sample in order to increase read times in the following steps. A larger number of columns written in each chunk may lead to longer read times.

block.processing

Boolean indicating whether data should be treated as blocks (only if data is provided as HDF5 file). FALSE by default. Note that using this functionality is suitable for cases where is not possible to load the data into RAM and therefore execution times will be longer.

verbose

Show informative messages during the execution (TRUE by default).

project

Name of the project for DigitalDLSorter object.

Value

A DigitalDLSorter object with the single-cell RNA-seq data provided loaded into the single.cell.real slot as a SingleCellExperiment object.

Details

This data can be used to simulate new single-cell profiles using the ZINB-WaVE framework with the estimateZinbwaveParams function. In this way, it is possible to increase the signal of cell types that are underrepresented in the original dataset. If this step is not necessary, these profiles will be used directly to simulate pseudo-bulk RNA-seq samples with known cell composition.

Examples

set.seed(123) # reproducibility
sce <- SingleCellExperiment::SingleCellExperiment(
  assays = list(
    counts = matrix(
      rpois(100, lambda = 5), nrow = 40, ncol = 30,
      dimnames = list(paste0("Gene", seq(40)), paste0("RHC", seq(30)))
    )
  ),
  colData = data.frame(
    Cell_ID = paste0("RHC", seq(30)),
    Cell_Type = sample(x = paste0("CellType", seq(4)), size = 30,
                       replace = TRUE)
  ),
  rowData = data.frame(
    Gene_ID = paste0("Gene", seq(40))
  )
)
DDLS <- loadSCProfiles(
  single.cell.data = sce,
  cell.ID.column = "Cell_ID",
  gene.ID.column = "Gene_ID",
  min.cells = 0,
  min.counts = 0,
  project = "Simul_example"
)