Accurate cell type annotation is essential but challenging in single-cell RNA sequencing (scRNA-seq). Manual approaches are time-consuming, subjective, and inconsistent, while automated classifiers often fail to generalize across tissues, conditions, or closely related cell types. A key limitation for both is the lack of true ground truth for benchmarking.
scTypeEval provides a
ground-truth-agnostic framework to systematically
assess annotation quality using internal validation metrics. It
enables:
Load the package:
For this vignette, we’ll create a synthetic dataset representing single-cell RNA-seq data from multiple samples with distinct cell types. This simulates a typical scenario where we want to evaluate the consistency of cell type annotations across biological replicates.
set.seed(22)
# Number of genes and cells
n_genes <- 5000
n_cells_per_sample <- 500
n_samples <- 6
cell_types <- c("CD4_T", "CD8_T", "B_cells", "NK_cells")
# Generate gene names
gene_names <- paste0("Gene_", seq_len(n_genes))
# Total cells
total_cells <- n_cells_per_sample * n_samples
# Generate metadata: samples and cell types
samples <- rep(paste0("Sample", seq_len(n_samples)), each = n_cells_per_sample)
# Assign cell types evenly within each sample
n_cell_types <- length(cell_types)
cells_per_type <- n_cells_per_sample %/% n_cell_types
celltypes <- rep(rep(cell_types, each = cells_per_type), times = n_samples)
# Initialize count matrix
count_matrix <- matrix(0, nrow = n_genes, ncol = total_cells)
rownames(count_matrix) <- gene_names
colnames(count_matrix) <- paste0("Cell_", seq_len(total_cells))
# Add cell-type-specific expression patterns
for (i in seq_along(cell_types)) {
ct <- cell_types[i]
ct_indices <- which(celltypes == ct)
# Marker genes for this cell type (50 genes per type)
marker_start <- ((i - 1) * 50 + 1)
marker_end <- min(marker_start + 49, n_genes)
marker_genes <- marker_start:marker_end
# High expression in marker genes (lambda = 50)
count_matrix[marker_genes, ct_indices] <- matrix(
rpois(length(marker_genes) * length(ct_indices), lambda = 50),
nrow = length(marker_genes)
)
# Background expression in other genes (lambda = 5)
other_genes <- setdiff(seq_len(n_genes), marker_genes)
count_matrix[other_genes, ct_indices] <- matrix(
rpois(length(other_genes) * length(ct_indices), lambda = 5),
nrow = length(other_genes)
)
}
# Add sample-specific batch effects
for (s in seq_len(n_samples)) {
sample_cells <- which(samples == paste0("Sample", s))
batch_effect <- rnorm(1, mean = 1, sd = 0.1)
count_matrix[, sample_cells] <- round(
count_matrix[, sample_cells] * batch_effect
)
}
# Convert to sparse matrix
count_matrix <- Matrix(count_matrix, sparse = TRUE)
# Create metadata dataframe
metadata <- data.frame(
celltype = celltypes,
sample = samples,
batch = rep(
c("Batch1", "Batch2"),
each = n_cells_per_sample * (n_samples/2)
),
row.names = colnames(count_matrix),
stringsAsFactors = FALSE
)
# Preview the data
head(metadata)
#> celltype sample batch
#> Cell_1 CD4_T Sample1 Batch1
#> Cell_2 CD4_T Sample1 Batch1
#> Cell_3 CD4_T Sample1 Batch1
#> Cell_4 CD4_T Sample1 Batch1
#> Cell_5 CD4_T Sample1 Batch1
#> Cell_6 CD4_T Sample1 Batch1
dim(count_matrix)
#> [1] 5000 3000scTypeEval objects can be created from:
# Create scTypeEval object from matrix and metadata
sceval <- create_scTypeEval(
matrix = count_matrix,
metadata = metadata
)The scTypeEval object stores the raw count data and
metadata, which will be processed in subsequent steps.
Process and normalize the data, creating both single-cell and pseudobulk representations. This step:
# Process data with filtering criteria
sceval <- run_processing_data(
scTypeEval = sceval,
ident = "celltype", # Column defining cell type identities
sample = "sample", # Column defining sample IDs
min_samples = 3, # Minimum samples required per cell type
min_cells = 10 # Minimum cells per sample-celltype combination
)The processed data is stored as data_assay objects
within the scTypeEval object, accessible for downstream
analysis.
Identify biologically relevant features for evaluating consistency.
scTypeEval supports:
# Identify HVGs using the basic method
sceval <- run_hvg(
scTypeEval = sceval,
var_method = "basic", # Method for variance modeling: "basic" or "scran"
ngenes = 1000, # Number of HVGs to retain
sample = TRUE # Perform sample-level blocking
)
# View stored gene lists
gene_lists <- sceval@gene_lists
names(gene_lists)
#> [1] "HVG"
if ("HVG" %in% names(gene_lists)) {
length(gene_lists$HVG)
}
#> [1] 1000# Identify marker genes per cell type
sceval <- run_gene_markers(
scTypeEval = sceval,
method = "scran.findMarkers", # Method for finding markers
ngenes_celltype = 50 # Maximum markers per cell type
)
# View marker genes
gene_lists <- sceval@gene_lists
if (!is.null(gene_lists$scran.findMarkers)) {
head(gene_lists$scran.findMarkers, 20)
}
#> [1] "Gene_119" "Gene_131" "Gene_125" "Gene_104" "Gene_149" "Gene_129"
#> [7] "Gene_147" "Gene_138" "Gene_133" "Gene_103" "Gene_112" "Gene_128"
#> [13] "Gene_121" "Gene_115" "Gene_148" "Gene_107" "Gene_116" "Gene_124"
#> [19] "Gene_110" "Gene_143"You can also add custom gene lists using
add_gene_list():
# Example: Add a custom list of immune response genes
custom_genes <- c("Gene_1", "Gene_50", "Gene_100", "Gene_150")
sceval <- add_gene_list(
scTypeEval = sceval,
gene_list = list("custom_genes" = custom_genes)
)These different selected features are stored within the scTypeEval object and can be used in the subsequent steps.
Consistency metrics can be computed directly on the selected features. However, computing them in a low-dimensional space (e.g., PCA) significantly speeds up the analysis while yielding similar results.
# Run PCA on processed data
sceval <- run_pca(
scTypeEval = sceval,
ndim = 20, # Number of principal components
gene_list = "HVG" # specify gene list to subset if multiple were added
)
# View available reductions
reductions <- sceval@reductions
if (length(reductions) > 0) {
names(reductions)
}
#> [1] "single-cell" "pseudobulk"Alternatively, pre-computed embeddings (PCA, UMAP, t-SNE) can be
inserted using add_dim_reduction().
The core of scTypeEval is computing pairwise
dissimilarities between cell types across samples. Multiple
dissimilarity methods are supported:
Compute distances between pseudobulk gene expression profiles:
# Euclidean distance on pseudobulk profiles
sceval <- run_dissimilarity(
sceval,
method = "Pseudobulk:Euclidean",
reduction = FALSE # Use processed expression data of selected features
)
# Cosine distance on PCA embeddings
sceval <- run_dissimilarity(
sceval,
method = "Pseudobulk:Cosine",
reduction = TRUE # Use PCA space
)Compute Wasserstein distances between single-cell distributions:
Match cells across samples using a classifier:
Evaluate inter-sample consistency for each cell type using internal
validation metrics. scTypeEval supports multiple
consistency metrics:
# Compute silhouette consistency on Euclidean dissimilarity
consistency_sil <- get_consistency(
sceval,
dissimilarity_slot = "Pseudobulk:Euclidean",
consistency_metric = "silhouette"
)
# View results
print(consistency_sil)
#> celltype measure consistency_metric dissimilarity_method ident
#> CD4.T CD4.T 0.9392103 silhouette Pseudobulk:Euclidean celltype
#> CD8.T CD8.T 0.9372690 silhouette Pseudobulk:Euclidean celltype
#> B.cells B.cells 0.9382785 silhouette Pseudobulk:Euclidean celltype
#> NK.cells NK.cells 0.9370830 silhouette Pseudobulk:Euclidean celltypeHigher values indicate better consistency (samples with the same cell type annotation are more similar to each other than to samples with different annotations).
# Compute neighborhood purity on Wasserstein dissimilarity
consistency_nbhd <- get_consistency(
sceval,
dissimilarity_slot = "WasserStein",
consistency_metric = "NeighborhoodPurity"
)
# View results
print(consistency_nbhd)
#> celltype measure consistency_metric dissimilarity_method ident
#> CD4.T CD4.T 1 NeighborhoodPurity WasserStein celltype
#> CD8.T CD8.T 1 NeighborhoodPurity WasserStein celltype
#> B.cells B.cells 1 NeighborhoodPurity WasserStein celltype
#> NK.cells NK.cells 1 NeighborhoodPurity WasserStein celltypeconsistency_all <-
get_consistency(
sceval,
dissimilarity_slot = c("Pseudobulk:Euclidean",
"WasserStein"), # allow multiple arguments
consistency_metric = c("silhouette",
"NeighborhoodPurity",
"Average_similarity") # allow multiple arguments
)
print(consistency_all)
#> celltype measure consistency_metric dissimilarity_method ident
#> CD4.T CD4.T 0.9392103 silhouette Pseudobulk:Euclidean celltype
#> CD8.T CD8.T 0.9372690 silhouette Pseudobulk:Euclidean celltype
#> B.cells B.cells 0.9382785 silhouette Pseudobulk:Euclidean celltype
#> NK.cells NK.cells 0.9370830 silhouette Pseudobulk:Euclidean celltype
#> CD4.T1 CD4.T 1.0000000 NeighborhoodPurity Pseudobulk:Euclidean celltype
#> CD8.T1 CD8.T 1.0000000 NeighborhoodPurity Pseudobulk:Euclidean celltype
#> B.cells1 B.cells 1.0000000 NeighborhoodPurity Pseudobulk:Euclidean celltype
#> NK.cells1 NK.cells 1.0000000 NeighborhoodPurity Pseudobulk:Euclidean celltype
#> CD4.T2 CD4.T 0.9427627 Average_similarity Pseudobulk:Euclidean celltype
#> CD8.T2 CD8.T 0.9410225 Average_similarity Pseudobulk:Euclidean celltype
#> B.cells2 B.cells 0.9419125 Average_similarity Pseudobulk:Euclidean celltype
#> NK.cells2 NK.cells 0.9408865 Average_similarity Pseudobulk:Euclidean celltype
#> CD4.T3 CD4.T 0.9008518 silhouette WasserStein celltype
#> CD8.T3 CD8.T 0.9009151 silhouette WasserStein celltype
#> B.cells3 B.cells 0.8998458 silhouette WasserStein celltype
#> NK.cells3 NK.cells 0.8996690 silhouette WasserStein celltype
#> CD4.T11 CD4.T 1.0000000 NeighborhoodPurity WasserStein celltype
#> CD8.T11 CD8.T 1.0000000 NeighborhoodPurity WasserStein celltype
#> B.cells11 B.cells 1.0000000 NeighborhoodPurity WasserStein celltype
#> NK.cells11 NK.cells 1.0000000 NeighborhoodPurity WasserStein celltype
#> CD4.T21 CD4.T 0.9099102 Average_similarity WasserStein celltype
#> CD8.T21 CD8.T 0.9099267 Average_similarity WasserStein celltype
#> B.cells21 B.cells 0.9090413 Average_similarity WasserStein celltype
#> NK.cells21 NK.cells 0.9089435 Average_similarity WasserStein celltypeVisualize dissimilarity matrices as annotated heatmaps. Cell types can be ordered by similarity or consistency scores:
# Plot dissimilarity heatmap with consistency-based ordering
plot_heatmap(
sceval,
dissimilarity_slot = "Pseudobulk:Euclidean",
sort_consistency = "silhouette", # Order by silhouette scores
sort_similarity = NULL # Or order by similarity
)The heatmap shows pairwise dissimilarities between samples, grouped by cell type. Cell types with higher consistency (e.g., more coherent annotations) show tighter clustering of their samples.
Cell types with low consistency scores may indicate:
sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] SingleCellExperiment_1.35.1 SummarizedExperiment_1.43.0
#> [3] Biobase_2.73.1 GenomicRanges_1.65.0
#> [5] Seqinfo_1.3.0 IRanges_2.47.2
#> [7] S4Vectors_0.51.3 BiocGenerics_0.59.6
#> [9] generics_0.1.4 MatrixGenerics_1.25.0
#> [11] matrixStats_1.5.0 Seurat_5.5.0
#> [13] SeuratObject_5.4.0 sp_2.2-1
#> [15] Matrix_1.7-5 scTypeEval_1.1.0
#> [17] BiocStyle_2.41.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 sys_3.4.3 jsonlite_2.0.0
#> [4] magrittr_2.0.5 spatstat.utils_3.2-3 farver_2.1.2
#> [7] rmarkdown_2.31 vctrs_0.7.3 ROCR_1.0-12
#> [10] spatstat.explore_3.8-1 S4Arrays_1.13.0 htmltools_0.5.9
#> [13] BiocNeighbors_2.7.2 SparseArray_1.13.2 sass_0.4.10
#> [16] sctransform_0.4.3 parallelly_1.47.0 KernSmooth_2.23-26
#> [19] bslib_0.11.0 htmlwidgets_1.6.4 ica_1.0-3
#> [22] plyr_1.8.9 plotly_4.12.0 zoo_1.8-15
#> [25] cachem_1.1.0 buildtools_1.0.0 igraph_2.3.1
#> [28] mime_0.13 lifecycle_1.0.5 pkgconfig_2.0.3
#> [31] rsvd_1.0.5 R6_2.6.1 fastmap_1.2.0
#> [34] fitdistrplus_1.2-6 future_1.70.0 shiny_1.13.0
#> [37] digest_0.6.39 patchwork_1.3.2 tensor_1.5.1
#> [40] dqrng_0.4.1 RSpectra_0.16-2 irlba_2.3.7
#> [43] beachmat_2.29.0 labeling_0.4.3 progressr_0.19.0
#> [46] spatstat.sparse_3.2-0 httr_1.4.8 polyclip_1.10-7
#> [49] abind_1.4-8 compiler_4.6.0 withr_3.0.2
#> [52] S7_0.2.2 BiocParallel_1.47.0 fastDummies_1.7.6
#> [55] MASS_7.3-65 DelayedArray_0.39.3 bluster_1.23.0
#> [58] tools_4.6.0 lmtest_0.9-40 otel_0.2.0
#> [61] httpuv_1.6.17 future.apply_1.20.2 goftest_1.2-3
#> [64] glue_1.8.1 nlme_3.1-169 promises_1.5.0
#> [67] grid_4.6.0 Rtsne_0.17 cluster_2.1.8.2
#> [70] reshape2_1.4.5 gtable_0.3.6 spatstat.data_3.1-9
#> [73] tidyr_1.3.2 data.table_1.18.4 metapod_1.21.0
#> [76] ScaledMatrix_1.21.0 BiocSingular_1.29.0 XVector_0.53.0
#> [79] spatstat.geom_3.8-1 RcppAnnoy_0.0.23 ggrepel_0.9.8
#> [82] RANN_2.6.2 pillar_1.11.1 stringr_1.6.0
#> [85] limma_3.69.1 spam_2.11-3 RcppHNSW_0.7.0
#> [88] later_1.4.8 splines_4.6.0 dplyr_1.2.1
#> [91] lattice_0.22-9 survival_3.8-6 deldir_2.0-4
#> [94] tidyselect_1.2.1 locfit_1.5-9.12 scuttle_1.23.1
#> [97] maketools_1.3.2 miniUI_0.1.2 pbapply_1.7-4
#> [100] transport_0.15-4 knitr_1.51 gridExtra_2.3
#> [103] edgeR_4.11.1 scattermore_1.2 xfun_0.57
#> [106] statmod_1.5.2 stringi_1.8.7 lazyeval_0.2.3
#> [109] yaml_2.3.12 evaluate_1.0.5 codetools_0.2-20
#> [112] tibble_3.3.1 BiocManager_1.30.27 cli_3.6.6
#> [115] uwot_0.2.4 xtable_1.8-8 reticulate_1.46.0
#> [118] jquerylib_0.1.4 Rcpp_1.1.1-1.1 globals_0.19.1
#> [121] spatstat.random_3.5-0 png_0.1-9 spatstat.univar_3.2-0
#> [124] parallel_4.6.0 ggplot2_4.0.3 SingleR_2.15.1
#> [127] dotCall64_1.2 scran_1.41.1 listenv_0.10.1
#> [130] viridisLite_0.4.3 scales_1.4.0 ggridges_0.5.7
#> [133] purrr_1.2.2 rlang_1.2.0 cowplot_1.2.0The manuscript describing these methods is currently in preparation. For questions or issues, please visit: