--- title: "scTypeEval: Evaluating Cell Type Labels Consistency in scRNA-seq" author: - name: Josep Garnica affiliation: University of Geneva email: josep.garnicacaparros@unige.ch date: "`r Sys.Date()`" output: BiocStyle::html_document: toc: true toc_depth: 3 toc_float: true number_sections: true vignette: > %\VignetteIndexEntry{scTypeEval: Evaluating Cell Type Classifications} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r setup, include=FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 5, warning = FALSE, message = FALSE ) ``` # Introduction Accurate cell type annotation is essential but challenging in single-cell RNA sequencing (scRNA-seq). **Manual approaches** are time-consuming, subjective, and inconsistent, while **automated classifiers** often fail to generalize across tissues, conditions, or closely related cell types. A key limitation for both is the **lack of true ground truth** for benchmarking. **`scTypeEval`** provides a **ground-truth-agnostic framework** to systematically assess annotation quality using internal validation metrics. It enables: - Quantification of **inter-sample label consistency** - Identification of **ambiguous or misclassified populations** - Reproducible **benchmarking of manual annotations, automated classifiers, or clustering results** ## Key Features - **Ground-truth agnostic** – Evaluate annotations without external references - **Cross-dataset benchmarking** – Assess consistency across samples and studies - **Customizable** – Works with Seurat, SingleCellExperiment, or matrices; supports custom gene lists - **Robust** – Sensitive to misclassification; reliable across batch effects, label granularity, and sample sizes # Quick Start Load the package: ```{r load_packages} library(scTypeEval) library(Matrix) ``` ## Generate Example Data For this vignette, we'll create a synthetic dataset representing single-cell RNA-seq data from multiple samples with distinct cell types. This simulates a typical scenario where we want to evaluate the consistency of cell type annotations across biological replicates. ```{r generate_data} set.seed(22) # Number of genes and cells n_genes <- 5000 n_cells_per_sample <- 500 n_samples <- 6 cell_types <- c("CD4_T", "CD8_T", "B_cells", "NK_cells") # Generate gene names gene_names <- paste0("Gene_", seq_len(n_genes)) # Total cells total_cells <- n_cells_per_sample * n_samples # Generate metadata: samples and cell types samples <- rep(paste0("Sample", seq_len(n_samples)), each = n_cells_per_sample) # Assign cell types evenly within each sample n_cell_types <- length(cell_types) cells_per_type <- n_cells_per_sample %/% n_cell_types celltypes <- rep(rep(cell_types, each = cells_per_type), times = n_samples) # Initialize count matrix count_matrix <- matrix(0, nrow = n_genes, ncol = total_cells) rownames(count_matrix) <- gene_names colnames(count_matrix) <- paste0("Cell_", seq_len(total_cells)) # Add cell-type-specific expression patterns for (i in seq_along(cell_types)) { ct <- cell_types[i] ct_indices <- which(celltypes == ct) # Marker genes for this cell type (50 genes per type) marker_start <- ((i - 1) * 50 + 1) marker_end <- min(marker_start + 49, n_genes) marker_genes <- marker_start:marker_end # High expression in marker genes (lambda = 50) count_matrix[marker_genes, ct_indices] <- matrix( rpois(length(marker_genes) * length(ct_indices), lambda = 50), nrow = length(marker_genes) ) # Background expression in other genes (lambda = 5) other_genes <- setdiff(seq_len(n_genes), marker_genes) count_matrix[other_genes, ct_indices] <- matrix( rpois(length(other_genes) * length(ct_indices), lambda = 5), nrow = length(other_genes) ) } # Add sample-specific batch effects for (s in seq_len(n_samples)) { sample_cells <- which(samples == paste0("Sample", s)) batch_effect <- rnorm(1, mean = 1, sd = 0.1) count_matrix[, sample_cells] <- round( count_matrix[, sample_cells] * batch_effect ) } # Convert to sparse matrix count_matrix <- Matrix(count_matrix, sparse = TRUE) # Create metadata dataframe metadata <- data.frame( celltype = celltypes, sample = samples, batch = rep( c("Batch1", "Batch2"), each = n_cells_per_sample * (n_samples/2) ), row.names = colnames(count_matrix), stringsAsFactors = FALSE ) # Preview the data head(metadata) dim(count_matrix) ``` # Core Workflow ## Step 1: Create scTypeEval Object `scTypeEval` objects can be created from: - A count matrix and metadata dataframe - A Seurat object - A SingleCellExperiment object ```{r create_object} # Create scTypeEval object from matrix and metadata sceval <- create_scTypeEval( matrix = count_matrix, metadata = metadata ) ``` The `scTypeEval` object stores the raw count data and metadata, which will be processed in subsequent steps. ## Step 2: Process Data Process and normalize the data, creating both single-cell and pseudobulk representations. This step: - Aggregates counts by cell type and sample (pseudobulk) - Filters out cell types with insufficient samples or cells - Normalizes the data for downstream analysis ```{r process_data} # Process data with filtering criteria sceval <- run_processing_data( scTypeEval = sceval, ident = "celltype", # Column defining cell type identities sample = "sample", # Column defining sample IDs min_samples = 3, # Minimum samples required per cell type min_cells = 10 # Minimum cells per sample-celltype combination ) ``` The processed data is stored as `data_assay` objects within the `scTypeEval` object, accessible for downstream analysis. ## Step 3: Extract Relevant Features Identify biologically relevant features for evaluating consistency. `scTypeEval` supports: - **Highly Variable Genes (HVGs)**: Genes with high variability across the dataset - **Cell Type Marker Genes**: Genes differentially expressed in each cell type - **Custom Gene Lists**: User-defined gene sets ### Highly Variable Genes ```{r hvg} # Identify HVGs using the basic method sceval <- run_hvg( scTypeEval = sceval, var_method = "basic", # Method for variance modeling: "basic" or "scran" ngenes = 1000, # Number of HVGs to retain sample = TRUE # Perform sample-level blocking ) # View stored gene lists gene_lists <- sceval@gene_lists names(gene_lists) if ("HVG" %in% names(gene_lists)) { length(gene_lists$HVG) } ``` ### Cell Type Marker Genes ```{r markers} # Identify marker genes per cell type sceval <- run_gene_markers( scTypeEval = sceval, method = "scran.findMarkers", # Method for finding markers ngenes_celltype = 50 # Maximum markers per cell type ) # View marker genes gene_lists <- sceval@gene_lists if (!is.null(gene_lists$scran.findMarkers)) { head(gene_lists$scran.findMarkers, 20) } ``` ### Custom Gene Lists (Optional) You can also add custom gene lists using `add_gene_list()`: ```{r custom_genes} # Example: Add a custom list of immune response genes custom_genes <- c("Gene_1", "Gene_50", "Gene_100", "Gene_150") sceval <- add_gene_list( scTypeEval = sceval, gene_list = list("custom_genes" = custom_genes) ) ``` These different selected features are stored within the scTypeEval object and can be used in the subsequent steps. ```{r explore_gene_lists} gene_lists <- sceval@gene_lists names(gene_lists) ``` ## Step 4: Dimensional Reduction (Optional but Recommended) Consistency metrics can be computed directly on the selected features. However, computing them in a **low-dimensional space** (e.g., PCA) significantly speeds up the analysis while yielding similar results. ```{r pca} # Run PCA on processed data sceval <- run_pca( scTypeEval = sceval, ndim = 20, # Number of principal components gene_list = "HVG" # specify gene list to subset if multiple were added ) # View available reductions reductions <- sceval@reductions if (length(reductions) > 0) { names(reductions) } ``` Alternatively, pre-computed embeddings (PCA, UMAP, t-SNE) can be inserted using `add_dim_reduction()`. ## Step 5: Compute Dissimilarity Matrices The core of `scTypeEval` is computing pairwise dissimilarities between cell types across samples. Multiple dissimilarity methods are supported: ### Pseudobulk-based Distances Compute distances between pseudobulk gene expression profiles: ```{r dissim_pseudobulk} # Euclidean distance on pseudobulk profiles sceval <- run_dissimilarity( sceval, method = "Pseudobulk:Euclidean", reduction = FALSE # Use processed expression data of selected features ) # Cosine distance on PCA embeddings sceval <- run_dissimilarity( sceval, method = "Pseudobulk:Cosine", reduction = TRUE # Use PCA space ) ``` ### Wasserstein Distance Compute Wasserstein distances between single-cell distributions: ```{r dissim_wasserstein} # Wasserstein distance on PCA embeddings (faster) sceval <- run_dissimilarity( sceval, method = "WasserStein", reduction = TRUE ) ``` ### Reciprocal Classification Match cells across samples using a classifier: ```{r dissim_reciprocal} # Reciprocal classification with SingleR sceval <- run_dissimilarity( sceval, method = "recip_classif:Match", reciprocal_classifier = "SingleR", # usage of dimensional reduction not supported # for this dissimilarity approach reduction = FALSE ) ``` ### View Available Dissimilarity Matrices ```{r view_dissim} # List all computed dissimilarity matrices dissim <- sceval@dissimilarity if (length(dissim) > 0) { names(dissim) } ``` ## Step 6: Compute Consistency Metrics Evaluate inter-sample consistency for each cell type using internal validation metrics. `scTypeEval` supports multiple consistency metrics: - **silhouette**: Standard cohesion/separation score - **2label_silhouette**: Silhouette comparing "own type" vs. all others - **NeighborhoodPurity**: Fraction of K-nearest neighbors with same label - **ward_PropMatch**: Proportion in dominant Ward cluster - **Orbital_medoid**: Fraction closer to own medoid than others - **Average_similarity**: Within-type similarity vs. between-type ### Compute Silhouette Scores ```{r consistency_silhouette} # Compute silhouette consistency on Euclidean dissimilarity consistency_sil <- get_consistency( sceval, dissimilarity_slot = "Pseudobulk:Euclidean", consistency_metric = "silhouette" ) # View results print(consistency_sil) ``` Higher values indicate better consistency (samples with the same cell type annotation are more similar to each other than to samples with different annotations). ### Compute Neighborhood Purity ```{r consistency_neighborhood} # Compute neighborhood purity on Wasserstein dissimilarity consistency_nbhd <- get_consistency( sceval, dissimilarity_slot = "WasserStein", consistency_metric = "NeighborhoodPurity" ) # View results print(consistency_nbhd) ``` ### Compare Multiple Metrics ```{r compare_metrics} consistency_all <- get_consistency( sceval, dissimilarity_slot = c("Pseudobulk:Euclidean", "WasserStein"), # allow multiple arguments consistency_metric = c("silhouette", "NeighborhoodPurity", "Average_similarity") # allow multiple arguments ) print(consistency_all) ``` # Visualization ## Dissimilarity Heatmap Visualize dissimilarity matrices as annotated heatmaps. Cell types can be ordered by similarity or consistency scores: ```{r plot_heatmap, fig.width=8, fig.height=7} # Plot dissimilarity heatmap with consistency-based ordering plot_heatmap( sceval, dissimilarity_slot = "Pseudobulk:Euclidean", sort_consistency = "silhouette", # Order by silhouette scores sort_similarity = NULL # Or order by similarity ) ``` The heatmap shows pairwise dissimilarities between samples, grouped by cell type. Cell types with higher consistency (e.g., more coherent annotations) show tighter clustering of their samples. ## Pseudobulk PCA Pseudobulk PCA per sample & cell type ```{r pca_pseudobulk} plot_pca( sceval, reduction_slot = "pseudobulk" ) ``` # Interpretation Guidelines ## Identifying Problematic Annotations Cell types with low consistency scores may indicate: 1. **Ambiguous cell type boundaries**: Related cell types (e.g., activated vs. resting T cells) may overlap 2. **Heterogeneous populations**: A single label covering multiple biological states 3. **Annotation errors**: Inconsistent labeling across samples ## Recommendations - Investigate cell types with low consistency using the heatmap visualization - Consider refining annotations for cell types with negative consistency scores - Use biological knowledge to interpret whether low consistency reflects true biology or annotation issues # Session Information ```{r session_info} sessionInfo() ``` # References The manuscript describing these methods is currently in preparation. For questions or issues, please visit: - GitHub: https://github.com/carmonalab/scTypeEval - Bug reports: https://github.com/carmonalab/scTypeEval/issues