---
title: "scTypeEval: Evaluating Cell Type Labels Consistency in scRNA-seq"
author:
  - name: Josep Garnica
    affiliation: University of Geneva
    email: josep.garnicacaparros@unige.ch
date: "`r Sys.Date()`"
output:
  BiocStyle::html_document:
    toc: true
    toc_depth: 3
    toc_float: true
    number_sections: true
vignette: >
  %\VignetteIndexEntry{scTypeEval: Evaluating Cell Type Classifications}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5,
  warning = FALSE,
  message = FALSE
)
```

# Introduction

Accurate cell type annotation is essential but challenging in single-cell RNA
sequencing (scRNA-seq). **Manual approaches** are time-consuming, subjective,
and inconsistent, while **automated classifiers** often fail to generalize
across tissues, conditions, or closely related cell types. A key limitation for
both is the **lack of true ground truth** for benchmarking.

**`scTypeEval`** provides a **ground-truth-agnostic framework** to
systematically assess annotation quality using internal validation metrics. It
enables:

- Quantification of **inter-sample label consistency**
- Identification of **ambiguous or misclassified populations**
- Reproducible **benchmarking of manual annotations, automated classifiers, or
  clustering results**

## Key Features

- **Ground-truth agnostic** – Evaluate annotations without external references
- **Cross-dataset benchmarking** – Assess consistency across samples and studies
- **Customizable** – Works with Seurat, SingleCellExperiment, or matrices;
  supports custom gene lists
- **Robust** – Sensitive to misclassification; reliable across batch effects,
  label granularity, and sample sizes


# Quick Start

Load the package:

```{r load_packages}
library(scTypeEval)
library(Matrix)
```

## Generate Example Data

For this vignette, we'll create a synthetic dataset representing single-cell
RNA-seq data from multiple samples with distinct cell types. This simulates a
typical scenario where we want to evaluate the consistency of cell type
annotations across biological replicates.

```{r generate_data}
set.seed(22)

# Number of genes and cells
n_genes <- 5000
n_cells_per_sample <- 500
n_samples <- 6
cell_types <- c("CD4_T", "CD8_T", "B_cells", "NK_cells")

# Generate gene names
gene_names <- paste0("Gene_", seq_len(n_genes))

# Total cells
total_cells <- n_cells_per_sample * n_samples

# Generate metadata: samples and cell types
samples <- rep(paste0("Sample", seq_len(n_samples)), each = n_cells_per_sample)

# Assign cell types evenly within each sample
n_cell_types <- length(cell_types)
cells_per_type <- n_cells_per_sample %/% n_cell_types
celltypes <- rep(rep(cell_types, each = cells_per_type), times = n_samples)

# Initialize count matrix
count_matrix <- matrix(0, nrow = n_genes, ncol = total_cells)
rownames(count_matrix) <- gene_names
colnames(count_matrix) <- paste0("Cell_", seq_len(total_cells))

# Add cell-type-specific expression patterns
for (i in seq_along(cell_types)) {
  ct <- cell_types[i]
  ct_indices <- which(celltypes == ct)
  
  # Marker genes for this cell type (50 genes per type)
  marker_start <- ((i - 1) * 50 + 1)
  marker_end <- min(marker_start + 49, n_genes)
  marker_genes <- marker_start:marker_end
  
  # High expression in marker genes (lambda = 50)
  count_matrix[marker_genes, ct_indices] <- matrix(
    rpois(length(marker_genes) * length(ct_indices), lambda = 50),
    nrow = length(marker_genes)
  )
  
  # Background expression in other genes (lambda = 5)
  other_genes <- setdiff(seq_len(n_genes), marker_genes)
  count_matrix[other_genes, ct_indices] <- matrix(
    rpois(length(other_genes) * length(ct_indices), lambda = 5),
    nrow = length(other_genes)
  )
}

# Add sample-specific batch effects
for (s in seq_len(n_samples)) {
  sample_cells <- which(samples == paste0("Sample", s))
  batch_effect <- rnorm(1, mean = 1, sd = 0.1)
  count_matrix[, sample_cells] <- round(
    count_matrix[, sample_cells] * batch_effect
  )
}

# Convert to sparse matrix
count_matrix <- Matrix(count_matrix, sparse = TRUE)

# Create metadata dataframe
metadata <- data.frame(
  celltype = celltypes,
  sample = samples,
  batch = rep(
    c("Batch1", "Batch2"),
    each = n_cells_per_sample * (n_samples/2)
  ),
  row.names = colnames(count_matrix),
  stringsAsFactors = FALSE
)

# Preview the data
head(metadata)
dim(count_matrix)
```

# Core Workflow

## Step 1: Create scTypeEval Object

`scTypeEval` objects can be created from:

- A count matrix and metadata dataframe
- A Seurat object
- A SingleCellExperiment object

```{r create_object}
# Create scTypeEval object from matrix and metadata
sceval <- create_scTypeEval(
  matrix = count_matrix,
  metadata = metadata
)

```

The `scTypeEval` object stores the raw count data and metadata, which will be
processed in subsequent steps.

## Step 2: Process Data

Process and normalize the data, creating both single-cell and pseudobulk
representations. This step:

- Aggregates counts by cell type and sample (pseudobulk)
- Filters out cell types with insufficient samples or cells
- Normalizes the data for downstream analysis

```{r process_data}
# Process data with filtering criteria
sceval <- run_processing_data(
  scTypeEval = sceval,
  ident = "celltype",          # Column defining cell type identities
  sample = "sample",           # Column defining sample IDs
  min_samples = 3,             # Minimum samples required per cell type
  min_cells = 10               # Minimum cells per sample-celltype combination
)

```

The processed data is stored as `data_assay` objects within the `scTypeEval`
object, accessible for downstream analysis.

## Step 3: Extract Relevant Features

Identify biologically relevant features for evaluating consistency.
`scTypeEval` supports:

- **Highly Variable Genes (HVGs)**: Genes with high variability across the
  dataset
- **Cell Type Marker Genes**: Genes differentially expressed in each cell type
- **Custom Gene Lists**: User-defined gene sets

### Highly Variable Genes

```{r hvg}
# Identify HVGs using the basic method
sceval <- run_hvg(
  scTypeEval = sceval,
  var_method = "basic", # Method for variance modeling: "basic" or "scran"
  ngenes = 1000,        # Number of HVGs to retain
  sample = TRUE         # Perform sample-level blocking
)

# View stored gene lists
gene_lists <- sceval@gene_lists
names(gene_lists)
if ("HVG" %in% names(gene_lists)) {
  length(gene_lists$HVG)
}
```

### Cell Type Marker Genes

```{r markers}
# Identify marker genes per cell type
sceval <- run_gene_markers(
  scTypeEval = sceval,
  method = "scran.findMarkers",  # Method for finding markers
  ngenes_celltype = 50           # Maximum markers per cell type
)

# View marker genes
gene_lists <- sceval@gene_lists
if (!is.null(gene_lists$scran.findMarkers)) {
  head(gene_lists$scran.findMarkers, 20)
}
```

### Custom Gene Lists (Optional)

You can also add custom gene lists using `add_gene_list()`:

```{r custom_genes}
# Example: Add a custom list of immune response genes
custom_genes <- c("Gene_1", "Gene_50", "Gene_100", "Gene_150")

sceval <- add_gene_list(
  scTypeEval = sceval,
  gene_list = list("custom_genes" = custom_genes)
)
```

These different selected features are stored within the scTypeEval object and can be used in the subsequent steps.

```{r explore_gene_lists}
gene_lists <- sceval@gene_lists
names(gene_lists)
```


## Step 4: Dimensional Reduction (Optional but Recommended)

Consistency metrics can be computed directly on the selected features.
However, computing them in a **low-dimensional space** (e.g., PCA)
significantly speeds up the analysis while yielding similar results.

```{r pca}
# Run PCA on processed data
sceval <- run_pca(
  scTypeEval = sceval,
  ndim = 20,         # Number of principal components
  gene_list = "HVG" # specify gene list to subset if multiple were added
)

# View available reductions
reductions <- sceval@reductions
if (length(reductions) > 0) {
  names(reductions)
}
```

Alternatively, pre-computed embeddings (PCA, UMAP, t-SNE) can be inserted using
`add_dim_reduction()`.

## Step 5: Compute Dissimilarity Matrices

The core of `scTypeEval` is computing pairwise dissimilarities between cell
types across samples. Multiple dissimilarity methods are supported:

### Pseudobulk-based Distances

Compute distances between pseudobulk gene expression profiles:

```{r dissim_pseudobulk}
# Euclidean distance on pseudobulk profiles
sceval <- run_dissimilarity(
  sceval,
  method = "Pseudobulk:Euclidean",
  reduction = FALSE # Use processed expression data of selected features
)

# Cosine distance on PCA embeddings
sceval <- run_dissimilarity(
  sceval,
  method = "Pseudobulk:Cosine",
  reduction = TRUE               # Use PCA space
)
```

### Wasserstein Distance

Compute Wasserstein distances between single-cell distributions:

```{r dissim_wasserstein}
# Wasserstein distance on PCA embeddings (faster)
sceval <- run_dissimilarity(
  sceval,
  method = "WasserStein",
  reduction = TRUE
)
```

### Reciprocal Classification

Match cells across samples using a classifier:

```{r dissim_reciprocal}
# Reciprocal classification with SingleR
sceval <- run_dissimilarity(
  sceval,
  method = "recip_classif:Match",
  reciprocal_classifier = "SingleR",
  # usage of dimensional reduction not supported
  # for this dissimilarity approach
  reduction = FALSE 
)
```

### View Available Dissimilarity Matrices

```{r view_dissim}
# List all computed dissimilarity matrices
dissim <- sceval@dissimilarity
if (length(dissim) > 0) {
  names(dissim)
}
```

## Step 6: Compute Consistency Metrics

Evaluate inter-sample consistency for each cell type using internal validation
metrics. `scTypeEval` supports multiple consistency metrics:

- **silhouette**: Standard cohesion/separation score
- **2label_silhouette**: Silhouette comparing "own type" vs. all others
- **NeighborhoodPurity**: Fraction of K-nearest neighbors with same label
- **ward_PropMatch**: Proportion in dominant Ward cluster
- **Orbital_medoid**: Fraction closer to own medoid than others
- **Average_similarity**: Within-type similarity vs. between-type

### Compute Silhouette Scores

```{r consistency_silhouette}
# Compute silhouette consistency on Euclidean dissimilarity
consistency_sil <- get_consistency(
  sceval,
  dissimilarity_slot = "Pseudobulk:Euclidean",
  consistency_metric = "silhouette"
)

# View results
print(consistency_sil)
```

Higher values indicate better consistency (samples with the same cell type
annotation are more similar to each other than to samples with different
annotations).

### Compute Neighborhood Purity

```{r consistency_neighborhood}
# Compute neighborhood purity on Wasserstein dissimilarity
consistency_nbhd <- get_consistency(
  sceval,
  dissimilarity_slot = "WasserStein",
  consistency_metric = "NeighborhoodPurity"
)

# View results
print(consistency_nbhd)
```

### Compare Multiple Metrics

```{r compare_metrics}
consistency_all <- 
  get_consistency(
    sceval,
    dissimilarity_slot = c("Pseudobulk:Euclidean",
                           "WasserStein"), # allow multiple arguments
    consistency_metric = c("silhouette",
                           "NeighborhoodPurity",
                           "Average_similarity") # allow multiple arguments
  )

print(consistency_all)
```

# Visualization

## Dissimilarity Heatmap

Visualize dissimilarity matrices as annotated heatmaps. Cell types can be
ordered by similarity or consistency scores:

```{r plot_heatmap, fig.width=8, fig.height=7}
# Plot dissimilarity heatmap with consistency-based ordering
plot_heatmap(
  sceval,
  dissimilarity_slot = "Pseudobulk:Euclidean",
  sort_consistency = "silhouette",      # Order by silhouette scores
  sort_similarity = NULL                 # Or order by similarity
)
```

The heatmap shows pairwise dissimilarities between samples, grouped by cell
type. Cell types with higher consistency (e.g., more coherent annotations)
show tighter clustering of their samples.

## Pseudobulk PCA

Pseudobulk PCA per sample & cell type

```{r pca_pseudobulk}
plot_pca(
  sceval,
  reduction_slot = "pseudobulk"
)
```


# Interpretation Guidelines

## Identifying Problematic Annotations

Cell types with low consistency scores may indicate:

1. **Ambiguous cell type boundaries**: Related cell types (e.g., activated vs.
  resting T cells) may overlap
2. **Heterogeneous populations**: A single label covering multiple biological
  states
3. **Annotation errors**: Inconsistent labeling across samples

## Recommendations

- Investigate cell types with low consistency using the heatmap visualization
- Consider refining annotations for cell types with negative consistency scores
- Use biological knowledge to interpret whether low consistency reflects true
  biology or annotation issues

# Session Information

```{r session_info}
sessionInfo()
```

# References

The manuscript describing these methods is currently in preparation. For
questions or issues, please visit:

- GitHub: https://github.com/carmonalab/scTypeEval
- Bug reports: https://github.com/carmonalab/scTypeEval/issues